[Multi-vector]Add timing instrumentation and multi-dataset support for multi-vector… (#161)

* Add timing instrumentation and multi-dataset support for multi-vector retrieval

- Add timing measurements for search operations (load and core time)
- Increase embedding batch size from 1 to 32 for better performance
- Add explicit memory cleanup with del all_embeddings
- Support loading and merging multiple datasets with different splits
- Add CLI arguments for search method selection (ann/exact/exact-all)
- Auto-detect image field names across different dataset structures
- Print candidate doc counts for performance monitoring

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* update vidore

* reproduce docvqa results

* reproduce docvqa results and add debug file

---------

Co-authored-by: Claude <noreply@anthropic.com>
This commit is contained in:
Yichuan Wang
2025-12-03 00:55:42 -08:00
committed by GitHub
parent e268392d5b
commit 00770aebbb
6 changed files with 2049 additions and 61 deletions

3
.gitignore vendored
View File

@@ -91,7 +91,8 @@ packages/leann-backend-diskann/third_party/DiskANN/_deps/
*.meta.json
*.passages.json
*.npy
*.db
batchtest.py
tests/__pytest_cache__/
tests/__pycache__/