[Multi-vector] Add timing instrumentation and multi-dataset support for multi-vector… (#161)

* Add timing instrumentation and multi-dataset support for multi-vector retrieval

- Add timing measurements for search operations (load and core time)
- Increase embedding batch size from 1 to 32 for better performance
- Add explicit memory cleanup with del all_embeddings
- Support loading and merging multiple datasets with different splits
- Add CLI arguments for search method selection (ann/exact/exact-all)
- Auto-detect image field names across different dataset structures
- Print candidate doc counts for performance monitoring
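
To illustrate three of the items above (search-method selection, image field auto-detection, and timing), here is a minimal sketch. It is not the code from this commit: the flag name `--search-method`, the helper names, and the candidate field names are all hypothetical. Only the three method values (`ann`, `exact`, `exact-all`) come from the commit message.

```python
import argparse
import time

# Hypothetical candidate names; the commit does not list the actual fields.
IMAGE_FIELD_CANDIDATES = ("image", "page_image", "img")


def detect_image_field(column_names):
    """Return the first candidate field present in column_names, or None."""
    for name in IMAGE_FIELD_CANDIDATES:
        if name in column_names:
            return name
    return None


def build_parser():
    """Build a parser with a search-method option (flag name is an assumption)."""
    parser = argparse.ArgumentParser(description="multi-vector retrieval sketch")
    parser.add_argument(
        "--search-method",
        choices=["ann", "exact", "exact-all"],
        default="ann",
        help="which search path to run",
    )
    return parser


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Wrapping the index load and the core search in separate `timed(...)` calls would give the two measurements the commit message mentions (load time and core time) without touching the search code itself.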

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Update ViDoRe

* Reproduce DocVQA results

* Reproduce DocVQA results and add debug file

---------

Co-authored-by: Claude <noreply@anthropic.com>

Yichuan Wang, 2025-12-03 00:55:42 -08:00, committed by GitHub

parent e268392d5b

commit 00770aebbb

6 changed files with 2049 additions and 61 deletions

.gitignore

+2 -1

@@ -91,7 +91,8 @@ packages/leann-backend-diskann/third_party/DiskANN/_deps/
 *.meta.json
 *.passages.json
 *.npy
 *.db
 batchtest.py
 tests/__pytest_cache__/
 tests/__pycache__/