LEANN

Author SHA1 Message Date


Yichuan Wang

af5cff01db

Revert "[Multi-vector]Add timing instrumentation and multi-dataset support fo…"

This reverts commit 00770aebbb.

2025-12-03 01:09:08 -08:00

Yichuan Wang

00770aebbb

[Multi-vector]Add timing instrumentation and multi-dataset support for multi-vector… (#161)

* Add timing instrumentation and multi-dataset support for multi-vector retrieval

- Add timing measurements for search operations (load and core time)
- Increase embedding batch size from 1 to 32 for better performance
- Add explicit memory cleanup with del all_embeddings
- Support loading and merging multiple datasets with different splits
- Add CLI arguments for search method selection (ann/exact/exact-all)
- Auto-detect image field names across different dataset structures
- Print candidate doc counts for performance monitoring

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* update vidore

* reproduce docvqa results

* reproduce docvqa results and add debug file

---------

Co-authored-by: Claude <noreply@anthropic.com>

2025-12-03 00:55:42 -08:00
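
A few of the changes listed in this commit message (search-method selection, batching in groups of 32, and separate load/core timings) can be illustrated with a minimal sketch. This is not the code from the commit: the flag name `--search-method` and the helpers `timed`, `batched`, and `build_parser` are hypothetical names chosen for illustration; only the three choices `ann`, `exact`, `exact-all` and the batch size of 32 come from the message.

```python
import argparse
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Store the wall-clock seconds spent in the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def batched(items, batch_size=32):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def build_parser():
    """Build a CLI parser; the flag name is an assumption, the choices are from the commit."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--search-method",
        choices=["ann", "exact", "exact-all"],
        default="ann",
    )
    return parser
```

With these helpers, a search could be wrapped as `with timed("load", timings): ...` followed by `with timed("core", timings): ...`, so that the two phases are reported separately.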

Aakash Suresh

e268392d5b

Fix: Prevent duplicate PDF processing when using --file-types .pdf (#179)

Fixes #175

Problem:
When --file-types .pdf is specified, PDFs were being processed twice:
1. Separately with PyMuPDF/pdfplumber extractors
2. Again in the 'other file types' section via SimpleDirectoryReader

This caused duplicate processing and potential conflicts.

Solution:
- Exclude .pdf from other_file_extensions when PDFs are already
  processed separately
- Only load other file types if there are extensions to process
- Prevents duplicate PDF processing

Changes:
- Added logic to filter out .pdf from code_extensions when loading
  other file types if PDFs were processed separately
- Updated SimpleDirectoryReader to use filtered extensions
- Added check to skip loading if no other extensions to process

2025-12-01 13:48:44 -08:00
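
The filtering described in the commit message above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the commit: the function name `extensions_for_generic_loader` and its parameters are hypothetical, and the actual call to the directory reader is left out.

```python
def extensions_for_generic_loader(requested_extensions, pdfs_processed_separately):
    """Return the extensions the generic directory reader should still load.

    If PDFs were already handled by a dedicated extractor, drop ".pdf" so the
    same files are not read a second time.
    """
    if not pdfs_processed_separately:
        return list(requested_extensions)
    return [ext for ext in requested_extensions if ext.lower() != ".pdf"]


def should_run_generic_loader(extensions):
    """Skip the generic loader entirely when nothing is left to load."""
    return len(extensions) > 0
```

For example, with `--file-types .pdf` and PDFs already extracted separately, the filtered list is empty and the generic loader is skipped. With `.pdf` and `.md` requested, only `.md` is passed on.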

Compare commits

3 Commits

fix/pdf-duplicate-processing-175-clean .. revert-161-feat/multi-vector-timing-and-dataset-improvements
