Compare commits

3 Commits

Author SHA1 Message Date
Yichuan Wang
af5cff01db Revert "[Multi-vector]Add timing instrumentation and multi-dataset support fo…"
This reverts commit 00770aebbb.
2025-12-03 01:09:08 -08:00
Yichuan Wang
00770aebbb [Multi-vector]Add timing instrumentation and multi-dataset support for multi-vector… (#161)
* Add timing instrumentation and multi-dataset support for multi-vector retrieval

- Add timing measurements for search operations (load and core time)
- Increase embedding batch size from 1 to 32 for better performance
- Add explicit memory cleanup with del all_embeddings
- Support loading and merging multiple datasets with different splits
- Add CLI arguments for search method selection (ann/exact/exact-all)
- Auto-detect image field names across different dataset structures
- Print candidate doc counts for performance monitoring

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* update vidore

* reproduce docvqa results

* reproduce docvqa results and add debug file

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-12-03 00:55:42 -08:00
Aakash Suresh
e268392d5b Fix: Prevent duplicate PDF processing when using --file-types .pdf (#179)
Fixes #175

Problem:
When --file-types .pdf is specified, PDFs were being processed twice:
1. Separately with PyMuPDF/pdfplumber extractors
2. Again in the 'other file types' section via SimpleDirectoryReader

This caused duplicate processing and potential conflicts.

Solution:
- Exclude .pdf from other_file_extensions when PDFs are already
  processed separately
- Only load other file types if there are extensions to process
- Prevents duplicate PDF processing

Changes:
- Added logic to filter out .pdf from code_extensions when loading
  other file types if PDFs were processed separately
- Updated SimpleDirectoryReader to use filtered extensions
- Added check to skip loading if no other extensions to process
2025-12-01 13:48:44 -08:00

@@ -1180,6 +1180,11 @@ Examples:
                 print(f"Warning: Could not process {file_path}: {e}")
     # Load other file types with default reader
+    # Exclude PDFs from code_extensions if they were already processed separately
+    other_file_extensions = code_extensions
+    if should_process_pdfs and ".pdf" in code_extensions:
+        other_file_extensions = [ext for ext in code_extensions if ext != ".pdf"]
     try:
         # Create a custom file filter function using our PathSpec
         def file_filter(
@@ -1195,15 +1200,19 @@ Examples:
             except (ValueError, OSError):
                 return True  # Include files that can't be processed
-        other_docs = SimpleDirectoryReader(
-            docs_dir,
-            recursive=True,
-            encoding="utf-8",
-            required_exts=code_extensions,
-            file_extractor={},  # Use default extractors
-            exclude_hidden=not include_hidden,
-            filename_as_id=True,
-        ).load_data(show_progress=True)
+        # Only load other file types if there are extensions to process
+        if other_file_extensions:
+            other_docs = SimpleDirectoryReader(
+                docs_dir,
+                recursive=True,
+                encoding="utf-8",
+                required_exts=other_file_extensions,
+                file_extractor={},  # Use default extractors
+                exclude_hidden=not include_hidden,
+                filename_as_id=True,
+            ).load_data(show_progress=True)
+        else:
+            other_docs = []
         # Filter documents after loading based on gitignore rules
         filtered_docs = []
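
The two hunks above can be exercised in isolation with a minimal sketch like the one below. The helper names `select_other_extensions` and `load_other_docs`, and the `loader` callable that stands in for `SimpleDirectoryReader(...).load_data`, are hypothetical and do not appear in the repository. Only `code_extensions`, `should_process_pdfs`, and the filtering expression come from the diff.

```python
from typing import Callable, List, Sequence


def select_other_extensions(
    code_extensions: Sequence[str], should_process_pdfs: bool
) -> List[str]:
    """Return the extensions left for the default reader.

    Mirrors the first hunk: drop ".pdf" only when PDFs were already
    handled by the dedicated PDF extractors.
    """
    if should_process_pdfs and ".pdf" in code_extensions:
        return [ext for ext in code_extensions if ext != ".pdf"]
    return list(code_extensions)


def load_other_docs(
    extensions: Sequence[str],
    loader: Callable[[Sequence[str]], list],
) -> list:
    """Mirror the second hunk: skip the reader when nothing is left to load.

    `loader` is a hypothetical stand-in for the real directory reader call.
    """
    if not extensions:
        return []
    return loader(extensions)


if __name__ == "__main__":
    # Hypothetical loader that just reports which extensions it was asked for.
    def fake_loader(exts: Sequence[str]) -> list:
        return [f"doc{ext}" for ext in exts]

    remaining = select_other_extensions([".pdf", ".py", ".md"], should_process_pdfs=True)
    print(remaining)
    print(load_other_docs(remaining, fake_loader))
    # With only ".pdf" requested, nothing is left and the loader is never called.
    print(load_other_docs(select_other_extensions([".pdf"], True), fake_loader))
```

Keeping the selection in its own function makes the `--file-types .pdf` case from the commit message easy to check: the filtered list is empty, so the reader call is skipped rather than invoked with no extensions.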