docs: data updated

This commit is contained in:
Andy Lee
2025-09-15 19:50:02 -07:00
parent d7011bbea0
commit a0d6857faa
9 changed files with 749 additions and 133 deletions

View File

@@ -1,18 +1,19 @@
# Enron Emails Benchmark
A retrieval-only benchmark for evaluating LEANN search on the Enron email corpus. It mirrors the structure and CLI of the existing FinanceBench and LAION benches, using stage-based evaluation focused on Recall@3.
A comprehensive RAG benchmark for evaluating LEANN search and generation on the Enron email corpus. It mirrors the structure and CLI of the existing FinanceBench and LAION benches, using stage-based evaluation with Recall@3 and generation timing.
- Dataset: Enron email CSV (e.g., Kaggle wcukierski/enron-email-dataset) for passages
- Queries: corbt/enron_emails_sample_questions (filtered for realistic questions)
- Metric: Recall@3 vs FAISS Flat baseline
- Metrics: Recall@3 vs FAISS Flat baseline + Generation evaluation with Qwen3-8B
## Layout
benchmarks/enron_emails/
- setup_enron_emails.py: Prepare passages, build LEANN index, build FAISS baseline
- evaluate_enron_emails.py: Evaluate retrieval recall (Stage 2)
- evaluate_enron_emails.py: Evaluate retrieval recall (Stages 2-5) + generation with Qwen3-8B
- data/: Generated passages, queries, embeddings-related files
- baseline/: FAISS Flat baseline files
- llm_utils.py: LLM utilities for Qwen3-8B generation (in parent directory)
## Quickstart
@@ -41,23 +42,33 @@ Stage 3 uses binary search over complexity to find the minimal value achieving t
4) Index comparison (Stage 4)
python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 4 --max-queries 100 --output results.json
python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 4 --complexity 88 --max-queries 100 --output results.json
5) Generation evaluation (Stage 5)
python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 5 --complexity 88 --llm-backend hf --model-name Qwen/Qwen3-8B
6) Combined index + generation evaluation (Stages 4+5, recommended)
python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 45 --complexity 88 --llm-backend hf
Notes:
- Minimal CLI: you can run from repo root with only `--index`, defaults match financebench/laion patterns:
- `--stage` defaults to `all` (runs 2, 3, 4)
- `--stage` defaults to `all` (runs 2, 3, 4, 5)
- `--baseline-dir` defaults to `baseline`
- `--queries` defaults to `data/evaluation_queries.jsonl` (or falls back to the index directory)
- `--llm-backend` defaults to `hf` (HuggingFace), can use `vllm`
- `--model-name` defaults to `Qwen/Qwen3-8B`
- Fail-fast behavior: no silent fallbacks. If compact index cannot run with recompute, it errors out.
4) Index comparison (Stage 4)
python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 4 --max-queries 100 --output results.json
- Stage 5 requires Stage 4 retrieval results. Use `--stage 45` to run both efficiently.
Optional flags:
- --queries data/evaluation_queries.jsonl (custom queries file)
- --baseline-dir baseline (where FAISS baseline lives)
- --complexity 64 (LEANN complexity parameter)
- --complexity 88 (LEANN complexity parameter, optimal for 90% recall)
- --llm-backend hf|vllm (LLM backend for generation)
- --model-name Qwen/Qwen3-8B (LLM model for generation)
- --max-queries 1000 (limit number of queries for evaluation)
## Files Produced
- data/enron_passages_preview.jsonl: Small preview of passages used (for inspection)
@@ -66,8 +77,9 @@ Optional flags:
- data/evaluation_queries.jsonl: Query file (id + query; includes GT IDs for reference)
## Notes
- We only evaluate retrieval Recall@3 (no generation). This matches the other benches style and stage flow.
- Evaluates both retrieval Recall@3 and generation timing with Qwen3-8B thinking model.
- The emails CSV must contain a column named "message" (raw RFC822 email) and a column named "file" for source identifier. Message-ID headers are parsed as canonical message IDs when present.
- Qwen3-8B requires special handling for thinking models with chat templates and <think></think> tag processing.
## Stages Summary
@@ -80,16 +92,23 @@ Optional flags:
- Stage 4 (Index Comparison):
- Reports .index-only sizes for compact vs non-compact.
- Measures timings on 100 queries by default: non-compact (no recompute) vs compact (with recompute).
- Measures timings on queries by default: non-compact (no recompute) vs compact (with recompute).
- Stores retrieval results for Stage 5 generation evaluation.
- Fails fast if compact recompute cannot run.
- If `--complexity` is not provided, the script tries to use the best complexity from Stage 3:
- First from the current run (when running `--stage all`), otherwise
- From `enron_stage3_results.json` saved next to the index during the last Stage 3 run.
- If neither exists, Stage 4 will error and ask you to run Stage 3 or pass `--complexity`.
- Stage 5 (Generation Evaluation):
- Uses Qwen3-8B thinking model for RAG generation on retrieved documents from Stage 4.
- Supports HuggingFace (`hf`) and vLLM (`vllm`) backends.
- Measures generation timing separately from search timing.
- Requires Stage 4 results (no additional searching performed).
## Example Results
These are sample results obtained on a subset of Enron data using all-mpnet-base-v2.
These are sample results obtained on Enron data using all-mpnet-base-v2 and Qwen3-8B.
- Stage 3 (Binary Search):
- Minimal complexity achieving 90% Recall@3: 88
@@ -103,14 +122,20 @@ These are sample results obtained on a subset of Enron data using all-mpnet-base
- C=256 → 92.0% Recall@3
- Stage 4 (Index Sizes, .index only):
- Compact: ~2.17 MB
- Non-compact: ~82.03 MB
- Storage saving by compact: ~97.35%
- Compact: ~2.2 MB
- Non-compact: ~82.0 MB
- Storage saving by compact: ~97.3%
- Stage 4 (Timing, 100 queries, complexity=88):
- Non-compact (no recompute): ~0.0074 s avg per query
- Compact (with recompute): ~1.947 s avg per query
- Stage 4 (Search Timing, 988 queries, complexity=88):
- Non-compact (no recompute): ~0.0075 s avg per query
- Compact (with recompute): ~1.981 s avg per query
- Speed ratio (non-compact/compact): ~0.0038x
Full JSON output for Stage 4 is saved by the script (see `--output`), e.g.:
`benchmarks/enron_emails/results_enron_stage4.json`.
- Stage 5 (RAG Generation, 988 queries, Qwen3-8B):
- Average generation time: ~22.302 s per query
- Total queries processed: 988
- LLM backend: HuggingFace transformers
- Model: Qwen/Qwen3-8B (thinking model with <think></think> processing)
Full JSON output is saved by the script (see `--output`), e.g.:
`benchmarks/enron_emails/results_enron_stage45.json`.