Enron Emails Benchmark

A comprehensive RAG benchmark for evaluating LEANN search and generation on the Enron email corpus. It mirrors the structure and CLI of the existing FinanceBench and LAION benches, using stage-based evaluation with Recall@3 and generation timing.

  • Dataset: Enron email CSV (e.g., Kaggle wcukierski/enron-email-dataset) for passages
  • Queries: corbt/enron_emails_sample_questions (filtered for realistic questions)
  • Metrics: Recall@3 vs FAISS Flat baseline + Generation evaluation with Qwen3-8B
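
As a rough illustration of the retrieval metric, the sketch below computes Recall@3 as the overlap between an approximate top-3 and the exact top-3 from a flat (brute-force) baseline. This README does not show the script's own implementation, so the function names and the passage IDs here are assumptions made for the example.

```python
def recall_at_k(approx_ids, exact_ids, k=3):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / len(exact_top)


def mean_recall_at_k(approx_results, exact_results, k=3):
    """Average recall@k over paired per-query result lists."""
    scores = [recall_at_k(a, e, k) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical passage IDs for two queries (not taken from the benchmark).
leann_hits = [["p1", "p7", "p3"], ["p2", "p9", "p4"]]
faiss_hits = [["p1", "p3", "p5"], ["p2", "p4", "p9"]]

# Query 1 recovers 2 of 3 exact hits, query 2 recovers all 3.
print(mean_recall_at_k(leann_hits, faiss_hits))
```

Order within the top-3 is ignored in this sketch; only set membership counts.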

Layout

benchmarks/enron_emails/

  • setup_enron_emails.py: Prepare passages, build LEANN index, build FAISS baseline
  • evaluate_enron_emails.py: Evaluate retrieval recall (Stages 2-5) + generation with Qwen3-8B
  • data/: Generated passages, queries, embeddings-related files
  • baseline/: FAISS Flat baseline files
  • llm_utils.py: LLM utilities for Qwen3-8B generation (in parent directory)

Quickstart

  1. Prepare the data and index

cd benchmarks/enron_emails
python setup_enron_emails.py --data-dir data

Notes:

  • If --emails-csv is omitted, the script attempts to download the Kaggle dataset wcukierski/enron-email-dataset via the Kaggle API (requires KAGGLE_USERNAME and KAGGLE_KEY). Alternatively, pass a local CSV path with --emails-csv.

  • The script parses emails, chunks header/body into passages, builds a compact LEANN index, and then builds a FAISS Flat baseline from the same passages and embedding model.
  • Optionally, it will also create evaluation queries from HuggingFace dataset corbt/enron_emails_sample_questions.
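
The parsing and chunking step described in the notes can be sketched with Python's standard email module. This is a minimal illustration, not the setup script's code: the function name email_to_passages, the 500-character chunk size, the chosen header fields, and the "#header"/"#bodyN" ID scheme are all assumptions. Only the "message" and "file" column names and the use of the Message-ID header come from this README.

```python
from email import message_from_string


def email_to_passages(row, chunk_chars=500):
    """Split one CSV row (keys: 'file', 'message') into header and body passages."""
    msg = message_from_string(row["message"])
    # Prefer the Message-ID header; fall back to the 'file' column when absent.
    msg_id = (msg.get("Message-ID") or row["file"]).strip()

    header_fields = ("From", "To", "Subject", "Date")
    header = "\n".join(f"{name}: {msg.get(name, '')}" for name in header_fields)

    body = msg.get_payload()
    if not isinstance(body, str):  # multipart messages are skipped in this sketch
        body = ""

    passages = [{"id": f"{msg_id}#header", "text": header}]
    for start in range(0, len(body), chunk_chars):
        passages.append({
            "id": f"{msg_id}#body{start // chunk_chars}",
            "text": body[start:start + chunk_chars],
        })
    return passages


# A hypothetical row, shaped like what csv.DictReader would yield for the emails CSV.
row = {
    "file": "example-user/sent/1.",
    "message": (
        "Message-ID: <1.JavaMail@example>\n"
        "From: a@example.com\n"
        "To: b@example.com\n"
        "Subject: Test\n"
        "\n"
        "Hello world"
    ),
}
passages = email_to_passages(row)
print([p["id"] for p in passages])
```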
  2. Run recall evaluation (Stage 2)

python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 2

  3. Complexity sweep (Stage 3)

python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 3 --target-recall 0.90 --max-queries 200

Stage 3 uses binary search over complexity to find the minimal value achieving the target Recall@3 (assumes recall is non-decreasing with complexity). The search expands the upper bound as needed and snaps complexity to multiples of 8.
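
The search described above can be sketched as follows. This is an illustrative version under stated assumptions, not the script's code: find_min_complexity, evaluate_recall, the starting upper bound of 64, and the cap of 1024 are hypothetical, and toy_recall is a synthetic monotone curve rather than measured data.

```python
def find_min_complexity(evaluate_recall, target=0.90, step=8, start_hi=64, max_hi=1024):
    """Return the smallest multiple of `step` whose recall meets `target`, or None."""
    # Expand the upper bound until it reaches the target (or give up at max_hi).
    hi = start_hi
    while evaluate_recall(hi) < target:
        if hi >= max_hi:
            return None
        hi = min(hi * 2, max_hi)

    # Binary search in units of `step`, so every probe is a multiple of `step`.
    # Correctness relies on recall being non-decreasing in complexity.
    lo_units, hi_units = 1, hi // step
    while lo_units < hi_units:
        mid = (lo_units + hi_units) // 2
        if evaluate_recall(mid * step) >= target:
            hi_units = mid
        else:
            lo_units = mid + 1
    return lo_units * step


def toy_recall(complexity):
    """Synthetic, monotonically increasing stand-in for a real Recall@3 measurement."""
    return complexity / (complexity + 10)


print(find_min_complexity(toy_recall, target=0.90))
```

In a real run each call to evaluate_recall is a full retrieval pass over the query set, so caching results per complexity value would avoid repeated work.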

  4. Index comparison (Stage 4)

python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 4 --complexity 88 --max-queries 100 --output results.json

  5. Generation evaluation (Stage 5)

python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 5 --complexity 88 --llm-backend hf --model-name Qwen/Qwen3-8B

  6. Combined index + generation evaluation (Stages 4+5, recommended)

python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 45 --complexity 88 --llm-backend hf

Notes:

  • Minimal CLI: you can run from the repo root with only --index; the defaults match the financebench/laion patterns:
    • --stage defaults to all (runs 2, 3, 4, 5)
    • --baseline-dir defaults to baseline
    • --queries defaults to data/evaluation_queries.jsonl (or falls back to the index directory)
    • --llm-backend defaults to hf (HuggingFace), can use vllm
    • --model-name defaults to Qwen/Qwen3-8B
  • Fail-fast behavior: no silent fallbacks. If compact index cannot run with recompute, it errors out.
  • Stage 5 requires Stage 4 retrieval results. Use --stage 45 to run both efficiently.

Optional flags:

  • --queries data/evaluation_queries.jsonl (custom queries file)
  • --baseline-dir baseline (where FAISS baseline lives)
  • --complexity 88 (LEANN complexity parameter; 88 is the minimal value that reached 90% Recall@3 in the example results)
  • --llm-backend hf|vllm (LLM backend for generation)
  • --model-name Qwen/Qwen3-8B (LLM model for generation)
  • --max-queries 1000 (limit number of queries for evaluation)

Files Produced

  • data/enron_passages_preview.jsonl: Small preview of passages used (for inspection)
  • data/enron_index_hnsw.leann.*: LEANN index files
  • baseline/faiss_flat.index + baseline/metadata.pkl: FAISS baseline with passage IDs
  • data/evaluation_queries.jsonl: Query file (id + query; includes GT IDs for reference)

Notes

  • Evaluates both retrieval Recall@3 and generation timing with Qwen3-8B thinking model.
  • The emails CSV must contain a column named "message" (raw RFC822 email) and a column named "file" for source identifier. Message-ID headers are parsed as canonical message IDs when present.
  • Qwen3-8B is a thinking model, so generation needs special handling: the prompt goes through a chat template and the thinking tags in the output are processed.
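
To illustrate the tag processing mentioned in the last note: Qwen3 emits its reasoning between <think> and </think> before the final answer. A minimal sketch of removing that reasoning from generated text might look like the following. The helper name strip_thinking is hypothetical, and the actual handling in llm_utils.py is not shown in this README.

```python
def strip_thinking(text):
    """Return only the final answer from a Qwen3-style generation."""
    if "</think>" in text:
        # Keep what follows the last closing tag; this also covers outputs
        # where the opening <think> tag is missing from the decoded text.
        text = text.rsplit("</think>", 1)[1]
    else:
        # Generation was cut off mid-thought: drop the unterminated block.
        text = text.split("<think>", 1)[0]
    return text.strip()


raw = "<think>The email mentions a meeting on Friday.</think>\n\nThe meeting is on Friday."
print(strip_thinking(raw))
```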

Stages Summary

  • Stage 2 (Recall@3):

    • Compares LEANN vs FAISS Flat baseline on Recall@3.
    • Compact index runs with recompute_embeddings=True.
  • Stage 3 (Binary Search for Complexity):

    • Builds a non-compact index (<index>_noncompact.leann) and runs binary search with recompute_embeddings=False to find the minimal complexity achieving target Recall@3 (default 90%).
  • Stage 4 (Index Comparison):

    • Reports .index-only sizes for compact vs non-compact.
    • Measures timings on queries by default: non-compact (no recompute) vs compact (with recompute).
    • Stores retrieval results for Stage 5 generation evaluation.
    • Fails fast if compact recompute cannot run.
    • If --complexity is not provided, the script tries to use the best complexity from Stage 3:
      • First from the current run (when running --stage all), otherwise
      • From enron_stage3_results.json saved next to the index during the last Stage 3 run.
      • If neither exists, Stage 4 will error and ask you to run Stage 3 or pass --complexity.
  • Stage 5 (Generation Evaluation):

    • Uses Qwen3-8B thinking model for RAG generation on retrieved documents from Stage 4.
    • Supports HuggingFace (hf) and vLLM (vllm) backends.
    • Measures generation timing separately from search timing.
    • Requires Stage 4 results (no additional searching performed).
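
The Stage 4 complexity lookup order described above (explicit flag, then the current run, then the saved Stage 3 file, otherwise an error) can be sketched like this. It is an illustration only: resolve_complexity and the "best_complexity" JSON key are hypothetical, since this README does not document the layout of enron_stage3_results.json.

```python
import json
import tempfile
from pathlib import Path


def resolve_complexity(cli_value, current_run_best, index_path):
    """Pick the Stage 4 complexity, failing loudly when none is available."""
    if cli_value is not None:            # --complexity was passed explicitly
        return cli_value
    if current_run_best is not None:     # Stage 3 ran earlier in this invocation
        return current_run_best
    results_file = Path(index_path).parent / "enron_stage3_results.json"
    if results_file.exists():
        saved = json.loads(results_file.read_text())
        # "best_complexity" is an assumed key name, used here for illustration.
        if "best_complexity" in saved:
            return int(saved["best_complexity"])
    raise ValueError("No complexity available: run Stage 3 or pass --complexity.")


with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "enron_index_hnsw.leann"
    try:
        resolve_complexity(None, None, index_path)
        missing_raises = False
    except ValueError:
        missing_raises = True

    stage3_file = Path(tmp) / "enron_stage3_results.json"
    stage3_file.write_text(json.dumps({"best_complexity": 88}))
    from_cli = resolve_complexity(64, None, index_path)
    from_file = resolve_complexity(None, None, index_path)

print(from_cli, from_file, missing_raises)
```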

Example Results

These are sample results obtained on Enron data using all-mpnet-base-v2 and Qwen3-8B.

  • Stage 3 (Binary Search):

    • Minimal complexity achieving 90% Recall@3: 88
    • Sampled points:
      • C=8 → 59.9% Recall@3
      • C=72 → 89.4% Recall@3
      • C=88 → 90.2% Recall@3
      • C=96 → 90.7% Recall@3
      • C=112 → 91.1% Recall@3
      • C=136 → 91.3% Recall@3
      • C=256 → 92.0% Recall@3
  • Stage 4 (Index Sizes, .index only):

    • Compact: ~2.2 MB
    • Non-compact: ~82.0 MB
    • Storage saving by compact: ~97.3%
  • Stage 4 (Search Timing, 988 queries, complexity=88):

    • Non-compact (no recompute): ~0.0075 s avg per query
    • Compact (with recompute): ~1.981 s avg per query
    • Speed ratio (non-compact/compact time per query): ~0.0038x, i.e. compact search with recompute is roughly 264x slower
  • Stage 5 (RAG Generation, 988 queries, Qwen3-8B):

    • Average generation time: ~22.302 s per query
    • Total queries processed: 988
    • LLM backend: HuggingFace transformers
    • Model: Qwen/Qwen3-8B (thinking model with processing)
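
The derived figures in these sample results follow directly from the reported measurements. The short check below recomputes them from the numbers listed above; the variable names are chosen for this example only.

```python
# Stage 4 index sizes (.index only), in MB, as reported above.
compact_mb, noncompact_mb = 2.2, 82.0
storage_saving = 1 - compact_mb / noncompact_mb

# Stage 4 average search time per query, in seconds.
noncompact_s, compact_s = 0.0075, 1.981
speed_ratio = noncompact_s / compact_s   # non-compact time / compact time
slowdown = compact_s / noncompact_s      # how much slower compact search is

# Stage 5 total generation time across all queries.
queries, gen_s_per_query = 988, 22.302
total_gen_hours = queries * gen_s_per_query / 3600

print(f"storage saving: {storage_saving:.1%}")
print(f"speed ratio:    {speed_ratio:.4f}x")
print(f"slowdown:       {slowdown:.0f}x")
print(f"generation:     {total_gen_hours:.1f} h total")
```

At these sample values the compact index trades about 97% of the index storage for a slowdown of over two orders of magnitude in per-query search time, and generation time still dominates end-to-end latency.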

Full JSON output is saved by the script (see --output), e.g.: benchmarks/enron_emails/results_enron_stage45.json.