docs: data updated

2025-09-15 19:50:02 -07:00
parent d7011bbea0
commit a0d6857faa
9 changed files with 749 additions and 133 deletions
@@ -1,18 +1,19 @@
 # Enron Emails Benchmark

-A retrieval-only benchmark for evaluating LEANN search on the Enron email corpus. It mirrors the structure and CLI of the existing FinanceBench and LAION benches, using stage-based evaluation focused on Recall@3.
+A comprehensive RAG benchmark for evaluating LEANN search and generation on the Enron email corpus. It mirrors the structure and CLI of the existing FinanceBench and LAION benches, using stage-based evaluation with Recall@3 and generation timing.

 - Dataset: Enron email CSV (e.g., Kaggle wcukierski/enron-email-dataset) for passages
 - Queries: corbt/enron_emails_sample_questions (filtered for realistic questions)
- Metric: Recall@3 vs FAISS Flat baseline
+- Metrics: Recall@3 vs FAISS Flat baseline, plus generation timing with Qwen3-8B

 ## Layout

 benchmarks/enron_emails/
 - setup_enron_emails.py: Prepare passages, build LEANN index, build FAISS baseline
- evaluate_enron_emails.py: Evaluate retrieval recall (Stage 2)
+- evaluate_enron_emails.py: Evaluate retrieval (Stages 2-4) and generation with Qwen3-8B (Stage 5)
 - data/: Generated passages, queries, embeddings-related files
 - baseline/: FAISS Flat baseline files
+- llm_utils.py: LLM utilities for Qwen3-8B generation (in parent directory)

 ## Quickstart

@@ -41,23 +42,33 @@ Stage 3 uses binary search over complexity to find the minimal value achieving t

 4) Index comparison (Stage 4)

-python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 4 --max-queries 100 --output results.json
+python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 4 --complexity 88 --max-queries 100 --output results.json
+
+5) Generation evaluation (Stage 5)
+
+python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 5 --complexity 88 --llm-backend hf --model-name Qwen/Qwen3-8B
+
+6) Combined index + generation evaluation (Stages 4+5, recommended)
+
+python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 45 --complexity 88 --llm-backend hf

 Notes:
 - Minimal CLI: you can run from repo root with only `--index`, defaults match financebench/laion patterns:
-  - `--stage` defaults to `all` (runs 2, 3, 4)
+  - `--stage` defaults to `all` (runs 2, 3, 4, 5)
  - `--baseline-dir` defaults to `baseline`
  - `--queries` defaults to `data/evaluation_queries.jsonl` (or falls back to the index directory)
+  - `--llm-backend` defaults to `hf` (HuggingFace); `vllm` is also supported
+  - `--model-name` defaults to `Qwen/Qwen3-8B`
 - Fail-fast behavior: no silent fallbacks. If compact index cannot run with recompute, it errors out.
-
-4) Index comparison (Stage 4)
-
-python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 4 --max-queries 100 --output results.json
+- Stage 5 requires Stage 4 retrieval results. Use `--stage 45` to run both efficiently.

 Optional flags:
 - --queries data/evaluation_queries.jsonl (custom queries file)
 - --baseline-dir baseline (where FAISS baseline lives)
- --complexity 64 (LEANN complexity parameter)
+- --complexity 88 (LEANN complexity parameter; 88 is the minimal value reaching 90% Recall@3 in the example results)
+- --llm-backend hf|vllm (LLM backend for generation)
+- --model-name Qwen/Qwen3-8B (LLM model for generation)
+- --max-queries 1000 (limit number of queries for evaluation)

 ## Files Produced
 - data/enron_passages_preview.jsonl: Small preview of passages used (for inspection)
@@ -66,8 +77,9 @@ Optional flags:
 - data/evaluation_queries.jsonl: Query file (id + query; includes GT IDs for reference)

 ## Notes
- We only evaluate retrieval Recall@3 (no generation). This matches the other benches’ style and stage flow.
+- Evaluates both retrieval Recall@3 and generation timing with the Qwen3-8B thinking model.
 - The emails CSV must contain a column named "message" (raw RFC822 email) and a column named "file" for source identifier. Message-ID headers are parsed as canonical message IDs when present.
+- Qwen3-8B is a thinking model, so generation requires applying its chat template and processing the <think></think> tags in its output.

 ## Stages Summary

@@ -80,16 +92,23 @@ Optional flags:

 - Stage 4 (Index Comparison):
  - Reports .index-only sizes for compact vs non-compact.
-  - Measures timings on 100 queries by default: non-compact (no recompute) vs compact (with recompute).
+  - Measures search timings over the evaluation queries (limit with `--max-queries`): non-compact (no recompute) vs compact (with recompute).
+  - Stores retrieval results for Stage 5 generation evaluation.
  - Fails fast if compact recompute cannot run.
  - If `--complexity` is not provided, the script tries to use the best complexity from Stage 3:
    - First from the current run (when running `--stage all`), otherwise
    - From `enron_stage3_results.json` saved next to the index during the last Stage 3 run.
    - If neither exists, Stage 4 will error and ask you to run Stage 3 or pass `--complexity`.

+- Stage 5 (Generation Evaluation):
+  - Uses the Qwen3-8B thinking model for RAG generation on the documents retrieved in Stage 4.
+  - Supports HuggingFace (`hf`) and vLLM (`vllm`) backends.
+  - Measures generation timing separately from search timing.
+  - Reuses the retrieval results stored by Stage 4; no additional search is performed.
+
 ## Example Results

-These are sample results obtained on a subset of Enron data using all-mpnet-base-v2.
+These are sample results obtained on Enron data using all-mpnet-base-v2 and Qwen3-8B.

 - Stage 3 (Binary Search):
  - Minimal complexity achieving 90% Recall@3: 88
@@ -103,14 +122,20 @@ These are sample results obtained on a subset of Enron data using all-mpnet-base
    - C=256 → 92.0% Recall@3

 - Stage 4 (Index Sizes, .index only):
-  - Compact: ~2.17 MB
-  - Non-compact: ~82.03 MB
-  - Storage saving by compact: ~97.35%
+  - Compact: ~2.2 MB
+  - Non-compact: ~82.0 MB
+  - Storage saving by compact: ~97.3%

- Stage 4 (Timing, 100 queries, complexity=88):
-  - Non-compact (no recompute): ~0.0074 s avg per query
-  - Compact (with recompute): ~1.947 s avg per query
+- Stage 4 (Search Timing, 988 queries, complexity=88):
+  - Non-compact (no recompute): ~0.0075 s avg per query
+  - Compact (with recompute): ~1.981 s avg per query
  - Speed ratio (non-compact/compact): ~0.0038x

-Full JSON output for Stage 4 is saved by the script (see `--output`), e.g.:
-`benchmarks/enron_emails/results_enron_stage4.json`.
+- Stage 5 (RAG Generation, 988 queries, Qwen3-8B):
+  - Average generation time: ~22.302 s per query
+  - Total queries processed: 988
+  - LLM backend: HuggingFace transformers
+  - Model: Qwen/Qwen3-8B (thinking model with <think></think> processing)
+
+Full JSON output is saved by the script (see `--output`), e.g.:
+`benchmarks/enron_emails/results_enron_stage45.json`.
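The README notes that Qwen3-8B wraps its reasoning in `<think></think>` tags that need processing. The sketch below shows one possible way to separate that reasoning from the final answer. `split_thinking` is a hypothetical helper written for illustration only; the actual handling lives in `llm_utils.py`, which is not shown in this commit, and may differ.

```python
import re

# Hypothetical helper for illustration; not taken from llm_utils.py.
_THINK_RE = re.compile(r"<think>(.*?)</think>", flags=re.DOTALL)


def split_thinking(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw thinking-model completion."""
    # Collect the text of every <think> span, then remove the spans
    # (tags included) from the text that is kept as the answer.
    reasoning = "\n".join(part.strip() for part in _THINK_RE.findall(raw))
    answer = _THINK_RE.sub("", raw).strip()
    return reasoning, answer


if __name__ == "__main__":
    # Made-up completion, only to show the shape of the input.
    raw = "<think>The sender is named in the header.</think>\nThe email was sent by the assistant."
    reasoning, answer = split_thinking(raw)
    print("reasoning:", reasoning)
    print("answer:", answer)
```

Keeping the reasoning text rather than discarding it makes it possible to inspect why a generation took unusually long, since thinking tokens count toward the measured generation time.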
@@ -7,13 +7,22 @@ On errors, fail fast without fallbacks.

 import argparse
 import json
+import logging
 import os
 import pickle
+from pathlib import Path

 import numpy as np
 from leann import LeannBuilder, LeannSearcher
 from leann_backend_hnsw import faiss

+from ..llm_utils import generate_hf, generate_vllm, load_hf_model, load_vllm_model
+
+# Setup logging to reduce verbose output
+logging.basicConfig(level=logging.WARNING)
+logging.getLogger("leann.api").setLevel(logging.WARNING)
+logging.getLogger("leann_backend_hnsw").setLevel(logging.WARNING)
+

 class RecallEvaluator:
    """Stage 2: Evaluate Recall@3 (LEANN vs FAISS)"""
@@ -119,7 +128,6 @@ class EnronEvaluator:

    def analyze_index_sizes(self) -> dict:
        """Analyze index sizes (.index only), similar to LAION bench."""
-        from pathlib import Path

        print("📏 Analyzing index sizes (.index only)...")
        index_path = Path(self.index_path)
@@ -150,7 +158,6 @@ class EnronEvaluator:

    def create_non_compact_index_for_comparison(self, non_compact_index_path: str) -> dict:
        """Create a non-compact index for comparison using current passages and embeddings."""
-        from pathlib import Path

        current_index_path = Path(self.index_path)
        current_index_dir = current_index_path.parent
@@ -230,6 +237,7 @@ class EnronEvaluator:
            "compact": {"search_times": []},
            "avg_search_times": {},
            "speed_ratio": 0.0,
+            "retrieval_results": [],  # Store retrieval results for Stage 5
        }

        print("⚡ Comparing search performance between indexes...")
@@ -248,10 +256,15 @@ class EnronEvaluator:
        compact_searcher = LeannSearcher(compact_path)
        for q in test_queries:
            t0 = time.time()
-            _ = compact_searcher.search(
+            docs = compact_searcher.search(
                q, top_k=3, complexity=complexity, recompute_embeddings=True
            )
            results["compact"]["search_times"].append(time.time() - t0)
+
+            # Store retrieval results for Stage 5
+            results["retrieval_results"].append(
+                {"query": q, "retrieved_docs": [{"id": doc.id, "text": doc.text} for doc in docs]}
+            )
        compact_searcher.cleanup()

        if results["non_compact"]["search_times"]:
@@ -358,9 +371,9 @@ def main():
    )
    parser.add_argument(
        "--stage",
-        choices=["2", "3", "4", "all"],
+        choices=["2", "3", "4", "5", "all", "45"],
        default="all",
-        help="Which stage to run (2=recall, 3=complexity, 4=index comparison)",
+        help="Which stage to run (2=recall, 3=complexity, 4=index comparison, 5=generation, 45=stages 4+5, all=2-5)",
    )
    parser.add_argument("--complexity", type=int, default=None, help="LEANN search complexity")
    parser.add_argument("--baseline-dir", default="baseline", help="Baseline output directory")
@@ -371,6 +384,8 @@ def main():
        "--target-recall", type=float, default=0.90, help="Target Recall@3 for Stage 3"
    )
    parser.add_argument("--output", help="Save results to JSON file")
+    parser.add_argument("--llm-backend", choices=["hf", "vllm"], default="hf", help="LLM backend")
+    parser.add_argument("--model-name", default="Qwen/Qwen3-8B", help="Model name")

    args = parser.parse_args()

@@ -438,7 +453,7 @@ def main():
        enron_eval.cleanup()
        print("✅ Stage 3 completed!\n")

-    if args.stage in ("4", "all"):
+    if args.stage in ("4", "all", "45"):
        print("🚀 Starting Stage 4: Index size + performance comparison")
        evaluator = RecallEvaluator(args.index, args.baseline_dir)
        enron_eval = EnronEvaluator(args.index)
@@ -503,6 +518,92 @@ def main():
        enron_eval.cleanup()
        print("✅ Stage 4 completed!\n")

+    if args.stage in ("5", "all", "45"):
+        print("🚀 Starting Stage 5: Generation evaluation with Qwen3-8B")
+
+        # Check if Stage 4 results exist
+        if "stage4" not in results_out or "performance_comparison" not in results_out["stage4"]:
+            print("❌ Stage 5 requires Stage 4 retrieval results")
+            print("💡 Use --stage 45 or --stage all so Stage 4 runs first")
+            raise SystemExit(1)
+
+        retrieval_results = results_out["stage4"]["performance_comparison"]["retrieval_results"]
+        if not retrieval_results:
+            print("❌ No retrieval results found from Stage 4")
+            raise SystemExit(1)
+
+        print(f"📁 Using {len(retrieval_results)} retrieval results from Stage 4")
+
+        # Load LLM
+        try:
+            if args.llm_backend == "hf":
+                tokenizer, model = load_hf_model(args.model_name)
+
+                def llm_func(prompt):
+                    return generate_hf(tokenizer, model, prompt)
+            else:  # vllm
+                llm, sampling_params = load_vllm_model(args.model_name)
+
+                def llm_func(prompt):
+                    return generate_vllm(llm, sampling_params, prompt)
+
+            # Run generation using stored retrieval results
+            import time
+
+            from ..llm_utils import create_prompt
+
+            generation_times = []
+            responses = []
+
+            print("🤖 Running generation on pre-retrieved results...")
+            for i, item in enumerate(retrieval_results):
+                query = item["query"]
+                retrieved_docs = item["retrieved_docs"]
+
+                # Prepare context from retrieved docs
+                context = "\n\n".join([doc["text"] for doc in retrieved_docs])
+                prompt = create_prompt(context, query, "emails")
+
+                # Time generation only
+                gen_start = time.time()
+                response = llm_func(prompt)
+                gen_time = time.time() - gen_start
+
+                generation_times.append(gen_time)
+                responses.append(response)
+
+                if i < 3:
+                    print(f"  Q{i + 1}: Gen={gen_time:.3f}s")
+
+            avg_gen_time = sum(generation_times) / len(generation_times)
+
+            print("\n📊 Generation Results:")
+            print(f"  Total Queries: {len(retrieval_results)}")
+            print(f"  Avg Generation Time: {avg_gen_time:.3f}s")
+            print("  (Search time from Stage 4)")
+
+            results_out["stage5"] = {
+                "total_queries": len(retrieval_results),
+                "avg_generation_time": avg_gen_time,
+                "generation_times": generation_times,
+                "responses": responses,
+            }
+
+            # Show sample results
+            print("\n📝 Sample Results:")
+            for i in range(min(3, len(retrieval_results))):
+                query = retrieval_results[i]["query"]
+                response = responses[i]
+                print(f"  Q{i + 1}: {query[:60]}...")
+                print(f"  A{i + 1}: {response[:100]}...")
+                print()
+
+        except Exception as e:
+            print(f"❌ Generation evaluation failed: {e}")
+            raise SystemExit("💡 Make sure transformers/vllm is installed and model is available") from e
+
+        print("✅ Stage 5 completed!\n")
+
    if args.output and results_out:
        with open(args.output, "w", encoding="utf-8") as f:
            json.dump(results_out, f, indent=2)
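Since the script writes `results_out` to the `--output` file, the Stage 5 timings can be summarized afterwards without rerunning generation. The following is a minimal sketch, assuming the JSON keeps the `stage5` / `generation_times` keys shown in the diff above. `summarize_generation` is a hypothetical helper, not part of the benchmark, and the sample values are made up.

```python
import json
import math
import statistics


def summarize_generation(results: dict) -> dict:
    """Summarize Stage 5 timings from a results dict shaped like results_out."""
    times = results["stage5"]["generation_times"]
    if not times:
        raise ValueError("no generation times recorded")
    ordered = sorted(times)
    # Nearest-rank 95th percentile; adequate for a quick sanity check.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "queries": len(ordered),
        "mean_s": statistics.fmean(ordered),
        "median_s": statistics.median(ordered),
        "p95_s": ordered[p95_index],
    }


if __name__ == "__main__":
    # Inline sample with made-up timings. For a real run, load the file
    # passed to --output with json.load() instead.
    sample = {"stage5": {"generation_times": [21.8, 19.4, 25.1, 22.9]}}
    print(json.dumps(summarize_generation(sample), indent=2))
```

Reporting the median and a tail percentile alongside the mean is useful here because thinking models can produce a few very long generations that skew the average.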