docs: data updated

Author: Andy Lee
Date: 2025-09-15 19:50:02 -07:00
parent d7011bbea0
commit a0d6857faa
9 changed files with 749 additions and 133 deletions

View File

@@ -1,18 +1,19 @@
# Enron Emails Benchmark
A retrieval-only benchmark for evaluating LEANN search on the Enron email corpus. It mirrors the structure and CLI of the existing FinanceBench and LAION benches, using stage-based evaluation focused on Recall@3.
A comprehensive RAG benchmark for evaluating LEANN search and generation on the Enron email corpus. It mirrors the structure and CLI of the existing FinanceBench and LAION benches, using stage-based evaluation with Recall@3 and generation timing.
- Dataset: Enron email CSV (e.g., Kaggle wcukierski/enron-email-dataset) for passages
- Queries: corbt/enron_emails_sample_questions (filtered for realistic questions)
- Metric: Recall@3 vs FAISS Flat baseline
- Metrics: Recall@3 vs a FAISS Flat baseline (see the sketch below), plus generation timing with Qwen3-8B
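A minimal sketch of how the Recall@3 comparison works (illustrative only; it assumes a flat FAISS baseline built over the same passage embeddings and an ID list aligned with that index, not the script's exact helpers):

```python
import numpy as np


def recall_at_3(searcher, flat_index, passage_ids, query, query_emb, complexity=88):
    """Overlap between LEANN's top-3 and the exact top-3 from the FAISS Flat baseline."""
    # Ground truth: exact top-3 neighbors from the flat index
    _, idx = flat_index.search(query_emb.reshape(1, -1).astype(np.float32), 3)
    gt_ids = {passage_ids[i] for i in idx[0]}
    # Approximate top-3 from LEANN at the given complexity
    hits = searcher.search(query, top_k=3, complexity=complexity)
    return len(gt_ids & {h.id for h in hits}) / 3
```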
## Layout
benchmarks/enron_emails/
- setup_enron_emails.py: Prepare passages, build LEANN index, build FAISS baseline
- evaluate_enron_emails.py: Evaluate retrieval recall (Stage 2)
- evaluate_enron_emails.py: Evaluate retrieval recall (Stages 2-4) and Qwen3-8B generation (Stage 5)
- data/: Generated passages, queries, embeddings-related files
- baseline/: FAISS Flat baseline files
- llm_utils.py: LLM utilities for Qwen3-8B generation (in parent directory)
## Quickstart
@@ -41,23 +42,33 @@ Stage 3 uses binary search over complexity to find the minimal value achieving t
4) Index comparison (Stage 4)
python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 4 --max-queries 100 --output results.json
python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 4 --complexity 88 --max-queries 100 --output results.json
5) Generation evaluation (Stage 5)
python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 5 --complexity 88 --llm-backend hf --model-name Qwen/Qwen3-8B
6) Combined index + generation evaluation (Stages 4+5, recommended)
python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 45 --complexity 88 --llm-backend hf
Notes:
- Minimal CLI: you can run from the repo root with only `--index`; defaults match the financebench/laion patterns:
- `--stage` defaults to `all` (runs 2, 3, 4)
- `--stage` defaults to `all` (runs 2, 3, 4, 5)
- `--baseline-dir` defaults to `baseline`
- `--queries` defaults to `data/evaluation_queries.jsonl` (or falls back to the index directory)
- `--llm-backend` defaults to `hf` (HuggingFace), can use `vllm`
- `--model-name` defaults to `Qwen/Qwen3-8B`
- Fail-fast behavior: no silent fallbacks. If compact index cannot run with recompute, it errors out.
- Stage 5 requires Stage 4 retrieval results; use `--stage 45` to run both efficiently (the stored structure is sketched below).
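For reference, the structure Stage 4 stores and Stage 5 consumes looks roughly like this (shape only; the field names follow the evaluation script, the values are placeholders):

```python
retrieval_results = [
    {
        "query": "<query text>",
        "retrieved_docs": [
            {"id": "<passage id>", "text": "<passage text>"},
            # ... top-3 retrieved documents per query
        ],
    },
    # ... one entry per evaluated query
]
```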
Optional flags:
- --queries data/evaluation_queries.jsonl (custom queries file)
- --baseline-dir baseline (where FAISS baseline lives)
- --complexity 64 (LEANN complexity parameter)
- --complexity 88 (LEANN complexity parameter, optimal for 90% recall)
- --llm-backend hf|vllm (LLM backend for generation)
- --model-name Qwen/Qwen3-8B (LLM model for generation)
- --max-queries 1000 (limit number of queries for evaluation)
## Files Produced
- data/enron_passages_preview.jsonl: Small preview of passages used (for inspection)
@@ -66,8 +77,9 @@ Optional flags:
- data/evaluation_queries.jsonl: Query file (id + query; includes GT IDs for reference)
## Notes
- We only evaluate retrieval Recall@3 (no generation). This matches the other benches style and stage flow.
- Evaluates both retrieval Recall@3 and generation timing with the Qwen3-8B thinking model.
- The emails CSV must contain a "message" column (the raw RFC822 email) and a "file" column as the source identifier. Message-ID headers are parsed as canonical message IDs when present (see the sketch below).
- Qwen3-8B is a thinking model: generation applies its chat template with thinking enabled and strips the <think></think> block from the output.
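A minimal sketch of the expected CSV shape and Message-ID handling, using pandas and the standard library (illustrative; the setup script's parsing may differ in detail):

```python
import email

import pandas as pd  # assumption: the CSV is read with pandas

df = pd.read_csv("emails.csv")  # hypothetical path; needs "file" and "message" columns
for _, row in df.iterrows():
    msg = email.message_from_string(row["message"])  # raw RFC822 email
    # Prefer the Message-ID header as the canonical ID, fall back to the "file" column
    passage_id = msg.get("Message-ID") or row["file"]
    body = msg.get_payload()
```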
## Stages Summary
@@ -80,16 +92,23 @@ Optional flags:
- Stage 4 (Index Comparison):
- Reports .index-only sizes for compact vs non-compact.
- Measures timings on 100 queries by default: non-compact (no recompute) vs compact (with recompute).
- Measures search timings on the evaluation queries: non-compact (no recompute) vs compact (with recompute).
- Stores retrieval results for Stage 5 generation evaluation.
- Fails fast if compact recompute cannot run.
- If `--complexity` is not provided, the script tries to use the best complexity from Stage 3 (the binary search is sketched after this summary):
- First from the current run (when running `--stage all`), otherwise
- From `enron_stage3_results.json` saved next to the index during the last Stage 3 run.
- If neither exists, Stage 4 will error and ask you to run Stage 3 or pass `--complexity`.
- Stage 5 (Generation Evaluation):
- Uses Qwen3-8B thinking model for RAG generation on retrieved documents from Stage 4.
- Supports HuggingFace (`hf`) and vLLM (`vllm`) backends.
- Measures generation timing separately from search timing.
- Requires Stage 4 results (no additional searching performed).
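The Stage 3 binary search referenced above can be sketched as follows (illustrative; `recall_at_complexity` stands in for the script's recall measurement at a given complexity, and the search assumes recall does not decrease as complexity grows):

```python
def find_min_complexity(recall_at_complexity, target=0.90, lo=1, hi=128):
    """Smallest complexity in [lo, hi] whose Recall@3 reaches the target, else None."""
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if recall_at_complexity(mid) >= target:
            best = mid      # target met; try a smaller complexity
            hi = mid - 1
        else:
            lo = mid + 1    # recall too low; search the upper half
    return best
```

With the sample numbers below, this lands on complexity 88 for the 90% Recall@3 target.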
## Example Results
These are sample results obtained on a subset of Enron data using all-mpnet-base-v2.
These are sample results obtained on Enron data using all-mpnet-base-v2 and Qwen3-8B.
- Stage 3 (Binary Search):
- Minimal complexity achieving 90% Recall@3: 88
@@ -103,14 +122,20 @@ These are sample results obtained on a subset of Enron data using all-mpnet-base
- C=256 → 92.0% Recall@3
- Stage 4 (Index Sizes, .index only):
- Compact: ~2.17 MB
- Non-compact: ~82.03 MB
- Storage saving by compact: ~97.35%
- Compact: ~2.2 MB
- Non-compact: ~82.0 MB
- Storage saving by compact: ~97.3%
- Stage 4 (Timing, 100 queries, complexity=88):
- Non-compact (no recompute): ~0.0074 s avg per query
- Compact (with recompute): ~1.947 s avg per query
- Stage 4 (Search Timing, 988 queries, complexity=88):
- Non-compact (no recompute): ~0.0075 s avg per query
- Compact (with recompute): ~1.981 s avg per query
- Speed ratio (non-compact/compact): ~0.0038x
Full JSON output for Stage 4 is saved by the script (see `--output`), e.g.:
`benchmarks/enron_emails/results_enron_stage4.json`.
- Stage 5 (RAG Generation, 988 queries, Qwen3-8B):
- Average generation time: ~22.302 s per query
- Total queries processed: 988
- LLM backend: HuggingFace transformers
- Model: Qwen/Qwen3-8B (thinking model with <think></think> processing)
Full JSON output is saved by the script (see `--output`), e.g.:
`benchmarks/enron_emails/results_enron_stage45.json`.

View File

@@ -7,13 +7,22 @@ On errors, fail fast without fallbacks.
import argparse
import json
import logging
import os
import pickle
from pathlib import Path
import numpy as np
from leann import LeannBuilder, LeannSearcher
from leann_backend_hnsw import faiss
import sys

# llm_utils.py lives one directory up (benchmarks/); add it to sys.path so the import
# also works when this file is run directly as a script.
sys.path.append(os.path.join(os.path.dirname(os.path.abspath(__file__)), ".."))
from llm_utils import generate_hf, generate_vllm, load_hf_model, load_vllm_model
# Setup logging to reduce verbose output
logging.basicConfig(level=logging.WARNING)
logging.getLogger("leann.api").setLevel(logging.WARNING)
logging.getLogger("leann_backend_hnsw").setLevel(logging.WARNING)
class RecallEvaluator:
"""Stage 2: Evaluate Recall@3 (LEANN vs FAISS)"""
@@ -119,7 +128,6 @@ class EnronEvaluator:
def analyze_index_sizes(self) -> dict:
"""Analyze index sizes (.index only), similar to LAION bench."""
from pathlib import Path
print("📏 Analyzing index sizes (.index only)...")
index_path = Path(self.index_path)
@@ -150,7 +158,6 @@ class EnronEvaluator:
def create_non_compact_index_for_comparison(self, non_compact_index_path: str) -> dict:
"""Create a non-compact index for comparison using current passages and embeddings."""
from pathlib import Path
current_index_path = Path(self.index_path)
current_index_dir = current_index_path.parent
@@ -230,6 +237,7 @@ class EnronEvaluator:
"compact": {"search_times": []},
"avg_search_times": {},
"speed_ratio": 0.0,
"retrieval_results": [], # Store retrieval results for Stage 5
}
print("⚡ Comparing search performance between indexes...")
@@ -248,10 +256,15 @@ class EnronEvaluator:
compact_searcher = LeannSearcher(compact_path)
for q in test_queries:
t0 = time.time()
_ = compact_searcher.search(
docs = compact_searcher.search(
q, top_k=3, complexity=complexity, recompute_embeddings=True
)
results["compact"]["search_times"].append(time.time() - t0)
# Store retrieval results for Stage 5
results["retrieval_results"].append(
{"query": q, "retrieved_docs": [{"id": doc.id, "text": doc.text} for doc in docs]}
)
compact_searcher.cleanup()
if results["non_compact"]["search_times"]:
@@ -358,9 +371,9 @@ def main():
)
parser.add_argument(
"--stage",
choices=["2", "3", "4", "all"],
choices=["2", "3", "4", "5", "all", "45"],
default="all",
help="Which stage to run (2=recall, 3=complexity, 4=index comparison)",
help="Which stage to run (2=recall, 3=complexity, 4=index comparison, 5=generation)",
)
parser.add_argument("--complexity", type=int, default=None, help="LEANN search complexity")
parser.add_argument("--baseline-dir", default="baseline", help="Baseline output directory")
@@ -371,6 +384,8 @@ def main():
"--target-recall", type=float, default=0.90, help="Target Recall@3 for Stage 3"
)
parser.add_argument("--output", help="Save results to JSON file")
parser.add_argument("--llm-backend", choices=["hf", "vllm"], default="hf", help="LLM backend")
parser.add_argument("--model-name", default="Qwen/Qwen3-8B", help="Model name")
args = parser.parse_args()
@@ -438,7 +453,7 @@ def main():
enron_eval.cleanup()
print("✅ Stage 3 completed!\n")
if args.stage in ("4", "all"):
if args.stage in ("4", "all", "45"):
print("🚀 Starting Stage 4: Index size + performance comparison")
evaluator = RecallEvaluator(args.index, args.baseline_dir)
enron_eval = EnronEvaluator(args.index)
@@ -503,6 +518,92 @@ def main():
enron_eval.cleanup()
print("✅ Stage 4 completed!\n")
if args.stage in ("5", "all"):
print("🚀 Starting Stage 5: Generation evaluation with Qwen3-8B")
# Check if Stage 4 results exist
if "stage4" not in results_out or "performance_comparison" not in results_out["stage4"]:
print("❌ Stage 5 requires Stage 4 retrieval results")
print("💡 Run Stage 4 first or use --stage all")
raise SystemExit(1)
retrieval_results = results_out["stage4"]["performance_comparison"]["retrieval_results"]
if not retrieval_results:
print("❌ No retrieval results found from Stage 4")
raise SystemExit(1)
print(f"📁 Using {len(retrieval_results)} retrieval results from Stage 4")
# Load LLM
try:
if args.llm_backend == "hf":
tokenizer, model = load_hf_model(args.model_name)
def llm_func(prompt):
return generate_hf(tokenizer, model, prompt)
else: # vllm
llm, sampling_params = load_vllm_model(args.model_name)
def llm_func(prompt):
return generate_vllm(llm, sampling_params, prompt)
# Run generation using stored retrieval results
import time
from llm_utils import create_prompt
generation_times = []
responses = []
print("🤖 Running generation on pre-retrieved results...")
for i, item in enumerate(retrieval_results):
query = item["query"]
retrieved_docs = item["retrieved_docs"]
# Prepare context from retrieved docs
context = "\n\n".join([doc["text"] for doc in retrieved_docs])
prompt = create_prompt(context, query, "emails")
# Time generation only
gen_start = time.time()
response = llm_func(prompt)
gen_time = time.time() - gen_start
generation_times.append(gen_time)
responses.append(response)
if i < 3:
print(f" Q{i + 1}: Gen={gen_time:.3f}s")
avg_gen_time = sum(generation_times) / len(generation_times)
print("\n📊 Generation Results:")
print(f" Total Queries: {len(retrieval_results)}")
print(f" Avg Generation Time: {avg_gen_time:.3f}s")
print(" (Search time from Stage 4)")
results_out["stage5"] = {
"total_queries": len(retrieval_results),
"avg_generation_time": avg_gen_time,
"generation_times": generation_times,
"responses": responses,
}
# Show sample results
print("\n📝 Sample Results:")
for i in range(min(3, len(retrieval_results))):
query = retrieval_results[i]["query"]
response = responses[i]
print(f" Q{i + 1}: {query[:60]}...")
print(f" A{i + 1}: {response[:100]}...")
print()
except Exception as e:
print(f"❌ Generation evaluation failed: {e}")
print("💡 Make sure transformers/vllm is installed and model is available")
print("✅ Stage 5 completed!\n")
if args.output and results_out:
with open(args.output, "w", encoding="utf-8") as f:
json.dump(results_out, f, indent=2)

View File

@@ -45,9 +45,9 @@ This will:
# Basic retrieval evaluation
python evaluate_financebench.py --index data/index/financebench_full_hnsw.leann
# Include QA evaluation with OpenAI
export OPENAI_API_KEY="your-key"
python evaluate_financebench.py --index data/index/financebench_full_hnsw.leann --qa-samples 20
# RAG generation evaluation with Qwen3-8B
python evaluate_financebench.py --index data/index/financebench_full_hnsw.leann --stage 4 --complexity 64 --llm-backend hf --model-name Qwen/Qwen3-8B --output results_qwen3.json
```
## Evaluation Methods
@@ -85,6 +85,24 @@ LLM-based answer evaluation using GPT-4o:
*Note: Number match rate >100% indicates multiple retrieved documents contain the same financial figures, which is expected behavior for financial data appearing across multiple document sections.
### LEANN-RAG Generation Performance (Qwen3-8B)
- **Stage 4 (Index Comparison):**
- Compact Index: 5.0 MB
- Non-compact Index: 172.2 MB
- **Storage Saving**: 97.1%
- **Search Performance**:
- Non-compact (no recompute): 0.009s avg per query
- Compact (with recompute): 2.203s avg per query
- Speed ratio: 0.004x
**Generation Evaluation (20 queries, complexity=64):**
- **Average Search Time**: 1.638s per query
- **Average Generation Time**: 45.957s per query
- **LLM Backend**: HuggingFace transformers
- **Model**: Qwen/Qwen3-8B (thinking model; the <think></think> handling is sketched below)
- **Total Questions Processed**: 20
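The <think></think> handling mentioned above boils down to keeping only the text after the closing tag; a minimal sketch mirroring `extract_thinking_answer` in `benchmarks/llm_utils.py`:

```python
def extract_final_answer(response: str) -> str:
    """Drop the model's <think>...</think> reasoning block and keep the final answer."""
    if "<think>" in response and "</think>" in response:
        return response.split("</think>", 1)[1].strip()
    return response.strip()
```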
## Options
```bash

View File

@@ -4,20 +4,25 @@ FinanceBench Evaluation Script - Modular Recall-based Evaluation
import argparse
import json
import logging
import os
import pickle
import time
from pathlib import Path
from typing import Optional
import numpy as np
import openai
# Import LEANN modules - this will bring in the modified faiss
from leann import LeannChat, LeannSearcher
# Import LEANN's modified faiss directly
from leann_backend_hnsw import faiss
import sys

# llm_utils.py lives one directory up (benchmarks/); add it to sys.path so the import
# also works when this file is run directly as a script.
sys.path.append(os.path.join(os.path.dirname(os.path.abspath(__file__)), ".."))
from llm_utils import evaluate_rag, generate_hf, generate_vllm, load_hf_model, load_vllm_model
# Setup logging to reduce verbose output
logging.basicConfig(level=logging.WARNING)
logging.getLogger("leann.api").setLevel(logging.WARNING)
logging.getLogger("leann_backend_hnsw").setLevel(logging.WARNING)
class RecallEvaluator:
"""Stage 2: Evaluate Recall@3 (searcher vs baseline)"""
@@ -125,7 +130,6 @@ class FinanceBenchEvaluator:
def analyze_index_sizes(self) -> dict:
"""Analyze index sizes with and without embeddings"""
from pathlib import Path
print("📏 Analyzing index sizes...")
@@ -136,7 +140,6 @@ class FinanceBenchEvaluator:
sizes = {}
total_with_embeddings = 0
total_without_embeddings = 0
# Core index files
index_file = index_dir / f"{index_name}.index"
@@ -155,28 +158,14 @@ class FinanceBenchEvaluator:
sizes[name] = size_mb
total_with_embeddings += size_mb
# For pruned index calculation, exclude the main index file (contains embeddings)
if name != "index":
total_without_embeddings += size_mb
else:
sizes[name] = 0
# Estimate pruned index size (approximate)
# When embeddings are removed, the main index file becomes much smaller
# Rough estimate: graph structure is ~10-20% of full index size
estimated_pruned_index_size = sizes["index"] * 0.15 # Conservative estimate
total_without_embeddings += estimated_pruned_index_size
sizes["total_with_embeddings"] = total_with_embeddings
sizes["total_without_embeddings"] = total_without_embeddings
sizes["estimated_pruned_index"] = estimated_pruned_index_size
sizes["compression_ratio"] = (
total_without_embeddings / total_with_embeddings if total_with_embeddings > 0 else 0
)
sizes["index_only_mb"] = sizes["index"] # Just the .index file for fair comparison
print(f" 📁 Index with embeddings: {total_with_embeddings:.1f} MB")
print(f" 📁 Estimated pruned index: {total_without_embeddings:.1f} MB")
print(f" 🗜️ Compression ratio: {sizes['compression_ratio']:.2f}x")
print(f" 📁 Total index size: {total_with_embeddings:.1f} MB")
print(f" 📁 Index file only: {sizes['index']:.1f} MB")
return sizes
@@ -185,7 +174,6 @@ class FinanceBenchEvaluator:
print("🏗️ Building compact index from existing passages...")
# Load existing passages from current index
from pathlib import Path
from leann import LeannBuilder
@@ -241,7 +229,6 @@ class FinanceBenchEvaluator:
print("🏗️ Building non-compact index from existing passages...")
# Load existing passages from current index
from pathlib import Path
from leann import LeannBuilder
@@ -555,13 +542,7 @@ Respond with exactly one word: "CORRECT" if the generated answer is factually ac
# Legacy single index analysis (fallback)
if "total_with_embeddings" in timing_metrics and "current_index" not in timing_metrics:
print("\n📏 Index Size Analysis:")
print(
f" Index with embeddings: {timing_metrics.get('total_with_embeddings', 0):.1f} MB"
)
print(
f" Estimated pruned index: {timing_metrics.get('total_without_embeddings', 0):.1f} MB"
)
print(f" Compression ratio: {timing_metrics.get('compression_ratio', 0):.2f}x")
print(f" Total index size: {timing_metrics.get('total_with_embeddings', 0):.1f} MB")
print("\n📊 Accuracy:")
print(f" Accuracy: {timing_metrics.get('accuracy', 0) * 100:.1f}%")
@@ -610,6 +591,10 @@ def main():
parser.add_argument("--baseline-dir", default="baseline", help="Baseline output directory")
parser.add_argument("--openai-api-key", help="OpenAI API key for generation evaluation")
parser.add_argument("--output", help="Save results to JSON file")
parser.add_argument(
"--llm-backend", choices=["openai", "hf", "vllm"], default="openai", help="LLM backend"
)
parser.add_argument("--model-name", default="Qwen3-8B", help="Model name for HF/vLLM")
args = parser.parse_args()
@@ -768,7 +753,9 @@ def main():
print("🚀 Starting Stage 4: Comprehensive evaluation with dual index comparison")
# Use FinanceBench evaluator for QA evaluation
evaluator = FinanceBenchEvaluator(args.index, args.openai_api_key)
evaluator = FinanceBenchEvaluator(
args.index, args.openai_api_key if args.llm_backend == "openai" else None
)
print("📖 Loading FinanceBench dataset...")
data = evaluator.load_dataset(args.dataset)
@@ -802,20 +789,13 @@ def main():
print(
f" Non-compact index: {non_compact_size_metrics['total_with_embeddings']:.1f} MB"
)
_ = (
(
non_compact_size_metrics["total_with_embeddings"]
- compact_size_metrics["total_with_embeddings"]
)
/ compact_size_metrics["total_with_embeddings"]
* 100
)
print("\n📊 Index-only size comparison (.index file only):")
print(f" Compact index: {compact_size_metrics['index_only_mb']:.1f} MB")
print(f" Non-compact index: {non_compact_size_metrics['index_only_mb']:.1f} MB")
# Use index-only size for fair comparison (same as Enron emails)
storage_saving = (
(
non_compact_size_metrics["total_with_embeddings"]
- compact_size_metrics["total_with_embeddings"]
)
/ non_compact_size_metrics["total_with_embeddings"]
(non_compact_size_metrics["index_only_mb"] - compact_size_metrics["index_only_mb"])
/ non_compact_size_metrics["index_only_mb"]
* 100
)
print(f" Storage saving by compact: {storage_saving:.1f}%")
@@ -829,15 +809,58 @@ def main():
non_compact_index_path, args.index, data[:10], complexity=complexity
)
# Step 5: Timing breakdown evaluation WITH recompute (production mode)
# Step 5: Generation evaluation
test_samples = 20
print(f"\n🧪 Testing with first {test_samples} samples for timing analysis")
print(
"\n🔍🤖 Running timing breakdown evaluation (WITH recompute - production mode)..."
)
evaluation_start = time.time()
timing_metrics = evaluator.evaluate_timing_breakdown(data[:test_samples])
evaluation_time = time.time() - evaluation_start
print(f"\n🧪 Testing with first {test_samples} samples for generation analysis")
if args.llm_backend == "openai" and args.openai_api_key:
print("🔍🤖 Running OpenAI-based generation evaluation...")
evaluation_start = time.time()
timing_metrics = evaluator.evaluate_timing_breakdown(data[:test_samples])
evaluation_time = time.time() - evaluation_start
else:
print(
f"🔍🤖 Running {args.llm_backend} generation evaluation with {args.model_name}..."
)
evaluation_start = time.time()
try:
# Load LLM
if args.llm_backend == "hf":
tokenizer, model = load_hf_model(args.model_name)
def llm_func(prompt):
return generate_hf(tokenizer, model, prompt)
else: # vllm
llm, sampling_params = load_vllm_model(args.model_name)
def llm_func(prompt):
return generate_vllm(llm, sampling_params, prompt)
# Simple generation evaluation
queries = [item["question"] for item in data[:test_samples]]
gen_results = evaluate_rag(
evaluator.searcher,
llm_func,
queries,
domain="finance",
complexity=complexity,
)
timing_metrics = {
"total_questions": len(queries),
"avg_search_time": gen_results["avg_search_time"],
"avg_generation_time": gen_results["avg_generation_time"],
"results": gen_results["results"],
}
evaluation_time = time.time() - evaluation_start
except Exception as e:
print(f"❌ Generation evaluation failed: {e}")
timing_metrics = {
"total_questions": 0,
"avg_search_time": 0,
"avg_generation_time": 0,
}
evaluation_time = 0
# Combine all metrics
combined_metrics = {
@@ -849,8 +872,11 @@ def main():
"storage_saving_percent": storage_saving,
}
# Print comprehensive results
evaluator._print_results(combined_metrics)
# Print results
print("\n📊 Generation Results:")
print(f" Total Questions: {timing_metrics.get('total_questions', 0)}")
print(f" Avg Search Time: {timing_metrics.get('avg_search_time', 0):.3f}s")
print(f" Avg Generation Time: {timing_metrics.get('avg_generation_time', 0):.3f}s")
# Save results if requested
if args.output:

View File

@@ -1,6 +1,6 @@
# LAION Multimodal Benchmark
A multimodal benchmark for evaluating image retrieval performance using LEANN with CLIP embeddings on LAION dataset subset.
A multimodal benchmark for evaluating image retrieval and generation with LEANN, using CLIP embeddings for retrieval and Qwen2.5-VL for multimodal generation, on a LAION dataset subset.
## Overview
@@ -9,6 +9,7 @@ This benchmark evaluates:
- **Recall@K performance** for image search
- **Complexity analysis** across different search parameters
- **Index size and storage efficiency**
- **Multimodal generation** with Qwen2.5-VL for image understanding and description
## Dataset Configuration
@@ -39,9 +40,13 @@ This will:
python evaluate_laion.py --index data/laion_index.leann
# Run specific stages
python evaluate_laion.py --index data/laion_index.leann --stage timing
python evaluate_laion.py --index data/laion_index.leann --stage recall
python evaluate_laion.py --index data/laion_index.leann --stage complexity
python evaluate_laion.py --index data/laion_index.leann --stage 2 # Recall evaluation
python evaluate_laion.py --index data/laion_index.leann --stage 3 # Complexity analysis
python evaluate_laion.py --index data/laion_index.leann --stage 4 # Index comparison
python evaluate_laion.py --index data/laion_index.leann --stage 5 # Multimodal generation
# Multimodal generation with Qwen2.5-VL
python evaluate_laion.py --index data/laion_index.leann --stage 5 --model-name Qwen/Qwen2.5-VL-7B-Instruct
```
### 3. Save results
@@ -74,23 +79,26 @@ python evaluate_laion.py \
## Evaluation Stages
### Stage 1: Index Analysis
- Analyzes index file sizes and metadata
- Reports storage efficiency
### Stage 2: Search Timing
- Measures average search latency
- Tests with configurable complexity and top-k
- Reports searches per second
### Stage 3: Recall Evaluation
- Evaluates Recall@K using ground truth
### Stage 2: Recall Evaluation
- Evaluates Recall@3 for multimodal retrieval
- Compares LEANN vs FAISS baseline performance
- Self-recall: query caption should retrieve original image
### Stage 4: Complexity Analysis
- Tests performance across different complexity levels [16, 32, 64, 128]
### Stage 3: Complexity Analysis
- Binary search for optimal complexity (90% recall target)
- Tests performance across different complexity levels
- Analyzes speed vs. accuracy tradeoffs
### Stage 4: Index Comparison
- Compares compact vs non-compact index sizes
- Measures search performance differences
- Reports storage efficiency and speed ratios
### Stage 5: Multimodal Generation
- Uses Qwen2.5-VL for image understanding and description
- Retrieval-Augmented Generation (RAG) with multimodal context
- Measures both search and generation timing
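A condensed sketch of the Stage 5 flow (illustrative; it reuses `load_qwen_vl_model` and `evaluate_multimodal_rag` from `benchmarks/llm_utils.py`, with the index path and caption as placeholders):

```python
from leann import LeannSearcher

from llm_utils import evaluate_multimodal_rag, load_qwen_vl_model  # benchmarks/llm_utils.py

searcher = LeannSearcher("data/laion_index.leann")  # placeholder path from the Quickstart
processor, model = load_qwen_vl_model("Qwen/Qwen2.5-VL-7B-Instruct")

captions = ["a red vintage car parked by the sea"]  # invented example caption
stats = evaluate_multimodal_rag(
    searcher, captions, processor=processor, model=model, complexity=85
)
print(stats["avg_search_time"], stats["avg_generation_time"])
```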
## Output Metrics
### Timing Metrics
@@ -100,48 +108,70 @@ python evaluate_laion.py \
- Latency in milliseconds
### Recall Metrics
- Recall@K percentage
- Recall@3 percentage for image retrieval
- Number of queries with ground truth
### Index Metrics
- Total index size (MB)
- Component breakdown (index, passages, metadata)
- Storage savings (compact vs non-compact)
- Backend and embedding model info
## Example Results
### Generation Metrics (Stage 5)
- Average search time per query
- Average generation time per query
- Time distribution (search vs generation)
- Sample multimodal responses
- Model: Qwen2.5-VL performance
## Benchmark Results
### LEANN-RAG Performance (CLIP ViT-L/14 + Qwen2.5-VL)
**Stage 3: Optimal Complexity Analysis**
- **Optimal Complexity**: 85 (achieving 90% Recall@3)
- **Binary Search Range**: 1-128
- **Target Recall**: 90%
- **Index Type**: Non-compact (for fast binary search)
**Stage 5: Multimodal Generation Performance (Qwen2.5-VL)**
- **Total Queries**: 20
- **Average Search Time**: 1.200s per query
- **Average Generation Time**: 6.558s per query
- **Time Distribution**: Search 15.5%, Generation 84.5%
- **LLM Backend**: HuggingFace transformers
- **Model**: Qwen/Qwen2.5-VL-7B-Instruct
- **Optimal Complexity**: 85
**System Performance:**
- **Index Contents**: ~10,000 image embeddings from the LAION subset
- **Embedding Model**: CLIP ViT-L/14 (768 dimensions)
- **Backend**: HNSW with cosine distance
### Example Results
```
🎯 LAION MULTIMODAL BENCHMARK RESULTS
============================================================
📏 Index Information:
Total size: 145.2 MB
Backend: hnsw
Embedding model: clip-vit-b-32
Total passages: 10000
📊 Multimodal Generation Results:
Total Queries: 20
Avg Search Time: 1.200s
Avg Generation Time: 6.558s
Time Distribution: Search 15.5%, Generation 84.5%
LLM Backend: HuggingFace transformers
Model: Qwen/Qwen2.5-VL-7B-Instruct
⚡ Search Performance:
Total queries: 200
Average search time: 0.023s
Median search time: 0.021s
Min/Max search time: 0.012s / 0.089s
Std dev: 0.008s
Complexity: 64
Top-K: 3
📊 Recall Performance:
Recall@3: 85.5%
Queries with ground truth: 200
⚙️ Complexity Analysis:
Complexity 16: 0.015s avg
Complexity 32: 0.019s avg
Complexity 64: 0.023s avg
Complexity 128: 0.031s avg
⚙️ Optimal Complexity Analysis:
Target Recall: 90%
Optimal Complexity: 85
Binary Search Range: 1-128
Non-compact Index (fast search, no recompute)
🚀 Performance Summary:
Searches per second: 43.5
Latency (ms): 23.0ms
Multimodal RAG: 7.758s total per query
Search: 15.5% of total time
Generation: 84.5% of total time
```
## Directory Structure

View File

@@ -4,6 +4,7 @@ LAION Multimodal Benchmark Evaluation Script - Modular Recall-based Evaluation
import argparse
import json
import logging
import os
import pickle
import time
@@ -14,6 +15,13 @@ from leann import LeannSearcher
from leann_backend_hnsw import faiss
from sentence_transformers import SentenceTransformer
import sys

# llm_utils.py lives one directory up (benchmarks/); add it to sys.path so the import
# also works when this file is run directly as a script.
sys.path.append(os.path.join(os.path.dirname(os.path.abspath(__file__)), ".."))
from llm_utils import evaluate_multimodal_rag, load_qwen_vl_model
# Setup logging to reduce verbose output
logging.basicConfig(level=logging.WARNING)
logging.getLogger("leann.api").setLevel(logging.WARNING)
logging.getLogger("leann_backend_hnsw").setLevel(logging.WARNING)
class RecallEvaluator:
"""Stage 2: Evaluate Recall@3 (LEANN vs FAISS baseline for multimodal retrieval)"""
@@ -388,13 +396,22 @@ def main():
)
parser.add_argument(
"--stage",
choices=["2", "3", "4", "all"],
choices=["2", "3", "4", "5", "all"],
default="all",
help="Which stage to run (2=recall, 3=complexity, 4=index comparison)",
help="Which stage to run (2=recall, 3=complexity, 4=index comparison, 5=generation)",
)
parser.add_argument("--complexity", type=int, default=None, help="Complexity for search")
parser.add_argument("--baseline-dir", default="baseline", help="Baseline output directory")
parser.add_argument("--output", help="Save results to JSON file")
parser.add_argument(
"--llm-backend",
choices=["hf"],
default="hf",
help="LLM backend (Qwen2.5-VL only supports HF)",
)
parser.add_argument(
"--model-name", default="Qwen/Qwen2.5-VL-7B-Instruct", help="Multimodal model name"
)
args = parser.parse_args()
@@ -615,12 +632,69 @@ def main():
evaluator.cleanup()
print("✅ Stage 4 completed!\n")
if args.stage in ("5", "all"):
print("🚀 Starting Stage 5: Multimodal generation with Qwen2.5-VL")
evaluator = LAIONEvaluator(args.index)
captions = evaluator.load_queries(args.queries)
test_captions = captions[: min(20, len(captions))] # Use subset for generation
print(f"🧪 Testing multimodal generation with {len(test_captions)} queries")
# Load Qwen2.5-VL model
try:
print("Loading Qwen2.5-VL model...")
processor, model = load_qwen_vl_model(args.model_name)
# Run multimodal generation evaluation
complexity = args.complexity or 64
gen_results = evaluate_multimodal_rag(
evaluator.searcher,
test_captions,
processor=processor,
model=model,
complexity=complexity,
)
print("\n📊 Multimodal Generation Results:")
print(f" Total Queries: {len(test_captions)}")
print(f" Avg Search Time: {gen_results['avg_search_time']:.3f}s")
print(f" Avg Generation Time: {gen_results['avg_generation_time']:.3f}s")
total_time = gen_results["avg_search_time"] + gen_results["avg_generation_time"]
search_pct = (gen_results["avg_search_time"] / total_time) * 100
gen_pct = (gen_results["avg_generation_time"] / total_time) * 100
print(f" Time Distribution: Search {search_pct:.1f}%, Generation {gen_pct:.1f}%")
print(" LLM Backend: HuggingFace transformers")
print(f" Model: {args.model_name}")
# Show sample results
print("\n📝 Sample Multimodal Generations:")
for i, response in enumerate(gen_results["results"][:3]):
# Handle both string and dict formats for captions
if isinstance(test_captions[i], dict):
caption_text = test_captions[i].get("query", str(test_captions[i]))
else:
caption_text = str(test_captions[i])
print(f" Query {i + 1}: {caption_text[:60]}...")
print(f" Response {i + 1}: {response[:100]}...")
print()
except Exception as e:
print(f"❌ Multimodal generation evaluation failed: {e}")
print("💡 Make sure transformers and Qwen2.5-VL are installed")
import traceback
traceback.print_exc()
evaluator.cleanup()
print("✅ Stage 5 completed!\n")
if args.stage == "all":
print("🎉 All evaluation stages completed successfully!")
print("\n📋 Summary:")
print(" Stage 2: ✅ Multimodal Recall@3 evaluation completed")
print(" Stage 3: ✅ Optimal complexity found")
print(" Stage 4: ✅ Index comparison analysis completed")
print(" Stage 5: ✅ Multimodal generation evaluation completed")
print("\n🔧 Recommended next steps:")
print(" - Use optimal complexity for best speed/accuracy balance")
print(" - Review index comparison for storage vs performance tradeoffs")

benchmarks/llm_utils.py (new file, 301 lines)
View File

@@ -0,0 +1,301 @@
"""
LLM utils for RAG benchmarks with Qwen3-8B and Qwen2.5-VL (multimodal)
"""
import time
try:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
HF_AVAILABLE = True
except ImportError:
HF_AVAILABLE = False
try:
from vllm import LLM, SamplingParams
VLLM_AVAILABLE = True
except ImportError:
VLLM_AVAILABLE = False
def is_qwen3_model(model_name):
"""Check if model is Qwen3"""
return "Qwen3" in model_name or "qwen3" in model_name.lower()
def is_qwen_vl_model(model_name):
"""Check if model is Qwen2.5-VL"""
return "Qwen2.5-VL" in model_name or "qwen2.5-vl" in model_name.lower()
def apply_qwen3_chat_template(tokenizer, prompt):
"""Apply Qwen3 chat template with thinking enabled"""
messages = [{"role": "user", "content": prompt}]
return tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True,
)
def extract_thinking_answer(response):
"""Extract final answer from Qwen3 thinking model response"""
if "<think>" in response and "</think>" in response:
try:
think_end = response.index("</think>") + len("</think>")
final_answer = response[think_end:].strip()
return final_answer
except (ValueError, IndexError):
pass
return response.strip()
def load_hf_model(model_name="Qwen/Qwen3-8B"):
"""Load HuggingFace model"""
if not HF_AVAILABLE:
raise ImportError("transformers not available")
print(f"Loading HF: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
device_map="auto",
trust_remote_code=True,
)
return tokenizer, model
def load_vllm_model(model_name="Qwen/Qwen3-8B"):
"""Load vLLM model"""
if not VLLM_AVAILABLE:
raise ImportError("vllm not available")
print(f"Loading vLLM: {model_name}")
llm = LLM(model=model_name, trust_remote_code=True)
# Qwen3 specific config
if is_qwen3_model(model_name):
stop_tokens = ["<|im_end|>", "<|end_of_text|>"]
max_tokens = 2048
else:
stop_tokens = None
max_tokens = 1024
sampling_params = SamplingParams(temperature=0.7, max_tokens=max_tokens, stop=stop_tokens)
return llm, sampling_params
def generate_hf(tokenizer, model, prompt, max_tokens=None):
"""Generate with HF - supports Qwen3 thinking models"""
model_name = getattr(model, "name_or_path", "unknown")
is_qwen3 = is_qwen3_model(model_name)
# Apply chat template for Qwen3
if is_qwen3:
prompt = apply_qwen3_chat_template(tokenizer, prompt)
max_tokens = max_tokens or 2048
else:
max_tokens = max_tokens or 1024
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_tokens,
temperature=0.7,
do_sample=True,
pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens; slicing the decoded text by len(prompt)
# is unreliable once the chat template is applied and special tokens are stripped
generated_ids = outputs[0][inputs["input_ids"].shape[1] :]
response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
# Extract final answer for thinking models
if is_qwen3:
return extract_thinking_answer(response)
return response
def generate_vllm(llm, sampling_params, prompt):
"""Generate with vLLM - supports Qwen3 thinking models"""
outputs = llm.generate([prompt], sampling_params)
response = outputs[0].outputs[0].text.strip()
# Extract final answer for Qwen3 thinking models
model_name = str(llm.llm_engine.model_config.model)
if is_qwen3_model(model_name):
return extract_thinking_answer(response)
return response
def create_prompt(context, query, domain="default"):
"""Create RAG prompt"""
if domain == "emails":
return f"Email content:\n{context}\n\nQuestion: {query}\n\nAnswer:"
elif domain == "finance":
return f"Financial content:\n{context}\n\nQuestion: {query}\n\nAnswer:"
elif domain == "multimodal":
return f"Image context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
else:
return f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
def evaluate_rag(searcher, llm_func, queries, domain="default", top_k=3, complexity=64):
"""Simple RAG evaluation with timing"""
search_times = []
gen_times = []
results = []
for i, query in enumerate(queries):
# Search
start = time.time()
docs = searcher.search(query, top_k=top_k, complexity=complexity)
search_time = time.time() - start
# Generate
context = "\n\n".join([doc.text for doc in docs])
prompt = create_prompt(context, query, domain)
start = time.time()
response = llm_func(prompt)
gen_time = time.time() - start
search_times.append(search_time)
gen_times.append(gen_time)
results.append(response)
if i < 3:
print(f"Q{i + 1}: Search={search_time:.3f}s, Gen={gen_time:.3f}s")
return {
"avg_search_time": sum(search_times) / len(search_times),
"avg_generation_time": sum(gen_times) / len(gen_times),
"results": results,
}
def load_qwen_vl_model(model_name="Qwen/Qwen2.5-VL-7B-Instruct"):
"""Load Qwen2.5-VL multimodal model"""
if not HF_AVAILABLE:
raise ImportError("transformers not available")
print(f"Loading Qwen2.5-VL: {model_name}")
try:
from transformers import AutoModelForVision2Seq, AutoProcessor
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
return processor, model
except Exception as e:
print(f"Failed to load with AutoModelForVision2Seq, trying specific class: {e}")
# Fallback to specific class
try:
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = Qwen2VLForConditionalGeneration.from_pretrained(
model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
return processor, model
except Exception as e2:
raise ImportError(f"Failed to load Qwen2.5-VL model: {e2}")
def generate_qwen_vl(processor, model, prompt, image_path=None, max_tokens=512):
"""Generate with Qwen2.5-VL multimodal model"""
from PIL import Image
# Prepare inputs
if image_path:
image = Image.open(image_path)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
else:
inputs = processor(text=prompt, return_tensors="pt").to(model.device)
# Generate
with torch.no_grad():
generated_ids = model.generate(
**inputs, max_new_tokens=max_tokens, do_sample=False, temperature=0.1
)
# Decode response
generated_ids = generated_ids[:, inputs["input_ids"].shape[1] :]
response = processor.decode(generated_ids[0], skip_special_tokens=True)
return response
def create_multimodal_prompt(context, query, image_descriptions, task_type="images"):
"""Create prompt for multimodal RAG"""
if task_type == "images":
return f"""Based on the retrieved images and their descriptions, answer the following question.
Retrieved Image Descriptions:
{context}
Question: {query}
Provide a detailed answer based on the visual content described above."""
return f"Context: {context}\nQuestion: {query}\nAnswer:"
def evaluate_multimodal_rag(searcher, queries, processor=None, model=None, complexity=64):
"""Evaluate multimodal RAG with Qwen2.5-VL"""
search_times = []
gen_times = []
results = []
for i, query_item in enumerate(queries):
# Handle both string and dict formats for queries
if isinstance(query_item, dict):
query = query_item.get("query", "")
image_path = query_item.get("image_path") # Optional reference image
else:
query = str(query_item)
image_path = None
# Search
start_time = time.time()
search_results = searcher.search(query, top_k=3, complexity=complexity)
search_time = time.time() - start_time
search_times.append(search_time)
# Prepare context from search results
context_parts = []
for result in search_results:
context_parts.append(f"- {result.text}")
context = "\n".join(context_parts)
# Generate with multimodal model
start_time = time.time()
if processor and model:
prompt = create_multimodal_prompt(context, query, context_parts)
response = generate_qwen_vl(processor, model, prompt, image_path)
else:
response = f"Context: {context}"
gen_time = time.time() - start_time
gen_times.append(gen_time)
results.append(response)
if i < 3:
print(f"Q{i + 1}: Search={search_time:.3f}s, Gen={gen_time:.3f}s")
return {
"avg_search_time": sum(search_times) / len(search_times),
"avg_generation_time": sum(gen_times) / len(gen_times),
"results": results,
}
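For orientation, a minimal end-to-end usage sketch of these helpers (illustrative; the index path and question are placeholders, and it mirrors what the Enron Stage 5 code above does rather than adding new behavior):

```python
from leann import LeannSearcher

from llm_utils import create_prompt, generate_hf, load_hf_model

searcher = LeannSearcher("data/enron_index_hnsw.leann")  # placeholder index path
tokenizer, model = load_hf_model("Qwen/Qwen3-8B")

question = "Who scheduled the gas storage meeting?"  # invented example query
docs = searcher.search(question, top_k=3, complexity=88, recompute_embeddings=True)
context = "\n\n".join(doc.text for doc in docs)
prompt = create_prompt(context, question, domain="emails")
answer = generate_hf(tokenizer, model, prompt)  # <think> block is already stripped
print(answer)
```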