docs: data updated

Andy Lee
2025-09-15 19:50:02 -07:00
parent d7011bbea0
commit a0d6857faa
9 changed files with 749 additions and 133 deletions


@@ -1,18 +1,19 @@
# Enron Emails Benchmark
A comprehensive RAG benchmark for evaluating LEANN search and generation on the Enron email corpus. It mirrors the structure and CLI of the existing FinanceBench and LAION benches, using stage-based evaluation with Recall@3 and generation timing.
- Dataset: Enron email CSV (e.g., Kaggle wcukierski/enron-email-dataset) for passages
- Queries: corbt/enron_emails_sample_questions (filtered for realistic questions)
- Metrics: Recall@3 vs FAISS Flat baseline + generation evaluation with Qwen3-8B
## Layout
benchmarks/enron_emails/
- setup_enron_emails.py: Prepare passages, build LEANN index, build FAISS baseline
- evaluate_enron_emails.py: Evaluate retrieval recall (Stages 2-5) + generation with Qwen3-8B
- data/: Generated passages, queries, embeddings-related files
- baseline/: FAISS Flat baseline files
- llm_utils.py: LLM utilities for Qwen3-8B generation (in parent directory)
## Quickstart
@@ -41,23 +42,33 @@ Stage 3 uses binary search over complexity to find the minimal value achieving t
4) Index comparison (Stage 4)
python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 4 --complexity 88 --max-queries 100 --output results.json
5) Generation evaluation (Stage 5)
python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 5 --complexity 88 --llm-backend hf --model-name Qwen/Qwen3-8B
6) Combined index + generation evaluation (Stages 4+5, recommended)
python evaluate_enron_emails.py --index data/enron_index_hnsw.leann --stage 45 --complexity 88 --llm-backend hf
Notes:
- Minimal CLI: you can run from repo root with only `--index`, defaults match financebench/laion patterns:
- `--stage` defaults to `all` (runs 2, 3, 4, 5)
- `--baseline-dir` defaults to `baseline`
- `--queries` defaults to `data/evaluation_queries.jsonl` (or falls back to the index directory)
- `--llm-backend` defaults to `hf` (HuggingFace), can use `vllm`
- `--model-name` defaults to `Qwen/Qwen3-8B`
- Fail-fast behavior: no silent fallbacks. If compact index cannot run with recompute, it errors out.
- Stage 5 requires Stage 4 retrieval results. Use `--stage 45` to run both efficiently.
Optional flags:
- --queries data/evaluation_queries.jsonl (custom queries file)
- --baseline-dir baseline (where FAISS baseline lives)
- --complexity 88 (LEANN complexity parameter, optimal for 90% recall)
- --llm-backend hf|vllm (LLM backend for generation)
- --model-name Qwen/Qwen3-8B (LLM model for generation)
- --max-queries 1000 (limit number of queries for evaluation)
## Files Produced
- data/enron_passages_preview.jsonl: Small preview of passages used (for inspection)
@@ -66,8 +77,9 @@ Optional flags:
- data/evaluation_queries.jsonl: Query file (id + query; includes GT IDs for reference; see the illustrative line below)
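An illustrative line (the field names here are assumptions for illustration, not the exact schema):
```json
{"id": "q_0001", "query": "What was discussed about the gas desk schedule?", "gt_ids": ["<message-id>"]}
```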
## Notes
- Evaluates both retrieval Recall@3 and generation timing with the Qwen3-8B thinking model.
- The emails CSV must contain a column named "message" (raw RFC822 email) and a column named "file" for source identifier. Message-ID headers are parsed as canonical message IDs when present.
- Qwen3-8B is a thinking model: prompts must go through its chat template with thinking enabled, and the <think></think> reasoning block must be stripped from responses (see the sketch below).
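For reference, a minimal sketch of that handling, mirroring `apply_qwen3_chat_template` and `extract_thinking_answer` in `benchmarks/llm_utils.py`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

def qwen3_prompt(user_prompt: str) -> str:
    # Qwen3 expects its chat template with thinking enabled
    messages = [{"role": "user", "content": user_prompt}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
    )

def strip_thinking(response: str) -> str:
    # Qwen3 emits "<think>...</think>" before the final answer; keep only the answer
    if "</think>" in response:
        return response.split("</think>", 1)[1].strip()
    return response.strip()
```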
## Stages Summary
@@ -80,16 +92,23 @@ Optional flags:
- Stage 4 (Index Comparison):
- Reports .index-only sizes for compact vs non-compact.
- Measures search timings per query: non-compact (no recompute) vs compact (with recompute).
- Stores retrieval results for Stage 5 generation evaluation.
- Fails fast if compact recompute cannot run.
- If `--complexity` is not provided, the script tries to use the best complexity from Stage 3:
- First from the current run (when running `--stage all`), otherwise
- From `enron_stage3_results.json` saved next to the index during the last Stage 3 run.
- If neither exists, Stage 4 will error and ask you to run Stage 3 or pass `--complexity`.
- Stage 5 (Generation Evaluation):
- Uses Qwen3-8B thinking model for RAG generation on retrieved documents from Stage 4.
- Supports HuggingFace (`hf`) and vLLM (`vllm`) backends.
- Measures generation timing separately from search timing.
- Requires Stage 4 results (no additional searching performed); each stored record has the shape shown below.
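Concretely, each record Stage 4 hands to Stage 5 has this shape (ids and text come from the LeannSearcher results; values illustrative):

```json
{
  "query": "...",
  "retrieved_docs": [
    {"id": "<message-id>", "text": "..."},
    {"id": "<message-id>", "text": "..."},
    {"id": "<message-id>", "text": "..."}
  ]
}
```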
## Example Results
These are sample results obtained on Enron data using all-mpnet-base-v2 and Qwen3-8B.
- Stage 3 (Binary Search):
- Minimal complexity achieving 90% Recall@3: 88
@@ -103,14 +122,20 @@ These are sample results obtained on a subset of Enron data using all-mpnet-base
- C=256 → 92.0% Recall@3
- Stage 4 (Index Sizes, .index only):
- Compact: ~2.2 MB
- Non-compact: ~82.0 MB
- Storage saving by compact: ~97.3%
- Stage 4 (Search Timing, 988 queries, complexity=88):
- Non-compact (no recompute): ~0.0075 s avg per query
- Compact (with recompute): ~1.981 s avg per query
- Speed ratio (non-compact/compact): ~0.0038x
- Stage 5 (RAG Generation, 988 queries, Qwen3-8B):
- Average generation time: ~22.302 s per query
- Total queries processed: 988
- LLM backend: HuggingFace transformers
- Model: Qwen/Qwen3-8B (thinking model with <think></think> processing)
Full JSON output is saved by the script (see `--output`), e.g.:
`benchmarks/enron_emails/results_enron_stage45.json`.
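The Stage 5 portion of that JSON has roughly this shape (keys from evaluate_enron_emails.py; values truncated and illustrative):

```json
{
  "stage5": {
    "total_queries": 988,
    "avg_generation_time": 22.302,
    "generation_times": [23.1, 21.8],
    "responses": ["..."]
  }
}
```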


@@ -7,13 +7,22 @@ On errors, fail fast without fallbacks.
import argparse
import json
import logging
import os
import pickle
from pathlib import Path
import numpy as np
from leann import LeannBuilder, LeannSearcher
from leann_backend_hnsw import faiss
from ..llm_utils import generate_hf, generate_vllm, load_hf_model, load_vllm_model
# Setup logging to reduce verbose output
logging.basicConfig(level=logging.WARNING)
logging.getLogger("leann.api").setLevel(logging.WARNING)
logging.getLogger("leann_backend_hnsw").setLevel(logging.WARNING)
class RecallEvaluator:
"""Stage 2: Evaluate Recall@3 (LEANN vs FAISS)"""
@@ -119,7 +128,6 @@ class EnronEvaluator:
def analyze_index_sizes(self) -> dict:
"""Analyze index sizes (.index only), similar to LAION bench."""
print("📏 Analyzing index sizes (.index only)...")
index_path = Path(self.index_path)
@@ -150,7 +158,6 @@ class EnronEvaluator:
def create_non_compact_index_for_comparison(self, non_compact_index_path: str) -> dict:
"""Create a non-compact index for comparison using current passages and embeddings."""
current_index_path = Path(self.index_path)
current_index_dir = current_index_path.parent
@@ -230,6 +237,7 @@ class EnronEvaluator:
"compact": {"search_times": []}, "compact": {"search_times": []},
"avg_search_times": {}, "avg_search_times": {},
"speed_ratio": 0.0, "speed_ratio": 0.0,
"retrieval_results": [], # Store retrieval results for Stage 5
}
print("⚡ Comparing search performance between indexes...")
@@ -248,10 +256,15 @@ class EnronEvaluator:
compact_searcher = LeannSearcher(compact_path)
for q in test_queries:
t0 = time.time()
docs = compact_searcher.search(
q, top_k=3, complexity=complexity, recompute_embeddings=True
)
results["compact"]["search_times"].append(time.time() - t0)
# Store retrieval results for Stage 5
results["retrieval_results"].append(
{"query": q, "retrieved_docs": [{"id": doc.id, "text": doc.text} for doc in docs]}
)
compact_searcher.cleanup()
if results["non_compact"]["search_times"]:
@@ -358,9 +371,9 @@ def main():
)
parser.add_argument(
"--stage",
choices=["2", "3", "4", "5", "all", "45"],
default="all",
help="Which stage to run (2=recall, 3=complexity, 4=index comparison, 5=generation)",
)
parser.add_argument("--complexity", type=int, default=None, help="LEANN search complexity")
parser.add_argument("--baseline-dir", default="baseline", help="Baseline output directory")
@@ -371,6 +384,8 @@ def main():
"--target-recall", type=float, default=0.90, help="Target Recall@3 for Stage 3" "--target-recall", type=float, default=0.90, help="Target Recall@3 for Stage 3"
) )
parser.add_argument("--output", help="Save results to JSON file") parser.add_argument("--output", help="Save results to JSON file")
parser.add_argument("--llm-backend", choices=["hf", "vllm"], default="hf", help="LLM backend")
parser.add_argument("--model-name", default="Qwen/Qwen3-8B", help="Model name")
args = parser.parse_args()
@@ -438,7 +453,7 @@ def main():
enron_eval.cleanup()
print("✅ Stage 3 completed!\n")
if args.stage in ("4", "all", "45"):
print("🚀 Starting Stage 4: Index size + performance comparison")
evaluator = RecallEvaluator(args.index, args.baseline_dir)
enron_eval = EnronEvaluator(args.index)
@@ -503,6 +518,92 @@ def main():
enron_eval.cleanup()
print("✅ Stage 4 completed!\n")
if args.stage in ("5", "all"):
print("🚀 Starting Stage 5: Generation evaluation with Qwen3-8B")
# Check if Stage 4 results exist
if "stage4" not in results_out or "performance_comparison" not in results_out["stage4"]:
print("❌ Stage 5 requires Stage 4 retrieval results")
print("💡 Run Stage 4 first or use --stage all")
raise SystemExit(1)
retrieval_results = results_out["stage4"]["performance_comparison"]["retrieval_results"]
if not retrieval_results:
print("❌ No retrieval results found from Stage 4")
raise SystemExit(1)
print(f"📁 Using {len(retrieval_results)} retrieval results from Stage 4")
# Load LLM
try:
if args.llm_backend == "hf":
tokenizer, model = load_hf_model(args.model_name)
def llm_func(prompt):
return generate_hf(tokenizer, model, prompt)
else: # vllm
llm, sampling_params = load_vllm_model(args.model_name)
def llm_func(prompt):
return generate_vllm(llm, sampling_params, prompt)
# Run generation using stored retrieval results
import time
from ..llm_utils import create_prompt
generation_times = []
responses = []
print("🤖 Running generation on pre-retrieved results...")
for i, item in enumerate(retrieval_results):
query = item["query"]
retrieved_docs = item["retrieved_docs"]
# Prepare context from retrieved docs
context = "\n\n".join([doc["text"] for doc in retrieved_docs])
prompt = create_prompt(context, query, "emails")
# Time generation only
gen_start = time.time()
response = llm_func(prompt)
gen_time = time.time() - gen_start
generation_times.append(gen_time)
responses.append(response)
if i < 3:
print(f" Q{i + 1}: Gen={gen_time:.3f}s")
avg_gen_time = sum(generation_times) / len(generation_times)
print("\n📊 Generation Results:")
print(f" Total Queries: {len(retrieval_results)}")
print(f" Avg Generation Time: {avg_gen_time:.3f}s")
print(" (Search time from Stage 4)")
results_out["stage5"] = {
"total_queries": len(retrieval_results),
"avg_generation_time": avg_gen_time,
"generation_times": generation_times,
"responses": responses,
}
# Show sample results
print("\n📝 Sample Results:")
for i in range(min(3, len(retrieval_results))):
query = retrieval_results[i]["query"]
response = responses[i]
print(f" Q{i + 1}: {query[:60]}...")
print(f" A{i + 1}: {response[:100]}...")
print()
except Exception as e:
print(f"❌ Generation evaluation failed: {e}")
print("💡 Make sure transformers/vllm is installed and model is available")
print("✅ Stage 5 completed!\n")
if args.output and results_out:
with open(args.output, "w", encoding="utf-8") as f:
json.dump(results_out, f, indent=2)


@@ -45,9 +45,9 @@ This will:
# Basic retrieval evaluation
python evaluate_financebench.py --index data/index/financebench_full_hnsw.leann
# RAG generation evaluation with Qwen3-8B
python evaluate_financebench.py --index data/index/financebench_full_hnsw.leann --stage 4 --complexity 64 --llm-backend hf --model-name Qwen/Qwen3-8B --output results_qwen3.json
```
## Evaluation Methods
@@ -85,6 +85,24 @@ LLM-based answer evaluation using GPT-4o:
*Note: Number match rate >100% indicates multiple retrieved documents contain the same financial figures, which is expected behavior for financial data appearing across multiple document sections.
### LEANN-RAG Generation Performance (Qwen3-8B)
- **Stage 4 (Index Comparison):**
- Compact Index: 5.0 MB
- Non-compact Index: 172.2 MB
- **Storage Saving**: 97.1%
- **Search Performance**:
- Non-compact (no recompute): 0.009s avg per query
- Compact (with recompute): 2.203s avg per query
- Speed ratio: 0.004x
**Generation Evaluation (20 queries, complexity=64):**
- **Average Search Time**: 1.638s per query
- **Average Generation Time**: 45.957s per query
- **LLM Backend**: HuggingFace transformers
- **Model**: Qwen/Qwen3-8B (thinking model with <think></think> processing)
- **Total Questions Processed**: 20
## Options
```bash


@@ -4,20 +4,25 @@ FinanceBench Evaluation Script - Modular Recall-based Evaluation
import argparse
import json
import logging
import os
import pickle
import time
from pathlib import Path
from typing import Optional
import numpy as np
import openai
from leann import LeannChat, LeannSearcher
from leann_backend_hnsw import faiss
from ..llm_utils import evaluate_rag, generate_hf, generate_vllm, load_hf_model, load_vllm_model
# Setup logging to reduce verbose output
logging.basicConfig(level=logging.WARNING)
logging.getLogger("leann.api").setLevel(logging.WARNING)
logging.getLogger("leann_backend_hnsw").setLevel(logging.WARNING)
class RecallEvaluator:
"""Stage 2: Evaluate Recall@3 (searcher vs baseline)"""
@@ -125,7 +130,6 @@ class FinanceBenchEvaluator:
def analyze_index_sizes(self) -> dict:
"""Analyze index sizes with and without embeddings"""
print("📏 Analyzing index sizes...")
@@ -136,7 +140,6 @@ class FinanceBenchEvaluator:
sizes = {}
total_with_embeddings = 0
# Core index files
index_file = index_dir / f"{index_name}.index"
@@ -155,28 +158,14 @@ class FinanceBenchEvaluator:
sizes[name] = size_mb
total_with_embeddings += size_mb
else:
sizes[name] = 0
sizes["total_with_embeddings"] = total_with_embeddings
sizes["index_only_mb"] = sizes["index"]  # Just the .index file for fair comparison
print(f" 📁 Total index size: {total_with_embeddings:.1f} MB")
print(f" 📁 Index file only: {sizes['index']:.1f} MB")
return sizes
@@ -185,7 +174,6 @@ class FinanceBenchEvaluator:
print("🏗️ Building compact index from existing passages...") print("🏗️ Building compact index from existing passages...")
# Load existing passages from current index # Load existing passages from current index
from pathlib import Path
from leann import LeannBuilder from leann import LeannBuilder
@@ -241,7 +229,6 @@ class FinanceBenchEvaluator:
print("🏗️ Building non-compact index from existing passages...") print("🏗️ Building non-compact index from existing passages...")
# Load existing passages from current index # Load existing passages from current index
from pathlib import Path
from leann import LeannBuilder from leann import LeannBuilder
@@ -555,13 +542,7 @@ Respond with exactly one word: "CORRECT" if the generated answer is factually ac
# Legacy single index analysis (fallback)
if "total_with_embeddings" in timing_metrics and "current_index" not in timing_metrics:
print("\n📏 Index Size Analysis:")
print(f" Total index size: {timing_metrics.get('total_with_embeddings', 0):.1f} MB")
print("\n📊 Accuracy:")
print(f" Accuracy: {timing_metrics.get('accuracy', 0) * 100:.1f}%")
@@ -610,6 +591,10 @@ def main():
parser.add_argument("--baseline-dir", default="baseline", help="Baseline output directory") parser.add_argument("--baseline-dir", default="baseline", help="Baseline output directory")
parser.add_argument("--openai-api-key", help="OpenAI API key for generation evaluation") parser.add_argument("--openai-api-key", help="OpenAI API key for generation evaluation")
parser.add_argument("--output", help="Save results to JSON file") parser.add_argument("--output", help="Save results to JSON file")
parser.add_argument(
"--llm-backend", choices=["openai", "hf", "vllm"], default="openai", help="LLM backend"
)
parser.add_argument("--model-name", default="Qwen3-8B", help="Model name for HF/vLLM")
args = parser.parse_args()
@@ -768,7 +753,9 @@ def main():
print("🚀 Starting Stage 4: Comprehensive evaluation with dual index comparison") print("🚀 Starting Stage 4: Comprehensive evaluation with dual index comparison")
# Use FinanceBench evaluator for QA evaluation # Use FinanceBench evaluator for QA evaluation
evaluator = FinanceBenchEvaluator(args.index, args.openai_api_key) evaluator = FinanceBenchEvaluator(
args.index, args.openai_api_key if args.llm_backend == "openai" else None
)
print("📖 Loading FinanceBench dataset...") print("📖 Loading FinanceBench dataset...")
data = evaluator.load_dataset(args.dataset) data = evaluator.load_dataset(args.dataset)
@@ -802,20 +789,13 @@ def main():
print(
f" Non-compact index: {non_compact_size_metrics['total_with_embeddings']:.1f} MB"
)
print("\n📊 Index-only size comparison (.index file only):")
print(f" Compact index: {compact_size_metrics['index_only_mb']:.1f} MB")
print(f" Non-compact index: {non_compact_size_metrics['index_only_mb']:.1f} MB")
# Use index-only size for fair comparison (same as Enron emails)
storage_saving = (
(non_compact_size_metrics["index_only_mb"] - compact_size_metrics["index_only_mb"])
/ non_compact_size_metrics["index_only_mb"]
* 100
)
print(f" Storage saving by compact: {storage_saving:.1f}%")
@@ -829,15 +809,58 @@ def main():
non_compact_index_path, args.index, data[:10], complexity=complexity
)
# Step 5: Generation evaluation
test_samples = 20
print(f"\n🧪 Testing with first {test_samples} samples for generation analysis")
if args.llm_backend == "openai" and args.openai_api_key:
print("🔍🤖 Running OpenAI-based generation evaluation...")
evaluation_start = time.time()
timing_metrics = evaluator.evaluate_timing_breakdown(data[:test_samples])
evaluation_time = time.time() - evaluation_start
else:
print(
f"🔍🤖 Running {args.llm_backend} generation evaluation with {args.model_name}..."
)
try:
# Load LLM
if args.llm_backend == "hf":
tokenizer, model = load_hf_model(args.model_name)
def llm_func(prompt):
return generate_hf(tokenizer, model, prompt)
else: # vllm
llm, sampling_params = load_vllm_model(args.model_name)
def llm_func(prompt):
return generate_vllm(llm, sampling_params, prompt)
# Simple generation evaluation
queries = [item["question"] for item in data[:test_samples]]
gen_results = evaluate_rag(
evaluator.searcher,
llm_func,
queries,
domain="finance",
complexity=complexity,
)
timing_metrics = {
"total_questions": len(queries),
"avg_search_time": gen_results["avg_search_time"],
"avg_generation_time": gen_results["avg_generation_time"],
"results": gen_results["results"],
}
evaluation_time = 0.0  # per-query timings are reported by evaluate_rag
except Exception as e:
print(f"❌ Generation evaluation failed: {e}")
timing_metrics = {
"total_questions": 0,
"avg_search_time": 0,
"avg_generation_time": 0,
}
evaluation_time = 0
# Combine all metrics
combined_metrics = {
@@ -849,8 +872,11 @@ def main():
"storage_saving_percent": storage_saving, "storage_saving_percent": storage_saving,
} }
# Print comprehensive results # Print results
evaluator._print_results(combined_metrics) print("\n📊 Generation Results:")
print(f" Total Questions: {timing_metrics.get('total_questions', 0)}")
print(f" Avg Search Time: {timing_metrics.get('avg_search_time', 0):.3f}s")
print(f" Avg Generation Time: {timing_metrics.get('avg_generation_time', 0):.3f}s")
# Save results if requested
if args.output:


@@ -1,6 +1,6 @@
# LAION Multimodal Benchmark
A multimodal benchmark for evaluating image retrieval and generation performance using LEANN with CLIP embeddings and Qwen2.5-VL for multimodal generation on a LAION dataset subset.
## Overview
@@ -9,6 +9,7 @@ This benchmark evaluates:
- **Recall@K performance** for image search
- **Complexity analysis** across different search parameters
- **Index size and storage efficiency**
- **Multimodal generation** with Qwen2.5-VL for image understanding and description
## Dataset Configuration
@@ -39,9 +40,13 @@ This will:
python evaluate_laion.py --index data/laion_index.leann
# Run specific stages
python evaluate_laion.py --index data/laion_index.leann --stage 2 # Recall evaluation
python evaluate_laion.py --index data/laion_index.leann --stage 3 # Complexity analysis
python evaluate_laion.py --index data/laion_index.leann --stage 4 # Index comparison
python evaluate_laion.py --index data/laion_index.leann --stage 5 # Multimodal generation
# Multimodal generation with Qwen2.5-VL
python evaluate_laion.py --index data/laion_index.leann --stage 5 --model-name Qwen/Qwen2.5-VL-7B-Instruct
```
### 3. Save results
@@ -74,23 +79,26 @@ python evaluate_laion.py \
## Evaluation Stages
### Stage 2: Recall Evaluation
- Evaluates Recall@3 for multimodal retrieval
- Compares LEANN vs FAISS baseline performance
- Self-recall: query caption should retrieve original image
### Stage 3: Complexity Analysis
- Binary search for optimal complexity (90% recall target)
- Tests performance across different complexity levels
- Analyzes speed vs. accuracy tradeoffs
### Stage 4: Index Comparison
- Compares compact vs non-compact index sizes
- Measures search performance differences
- Reports storage efficiency and speed ratios
### Stage 5: Multimodal Generation
- Uses Qwen2.5-VL for image understanding and description
- Retrieval-Augmented Generation (RAG) with multimodal context
- Measures both search and generation timing
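A minimal sketch of that Stage 5 flow, assuming the index path from the Quickstart and that `benchmarks/llm_utils.py` is importable as `llm_utils`:

```python
from leann import LeannSearcher

from llm_utils import create_multimodal_prompt, generate_qwen_vl, load_qwen_vl_model

searcher = LeannSearcher("data/laion_index.leann")
processor, model = load_qwen_vl_model("Qwen/Qwen2.5-VL-7B-Instruct")

def describe(caption: str, complexity: int = 64) -> str:
    # Retrieve the 3 most similar captions from the CLIP index
    docs = searcher.search(caption, top_k=3, complexity=complexity)
    context = "\n".join(f"- {d.text}" for d in docs)
    # Build the multimodal RAG prompt and generate with Qwen2.5-VL
    prompt = create_multimodal_prompt(context, caption, context.splitlines())
    return generate_qwen_vl(processor, model, prompt)
```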
## Output Metrics
### Timing Metrics
@@ -100,48 +108,70 @@ python evaluate_laion.py \
- Latency in milliseconds
### Recall Metrics
- Recall@3 percentage for image retrieval
- Number of queries with ground truth
### Index Metrics
- Total index size (MB)
- Component breakdown (index, passages, metadata)
- Storage savings (compact vs non-compact)
- Backend and embedding model info
### Generation Metrics (Stage 5)
- Average search time per query
- Average generation time per query
- Time distribution (search vs generation)
- Sample multimodal responses
- Model: Qwen2.5-VL performance
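These figures come from the dict returned by `evaluate_multimodal_rag` in `benchmarks/llm_utils.py`, roughly (values illustrative):

```json
{
  "avg_search_time": 1.2,
  "avg_generation_time": 6.558,
  "results": ["<generated description per query>"]
}
```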
## Benchmark Results
### LEANN-RAG Performance (CLIP ViT-L/14 + Qwen2.5-VL)
**Stage 3: Optimal Complexity Analysis**
- **Optimal Complexity**: 85 (achieving 90% Recall@3)
- **Binary Search Range**: 1-128
- **Target Recall**: 90%
- **Index Type**: Non-compact (for fast binary search)
**Stage 5: Multimodal Generation Performance (Qwen2.5-VL)**
- **Total Queries**: 20
- **Average Search Time**: 1.200s per query
- **Average Generation Time**: 6.558s per query
- **Time Distribution**: Search 15.5%, Generation 84.5%
- **LLM Backend**: HuggingFace transformers
- **Model**: Qwen/Qwen2.5-VL-7B-Instruct
- **Optimal Complexity**: 85
**System Performance:**
- **Index Size**: ~10,000 image embeddings from LAION subset
- **Embedding Model**: CLIP ViT-L/14 (768 dimensions)
- **Backend**: HNSW with cosine distance
### Example Results
```
🎯 LAION MULTIMODAL BENCHMARK RESULTS
============================================================
📊 Multimodal Generation Results:
Total Queries: 20
Avg Search Time: 1.200s
Avg Generation Time: 6.558s
Time Distribution: Search 15.5%, Generation 84.5%
LLM Backend: HuggingFace transformers
Model: Qwen/Qwen2.5-VL-7B-Instruct
⚙️ Optimal Complexity Analysis:
Target Recall: 90%
Optimal Complexity: 85
Binary Search Range: 1-128
Non-compact Index (fast search, no recompute)
🚀 Performance Summary:
Multimodal RAG: 7.758s total per query
Search: 15.5% of total time
Generation: 84.5% of total time
```
## Directory Structure


@@ -4,6 +4,7 @@ LAION Multimodal Benchmark Evaluation Script - Modular Recall-based Evaluation
import argparse
import json
import logging
import os
import pickle
import time
@@ -14,6 +15,13 @@ from leann import LeannSearcher
from leann_backend_hnsw import faiss
from sentence_transformers import SentenceTransformer
from ..llm_utils import evaluate_multimodal_rag, load_qwen_vl_model
# Setup logging to reduce verbose output
logging.basicConfig(level=logging.WARNING)
logging.getLogger("leann.api").setLevel(logging.WARNING)
logging.getLogger("leann_backend_hnsw").setLevel(logging.WARNING)
class RecallEvaluator:
"""Stage 2: Evaluate Recall@3 (LEANN vs FAISS baseline for multimodal retrieval)"""
@@ -388,13 +396,22 @@ def main():
)
parser.add_argument(
"--stage",
choices=["2", "3", "4", "5", "all"],
default="all",
help="Which stage to run (2=recall, 3=complexity, 4=index comparison, 5=generation)",
)
parser.add_argument("--complexity", type=int, default=None, help="Complexity for search")
parser.add_argument("--baseline-dir", default="baseline", help="Baseline output directory")
parser.add_argument("--output", help="Save results to JSON file")
parser.add_argument(
"--llm-backend",
choices=["hf"],
default="hf",
help="LLM backend (Qwen2.5-VL only supports HF)",
)
parser.add_argument(
"--model-name", default="Qwen/Qwen2.5-VL-7B-Instruct", help="Multimodal model name"
)
args = parser.parse_args()
@@ -615,12 +632,69 @@ def main():
evaluator.cleanup()
print("✅ Stage 4 completed!\n")
if args.stage in ("5", "all"):
print("🚀 Starting Stage 5: Multimodal generation with Qwen2.5-VL")
evaluator = LAIONEvaluator(args.index)
captions = evaluator.load_queries(args.queries)
test_captions = captions[: min(20, len(captions))] # Use subset for generation
print(f"🧪 Testing multimodal generation with {len(test_captions)} queries")
# Load Qwen2.5-VL model
try:
print("Loading Qwen2.5-VL model...")
processor, model = load_qwen_vl_model(args.model_name)
# Run multimodal generation evaluation
complexity = args.complexity or 64
gen_results = evaluate_multimodal_rag(
evaluator.searcher,
test_captions,
processor=processor,
model=model,
complexity=complexity,
)
print("\n📊 Multimodal Generation Results:")
print(f" Total Queries: {len(test_captions)}")
print(f" Avg Search Time: {gen_results['avg_search_time']:.3f}s")
print(f" Avg Generation Time: {gen_results['avg_generation_time']:.3f}s")
total_time = gen_results["avg_search_time"] + gen_results["avg_generation_time"]
search_pct = (gen_results["avg_search_time"] / total_time) * 100
gen_pct = (gen_results["avg_generation_time"] / total_time) * 100
print(f" Time Distribution: Search {search_pct:.1f}%, Generation {gen_pct:.1f}%")
print(" LLM Backend: HuggingFace transformers")
print(f" Model: {args.model_name}")
# Show sample results
print("\n📝 Sample Multimodal Generations:")
for i, response in enumerate(gen_results["results"][:3]):
# Handle both string and dict formats for captions
if isinstance(test_captions[i], dict):
caption_text = test_captions[i].get("query", str(test_captions[i]))
else:
caption_text = str(test_captions[i])
print(f" Query {i + 1}: {caption_text[:60]}...")
print(f" Response {i + 1}: {response[:100]}...")
print()
except Exception as e:
print(f"❌ Multimodal generation evaluation failed: {e}")
print("💡 Make sure transformers and Qwen2.5-VL are installed")
import traceback
traceback.print_exc()
evaluator.cleanup()
print("✅ Stage 5 completed!\n")
if args.stage == "all": if args.stage == "all":
print("🎉 All evaluation stages completed successfully!") print("🎉 All evaluation stages completed successfully!")
print("\n📋 Summary:") print("\n📋 Summary:")
print(" Stage 2: ✅ Multimodal Recall@3 evaluation completed") print(" Stage 2: ✅ Multimodal Recall@3 evaluation completed")
print(" Stage 3: ✅ Optimal complexity found") print(" Stage 3: ✅ Optimal complexity found")
print(" Stage 4: ✅ Index comparison analysis completed") print(" Stage 4: ✅ Index comparison analysis completed")
print(" Stage 5: ✅ Multimodal generation evaluation completed")
print("\n🔧 Recommended next steps:") print("\n🔧 Recommended next steps:")
print(" - Use optimal complexity for best speed/accuracy balance") print(" - Use optimal complexity for best speed/accuracy balance")
print(" - Review index comparison for storage vs performance tradeoffs") print(" - Review index comparison for storage vs performance tradeoffs")

benchmarks/llm_utils.py Normal file

@@ -0,0 +1,301 @@
"""
LLM utils for RAG benchmarks with Qwen3-8B and Qwen2.5-VL (multimodal)
"""
import time
try:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
HF_AVAILABLE = True
except ImportError:
HF_AVAILABLE = False
try:
from vllm import LLM, SamplingParams
VLLM_AVAILABLE = True
except ImportError:
VLLM_AVAILABLE = False
def is_qwen3_model(model_name):
"""Check if model is Qwen3"""
return "Qwen3" in model_name or "qwen3" in model_name.lower()
def is_qwen_vl_model(model_name):
"""Check if model is Qwen2.5-VL"""
return "Qwen2.5-VL" in model_name or "qwen2.5-vl" in model_name.lower()
def apply_qwen3_chat_template(tokenizer, prompt):
"""Apply Qwen3 chat template with thinking enabled"""
messages = [{"role": "user", "content": prompt}]
return tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True,
)
def extract_thinking_answer(response):
"""Extract final answer from Qwen3 thinking model response"""
if "<think>" in response and "</think>" in response:
try:
think_end = response.index("</think>") + len("</think>")
final_answer = response[think_end:].strip()
return final_answer
except (ValueError, IndexError):
pass
return response.strip()
def load_hf_model(model_name="Qwen/Qwen3-8B"):
"""Load HuggingFace model"""
if not HF_AVAILABLE:
raise ImportError("transformers not available")
print(f"Loading HF: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
device_map="auto",
trust_remote_code=True,
)
return tokenizer, model
def load_vllm_model(model_name="Qwen/Qwen3-8B"):
"""Load vLLM model"""
if not VLLM_AVAILABLE:
raise ImportError("vllm not available")
print(f"Loading vLLM: {model_name}")
llm = LLM(model=model_name, trust_remote_code=True)
# Qwen3 specific config
if is_qwen3_model(model_name):
stop_tokens = ["<|im_end|>", "<|end_of_text|>"]
max_tokens = 2048
else:
stop_tokens = None
max_tokens = 1024
sampling_params = SamplingParams(temperature=0.7, max_tokens=max_tokens, stop=stop_tokens)
return llm, sampling_params
def generate_hf(tokenizer, model, prompt, max_tokens=None):
"""Generate with HF - supports Qwen3 thinking models"""
model_name = getattr(model, "name_or_path", "unknown")
is_qwen3 = is_qwen3_model(model_name)
# Apply chat template for Qwen3
if is_qwen3:
prompt = apply_qwen3_chat_template(tokenizer, prompt)
max_tokens = max_tokens or 2048
else:
max_tokens = max_tokens or 1024
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_tokens,
temperature=0.7,
do_sample=True,
pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens; slicing the decoded string by len(prompt)
# breaks when skip_special_tokens drops chat-template tokens from the prompt
generated_tokens = outputs[0][inputs["input_ids"].shape[1] :]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
# Extract final answer for thinking models
if is_qwen3:
return extract_thinking_answer(response)
return response
def generate_vllm(llm, sampling_params, prompt):
"""Generate with vLLM - supports Qwen3 thinking models"""
outputs = llm.generate([prompt], sampling_params)
response = outputs[0].outputs[0].text.strip()
# Extract final answer for Qwen3 thinking models
model_name = str(llm.llm_engine.model_config.model)
if is_qwen3_model(model_name):
return extract_thinking_answer(response)
return response
def create_prompt(context, query, domain="default"):
"""Create RAG prompt"""
if domain == "emails":
return f"Email content:\n{context}\n\nQuestion: {query}\n\nAnswer:"
elif domain == "finance":
return f"Financial content:\n{context}\n\nQuestion: {query}\n\nAnswer:"
elif domain == "multimodal":
return f"Image context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
else:
return f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
def evaluate_rag(searcher, llm_func, queries, domain="default", top_k=3, complexity=64):
"""Simple RAG evaluation with timing"""
search_times = []
gen_times = []
results = []
for i, query in enumerate(queries):
# Search
start = time.time()
docs = searcher.search(query, top_k=top_k, complexity=complexity)
search_time = time.time() - start
# Generate
context = "\n\n".join([doc.text for doc in docs])
prompt = create_prompt(context, query, domain)
start = time.time()
response = llm_func(prompt)
gen_time = time.time() - start
search_times.append(search_time)
gen_times.append(gen_time)
results.append(response)
if i < 3:
print(f"Q{i + 1}: Search={search_time:.3f}s, Gen={gen_time:.3f}s")
return {
"avg_search_time": sum(search_times) / len(search_times),
"avg_generation_time": sum(gen_times) / len(gen_times),
"results": results,
}
def load_qwen_vl_model(model_name="Qwen/Qwen2.5-VL-7B-Instruct"):
"""Load Qwen2.5-VL multimodal model"""
if not HF_AVAILABLE:
raise ImportError("transformers not available")
print(f"Loading Qwen2.5-VL: {model_name}")
try:
from transformers import AutoModelForVision2Seq, AutoProcessor
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
return processor, model
except Exception as e:
print(f"Failed to load with AutoModelForVision2Seq, trying specific class: {e}")
# Fallback to specific class
try:
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = Qwen2VLForConditionalGeneration.from_pretrained(
model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
return processor, model
except Exception as e2:
raise ImportError(f"Failed to load Qwen2.5-VL model: {e2}")
def generate_qwen_vl(processor, model, prompt, image_path=None, max_tokens=512):
"""Generate with Qwen2.5-VL multimodal model"""
from PIL import Image
# Prepare inputs
if image_path:
image = Image.open(image_path)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
else:
inputs = processor(text=prompt, return_tensors="pt").to(model.device)
# Generate
with torch.no_grad():
generated_ids = model.generate(
**inputs, max_new_tokens=max_tokens, do_sample=False  # greedy; temperature is ignored when do_sample=False
)
# Decode response
generated_ids = generated_ids[:, inputs["input_ids"].shape[1] :]
response = processor.decode(generated_ids[0], skip_special_tokens=True)
return response
def create_multimodal_prompt(context, query, image_descriptions, task_type="images"):
"""Create prompt for multimodal RAG"""
if task_type == "images":
return f"""Based on the retrieved images and their descriptions, answer the following question.
Retrieved Image Descriptions:
{context}
Question: {query}
Provide a detailed answer based on the visual content described above."""
return f"Context: {context}\nQuestion: {query}\nAnswer:"
def evaluate_multimodal_rag(searcher, queries, processor=None, model=None, complexity=64):
"""Evaluate multimodal RAG with Qwen2.5-VL"""
search_times = []
gen_times = []
results = []
for i, query_item in enumerate(queries):
# Handle both string and dict formats for queries
if isinstance(query_item, dict):
query = query_item.get("query", "")
image_path = query_item.get("image_path") # Optional reference image
else:
query = str(query_item)
image_path = None
# Search
start_time = time.time()
search_results = searcher.search(query, top_k=3, complexity=complexity)
search_time = time.time() - start_time
search_times.append(search_time)
# Prepare context from search results
context_parts = []
for result in search_results:
context_parts.append(f"- {result.text}")
context = "\n".join(context_parts)
# Generate with multimodal model
start_time = time.time()
if processor and model:
prompt = create_multimodal_prompt(context, query, context_parts)
response = generate_qwen_vl(processor, model, prompt, image_path)
else:
response = f"Context: {context}"
gen_time = time.time() - start_time
gen_times.append(gen_time)
results.append(response)
if i < 3:
print(f"Q{i + 1}: Search={search_time:.3f}s, Gen={gen_time:.3f}s")
return {
"avg_search_time": sum(search_times) / len(search_times),
"avg_generation_time": sum(gen_times) / len(gen_times),
"results": results,
}


@@ -53,6 +53,7 @@ dependencies = [
"tree-sitter-java>=0.20.0", "tree-sitter-java>=0.20.0",
"tree-sitter-c-sharp>=0.20.0", "tree-sitter-c-sharp>=0.20.0",
"tree-sitter-typescript>=0.20.0", "tree-sitter-typescript>=0.20.0",
"torchvision>=0.23.0",
]
[project.optional-dependencies]

uv.lock generated

@@ -1564,7 +1564,7 @@ name = "importlib-metadata"
version = "8.7.0" version = "8.7.0"
source = { registry = "https://pypi.org/simple" } source = { registry = "https://pypi.org/simple" }
dependencies = [ dependencies = [
{ name = "zipp" }, { name = "zipp", marker = "python_full_version < '3.10'" },
] ]
sdist = { url = "https://files.pythonhosted.org/packages/76/66/650a33bd90f786193e4de4b3ad86ea60b53c89b669a5c7be931fac31cdb0/importlib_metadata-8.7.0.tar.gz", hash = "sha256:d13b81ad223b890aa16c5471f2ac3056cf76c5f10f82d6f9292f0b415f389000", size = 56641, upload-time = "2025-04-27T15:29:01.736Z" } sdist = { url = "https://files.pythonhosted.org/packages/76/66/650a33bd90f786193e4de4b3ad86ea60b53c89b669a5c7be931fac31cdb0/importlib_metadata-8.7.0.tar.gz", hash = "sha256:d13b81ad223b890aa16c5471f2ac3056cf76c5f10f82d6f9292f0b415f389000", size = 56641, upload-time = "2025-04-27T15:29:01.736Z" }
wheels = [ wheels = [
@@ -2257,6 +2257,7 @@ dependencies = [
{ name = "sentence-transformers" }, { name = "sentence-transformers" },
{ name = "sglang" }, { name = "sglang" },
{ name = "torch" }, { name = "torch" },
{ name = "torchvision" },
{ name = "tqdm" }, { name = "tqdm" },
{ name = "tree-sitter", version = "0.23.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" }, { name = "tree-sitter", version = "0.23.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" },
{ name = "tree-sitter", version = "0.25.1", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.10'" }, { name = "tree-sitter", version = "0.25.1", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.10'" },
@@ -2346,6 +2347,7 @@ requires-dist = [
{ name = "sentence-transformers", specifier = ">=2.2.0" }, { name = "sentence-transformers", specifier = ">=2.2.0" },
{ name = "sglang" }, { name = "sglang" },
{ name = "torch" }, { name = "torch" },
{ name = "torchvision", specifier = ">=0.23.0" },
{ name = "tqdm" }, { name = "tqdm" },
{ name = "tree-sitter", specifier = ">=0.20.0" }, { name = "tree-sitter", specifier = ">=0.20.0" },
{ name = "tree-sitter-c-sharp", specifier = ">=0.20.0" }, { name = "tree-sitter-c-sharp", specifier = ">=0.20.0" },
@@ -5870,6 +5872,44 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/ae/95/ae26263aceb3d57b821179f827d0e321373ed49423e603dd5906ab14a730/torch-2.8.0-cp39-none-macosx_11_0_arm64.whl", hash = "sha256:e9f071f5b52a9f6970dc8a919694b27a91ae9dc08898b2b988abbef5eddfd1ae", size = 73610795, upload-time = "2025-08-06T14:57:11.513Z" }, { url = "https://files.pythonhosted.org/packages/ae/95/ae26263aceb3d57b821179f827d0e321373ed49423e603dd5906ab14a730/torch-2.8.0-cp39-none-macosx_11_0_arm64.whl", hash = "sha256:e9f071f5b52a9f6970dc8a919694b27a91ae9dc08898b2b988abbef5eddfd1ae", size = 73610795, upload-time = "2025-08-06T14:57:11.513Z" },
] ]
[[package]]
name = "torchvision"
version = "0.23.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "numpy", version = "2.0.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" },
{ name = "numpy", version = "2.2.6", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version == '3.10.*'" },
{ name = "numpy", version = "2.3.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.11'" },
{ name = "pillow" },
{ name = "torch" },
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/4d/49/5ad5c3ff4920be0adee9eb4339b4fb3b023a0fc55b9ed8dbc73df92946b8/torchvision-0.23.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:7266871daca00ad46d1c073e55d972179d12a58fa5c9adec9a3db9bbed71284a", size = 1856885, upload-time = "2025-08-06T14:57:55.024Z" },
{ url = "https://files.pythonhosted.org/packages/25/44/ddd56d1637bac42a8c5da2c8c440d8a28c431f996dd9790f32dd9a96ca6e/torchvision-0.23.0-cp310-cp310-manylinux_2_28_aarch64.whl", hash = "sha256:31c583ba27426a3a04eca8c05450524105c1564db41be6632f7536ef405a6de2", size = 2394251, upload-time = "2025-08-06T14:58:01.725Z" },
{ url = "https://files.pythonhosted.org/packages/93/f3/3cdf55bbf0f737304d997561c34ab0176222e0496b6743b0feab5995182c/torchvision-0.23.0-cp310-cp310-manylinux_2_28_x86_64.whl", hash = "sha256:3932bf67256f2d095ce90a9f826f6033694c818856f4bb26794cf2ce64253e53", size = 8627497, upload-time = "2025-08-06T14:58:09.317Z" },
{ url = "https://files.pythonhosted.org/packages/97/90/02afe57c3ef4284c5cf89d3b7ae203829b3a981f72b93a7dd2a3fd2c83c1/torchvision-0.23.0-cp310-cp310-win_amd64.whl", hash = "sha256:83ee5bf827d61a8af14620c0a61d8608558638ac9c3bac8adb7b27138e2147d1", size = 1600760, upload-time = "2025-08-06T14:57:56.783Z" },
{ url = "https://files.pythonhosted.org/packages/f0/d7/15d3d7bd8d0239211b21673d1bac7bc345a4ad904a8e25bb3fd8a9cf1fbc/torchvision-0.23.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:49aa20e21f0c2bd458c71d7b449776cbd5f16693dd5807195a820612b8a229b7", size = 1856884, upload-time = "2025-08-06T14:58:00.237Z" },
{ url = "https://files.pythonhosted.org/packages/dd/14/7b44fe766b7d11e064c539d92a172fa9689a53b69029e24f2f1f51e7dc56/torchvision-0.23.0-cp311-cp311-manylinux_2_28_aarch64.whl", hash = "sha256:01dc33ee24c79148aee7cdbcf34ae8a3c9da1674a591e781577b716d233b1fa6", size = 2395543, upload-time = "2025-08-06T14:58:04.373Z" },
{ url = "https://files.pythonhosted.org/packages/79/9c/fcb09aff941c8147d9e6aa6c8f67412a05622b0c750bcf796be4c85a58d4/torchvision-0.23.0-cp311-cp311-manylinux_2_28_x86_64.whl", hash = "sha256:35c27941831b653f5101edfe62c03d196c13f32139310519e8228f35eae0e96a", size = 8628388, upload-time = "2025-08-06T14:58:07.802Z" },
{ url = "https://files.pythonhosted.org/packages/93/40/3415d890eb357b25a8e0a215d32365a88ecc75a283f75c4e919024b22d97/torchvision-0.23.0-cp311-cp311-win_amd64.whl", hash = "sha256:09bfde260e7963a15b80c9e442faa9f021c7e7f877ac0a36ca6561b367185013", size = 1600741, upload-time = "2025-08-06T14:57:59.158Z" },
{ url = "https://files.pythonhosted.org/packages/df/1d/0ea0b34bde92a86d42620f29baa6dcbb5c2fc85990316df5cb8f7abb8ea2/torchvision-0.23.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:e0e2c04a91403e8dd3af9756c6a024a1d9c0ed9c0d592a8314ded8f4fe30d440", size = 1856885, upload-time = "2025-08-06T14:58:06.503Z" },
{ url = "https://files.pythonhosted.org/packages/e2/00/2f6454decc0cd67158c7890364e446aad4b91797087a57a78e72e1a8f8bc/torchvision-0.23.0-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:6dd7c4d329a0e03157803031bc856220c6155ef08c26d4f5bbac938acecf0948", size = 2396614, upload-time = "2025-08-06T14:58:03.116Z" },
{ url = "https://files.pythonhosted.org/packages/e4/b5/3e580dcbc16f39a324f3dd71b90edbf02a42548ad44d2b4893cc92b1194b/torchvision-0.23.0-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:4e7d31c43bc7cbecbb1a5652ac0106b436aa66e26437585fc2c4b2cf04d6014c", size = 8627108, upload-time = "2025-08-06T14:58:12.956Z" },
{ url = "https://files.pythonhosted.org/packages/82/c1/c2fe6d61e110a8d0de2f94276899a2324a8f1e6aee559eb6b4629ab27466/torchvision-0.23.0-cp312-cp312-win_amd64.whl", hash = "sha256:a2e45272abe7b8bf0d06c405e78521b5757be1bd0ed7e5cd78120f7fdd4cbf35", size = 1600723, upload-time = "2025-08-06T14:57:57.986Z" },
{ url = "https://files.pythonhosted.org/packages/91/37/45a5b9407a7900f71d61b2b2f62db4b7c632debca397f205fdcacb502780/torchvision-0.23.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:1c37e325e09a184b730c3ef51424f383ec5745378dc0eca244520aca29722600", size = 1856886, upload-time = "2025-08-06T14:58:05.491Z" },
{ url = "https://files.pythonhosted.org/packages/ac/da/a06c60fc84fc849377cf035d3b3e9a1c896d52dbad493b963c0f1cdd74d0/torchvision-0.23.0-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:2f7fd6c15f3697e80627b77934f77705f3bc0e98278b989b2655de01f6903e1d", size = 2353112, upload-time = "2025-08-06T14:58:26.265Z" },
{ url = "https://files.pythonhosted.org/packages/a0/27/5ce65ba5c9d3b7d2ccdd79892ab86a2f87ac2ca6638f04bb0280321f1a9c/torchvision-0.23.0-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:a76fafe113b2977be3a21bf78f115438c1f88631d7a87203acb3dd6ae55889e6", size = 8627658, upload-time = "2025-08-06T14:58:15.999Z" },
{ url = "https://files.pythonhosted.org/packages/1f/e4/028a27b60aa578a2fa99d9d7334ff1871bb17008693ea055a2fdee96da0d/torchvision-0.23.0-cp313-cp313-win_amd64.whl", hash = "sha256:07d069cb29691ff566e3b7f11f20d91044f079e1dbdc9d72e0655899a9b06938", size = 1600749, upload-time = "2025-08-06T14:58:10.719Z" },
{ url = "https://files.pythonhosted.org/packages/05/35/72f91ad9ac7c19a849dedf083d347dc1123f0adeb401f53974f84f1d04c8/torchvision-0.23.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:2df618e1143805a7673aaf82cb5720dd9112d4e771983156aaf2ffff692eebf9", size = 2047192, upload-time = "2025-08-06T14:58:11.813Z" },
{ url = "https://files.pythonhosted.org/packages/1d/9d/406cea60a9eb9882145bcd62a184ee61e823e8e1d550cdc3c3ea866a9445/torchvision-0.23.0-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:2a3299d2b1d5a7aed2d3b6ffb69c672ca8830671967eb1cee1497bacd82fe47b", size = 2359295, upload-time = "2025-08-06T14:58:17.469Z" },
{ url = "https://files.pythonhosted.org/packages/2b/f4/34662f71a70fa1e59de99772142f22257ca750de05ccb400b8d2e3809c1d/torchvision-0.23.0-cp313-cp313t-manylinux_2_28_x86_64.whl", hash = "sha256:76bc4c0b63d5114aa81281390f8472a12a6a35ce9906e67ea6044e5af4cab60c", size = 8800474, upload-time = "2025-08-06T14:58:22.53Z" },
{ url = "https://files.pythonhosted.org/packages/6e/f5/b5a2d841a8d228b5dbda6d524704408e19e7ca6b7bb0f24490e081da1fa1/torchvision-0.23.0-cp313-cp313t-win_amd64.whl", hash = "sha256:b9e2dabf0da9c8aa9ea241afb63a8f3e98489e706b22ac3f30416a1be377153b", size = 1527667, upload-time = "2025-08-06T14:58:14.446Z" },
{ url = "https://files.pythonhosted.org/packages/d5/3e/f1f3bb3dd452b98ec2eba4820d777440abceb3d3a428a6c8243006fe47e5/torchvision-0.23.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:b190db205f90206c230fc2f91cbdfd5733334babc0e0d19bddb90a40b8cf26c2", size = 1856927, upload-time = "2025-08-06T14:58:18.919Z" },
{ url = "https://files.pythonhosted.org/packages/f4/e2/aafc6af854e792d212ff58e459f8d5d807568dc3f2b49ec41b677275e5a9/torchvision-0.23.0-cp39-cp39-manylinux_2_28_aarch64.whl", hash = "sha256:6c74cbc1cbee26dd4f35f989cd80dccc40411f258dee476b29871dee4b483af0", size = 2392870, upload-time = "2025-08-06T14:58:21.303Z" },
{ url = "https://files.pythonhosted.org/packages/5d/06/09b6a917b3759ef000428af0aa2597f983e20d9fbbcfeb826750f778fe6d/torchvision-0.23.0-cp39-cp39-manylinux_2_28_x86_64.whl", hash = "sha256:a9e9d7552d34547b80843eaf64ab0737b19b2e8bec2514286b8cfd30861ca8b5", size = 8630400, upload-time = "2025-08-06T14:58:24.139Z" },
{ url = "https://files.pythonhosted.org/packages/08/07/ae46106efbf4bbc0090078aa4c406c38282cbe4e637bdb4b7f2e984140af/torchvision-0.23.0-cp39-cp39-win_amd64.whl", hash = "sha256:dc7ce5accbbb8c9df9a79f8cef6a6df042f28e2250a6ae0d2ca70b06473fa03b", size = 1600751, upload-time = "2025-08-06T14:58:20.027Z" },
]
[[package]]
name = "tornado"
version = "6.5.2"