# FinanceBench Benchmark for LEANN-RAG

FinanceBench is a benchmark for evaluating retrieval-augmented generation (RAG) systems on financial document question-answering tasks.

## Dataset

- **Source**: [PatronusAI/financebench](https://huggingface.co/datasets/PatronusAI/financebench)
- **Questions**: 150 financial Q&A examples
- **Documents**: 368 PDF files (10-K, 10-Q, 8-K, earnings reports)
- **Companies**: Major public companies (3M, Apple, Microsoft, Amazon, etc.)
- **Paper**: [FinanceBench: A New Benchmark for Financial Question Answering](https://arxiv.org/abs/2311.11944)

## Structure

```
benchmarks/financebench/
├── setup_financebench.py      # Downloads PDFs and builds the index
├── evaluate_financebench.py   # Intelligent evaluation script
├── data/
│   ├── financebench_merged.jsonl  # Q&A dataset
│   ├── pdfs/                      # Downloaded financial documents
│   └── index/                     # LEANN indexes
│       └── financebench_full_hnsw.leann
└── README.md
```

## Usage

### 1. Setup (Download & Build Index)

```bash
cd benchmarks/financebench
python setup_financebench.py
```

This will:

- Download the 150 Q&A examples
- Download all 368 PDF documents (in parallel)
- Build a LEANN index from 53K+ text chunks
- Verify the setup with a test query

### 2. Evaluation

```bash
# Basic retrieval evaluation
python evaluate_financebench.py --index data/index/financebench_full_hnsw.leann

# RAG generation evaluation with Qwen3-8B
python evaluate_financebench.py --index data/index/financebench_full_hnsw.leann --stage 4 --complexity 64 --llm-backend hf --model-name Qwen/Qwen3-8B --output results_qwen3.json
```

## Evaluation Methods

### Retrieval Evaluation

Retrieved chunks are matched against the gold evidence using intelligent matching with three strategies (a sketch follows this section):

1. **Exact text overlap** - Direct substring matches
2. **Number matching** - Key financial figures ($1,577, 1.2B, etc.)
3. **Semantic similarity** - Word overlap with a 20% threshold

### QA Evaluation

LLM-based answer evaluation using GPT-4o (see the grading sketch below):

- Handles numerical rounding and equivalent representations
- Considers fractions, percentages, and decimal equivalents
- Evaluates semantic meaning rather than exact text match
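The exact matching logic lives in `evaluate_financebench.py`; as a rough illustration, the three strategies could look like the minimal sketch below. The function names and the number regex are illustrative assumptions, not the script's actual code.

```python
import re

# Illustrative pattern for financial figures such as "$1,577" or "1.2B".
NUMBER_RE = re.compile(r"\$?\d[\d,]*(?:\.\d+)?\s*(?:[BbMmKk]|billion|million)?")


def exact_match(evidence: str, retrieved: str) -> bool:
    # Strategy 1: direct substring overlap.
    return evidence.strip().lower() in retrieved.lower()


def number_match(evidence: str, retrieved: str) -> bool:
    # Strategy 2: any shared key financial figure.
    evidence_nums = {m.strip() for m in NUMBER_RE.findall(evidence)}
    retrieved_nums = {m.strip() for m in NUMBER_RE.findall(retrieved)}
    return bool(evidence_nums & retrieved_nums)


def semantic_match(evidence: str, retrieved: str, threshold: float = 0.20) -> bool:
    # Strategy 3: fraction of evidence words found in the retrieved text,
    # accepted at the 20% threshold described above.
    evidence_words = set(evidence.lower().split())
    if not evidence_words:
        return False
    overlap = evidence_words & set(retrieved.lower().split())
    return len(overlap) / len(evidence_words) >= threshold
```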
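The GPT-4o grader is prompt-based. Below is a minimal sketch of such a grader using the OpenAI Python client; the prompt wording and the `grade` helper are illustrative assumptions, not the script's actual prompt or API.

```python
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

# Illustrative grading prompt, not the script's actual prompt.
GRADER_PROMPT = """You are grading a financial QA answer.
Question: {question}
Gold answer: {gold}
Model answer: {predicted}

Treat rounded numbers, fractions, percentages, and other equivalent
representations (e.g. 0.5 vs 50%) as correct. Judge meaning, not wording.
Reply with exactly one word: CORRECT or INCORRECT."""


def grade(question: str, gold: str, predicted: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(
                question=question, gold=gold, predicted=predicted),
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("CORRECT")
```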
## Benchmark Results

### LEANN-RAG Performance (sentence-transformers/all-mpnet-base-v2)

**Retrieval Metrics:**

- **Question Coverage**: 100.0% (all questions retrieve relevant docs)
- **Exact Match Rate**: 0.7% (substring overlap with evidence)
- **Number Match Rate**: 120.7% (key financial figures matched)*
- **Semantic Match Rate**: 4.7% (word overlap ≥20%)
- **Average Search Time**: 0.097s

**QA Metrics:**

- **Accuracy**: 42.7% (LLM-evaluated answer correctness)
- **Average QA Time**: 4.71s (end-to-end response time)

**System Performance:**

- **Index Size**: 53,985 chunks from 368 PDFs
- **Build Time**: ~5-10 minutes with sentence-transformers/all-mpnet-base-v2

*Note: A number match rate above 100% indicates that multiple retrieved documents contain the same financial figures, which is expected for financial data repeated across document sections.

### LEANN-RAG Generation Performance (Qwen3-8B)

**Stage 4 (Index Comparison):**

- Compact index: 5.0 MB
- Non-compact index: 172.2 MB
- **Storage saving**: 97.1%

**Search Performance:**

- Non-compact (no recompute): 0.009s avg per query
- Compact (with recompute): 2.203s avg per query
- Speed ratio: 0.004× (compact search with recompute is ~245× slower)

**Generation Evaluation (20 queries, complexity=64):**

- **Average Search Time**: 1.638s per query
- **Average Generation Time**: 45.957s per query
- **LLM Backend**: HuggingFace transformers
- **Model**: Qwen/Qwen3-8B (a thinking model; generation time includes reasoning tokens)
- **Total Questions Processed**: 20

## Options

```bash
# Use a different backend
python setup_financebench.py --backend diskann
python evaluate_financebench.py --index data/index/financebench_full_diskann.leann

# Use a different embedding model
python setup_financebench.py --embedding-model facebook/contriever
```
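For a quick end-to-end sanity check after setup, you can query the built index directly from Python. This is a hedged sketch: it assumes the `leann` package exposes `LeannSearcher` with a `search(query, top_k=...)` method (as documented in the LEANN README) and that each record in `financebench_merged.jsonl` carries a `question` field; verify both against your installed version.

```python
import json

from leann import LeannSearcher  # assumed import path; check the LEANN docs

# Load the first Q&A example from the merged dataset
# (the "question" field name is an assumption about the JSONL schema).
with open("data/financebench_merged.jsonl") as f:
    example = json.loads(f.readline())

searcher = LeannSearcher("data/index/financebench_full_hnsw.leann")
results = searcher.search(example["question"], top_k=3)
for result in results:
    print(result)
```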