fix readme

2025-10-08 21:38:55 +00:00
parent 3ec5e8d035
commit 5be0c144ad
72 changed files with 16608 additions and 4175 deletions
--- a/benchmarks/financebench/README.md
+++ b/benchmarks/financebench/README.md
@@ -0,0 +1,115 @@
+# FinanceBench Benchmark for LEANN-RAG
+
+FinanceBench is a benchmark for evaluating retrieval-augmented generation (RAG) systems on financial document question-answering tasks.
+
+## Dataset
+
+- **Source**: [PatronusAI/financebench](https://huggingface.co/datasets/PatronusAI/financebench)
+- **Questions**: 150 financial Q&A examples
+- **Documents**: 368 PDF files (10-K, 10-Q, 8-K, earnings reports)
+- **Companies**: Major public companies (3M, Apple, Microsoft, Amazon, etc.)
+- **Paper**: [FinanceBench: A New Benchmark for Financial Question Answering](https://arxiv.org/abs/2311.11944)
+
+## Structure
+
+```
+benchmarks/financebench/
+├── setup_financebench.py        # Downloads PDFs and builds index
+├── evaluate_financebench.py     # Retrieval and QA evaluation script
+├── data/
+│   ├── financebench_merged.jsonl     # Q&A dataset
+│   ├── pdfs/                         # Downloaded financial documents
+│   └── index/                        # LEANN indexes
+│       └── financebench_full_hnsw.leann
+└── README.md
+```
+
+## Usage
+
+### 1. Setup (Download & Build Index)
+
+```bash
+cd benchmarks/financebench
+python setup_financebench.py
+```
+
+This will:
+- Download the 150 Q&A examples
+- Download all 368 PDF documents in parallel
+- Build a LEANN index from 53K+ text chunks
+- Verify the setup with a test query
+
+### 2. Evaluation
+
+```bash
+# Basic retrieval evaluation
+python evaluate_financebench.py --index data/index/financebench_full_hnsw.leann
+
+
+# RAG generation evaluation with Qwen3-8B
+python evaluate_financebench.py --index data/index/financebench_full_hnsw.leann --stage 4 --complexity 64 --llm-backend hf --model-name Qwen/Qwen3-8B --output results_qwen3.json
+```
+
+## Evaluation Methods
+
+### Retrieval Evaluation
+Retrieved chunks are matched against the reference evidence using three strategies:
+1. **Exact text overlap** - Direct substring matches
+2. **Number matching** - Shared key financial figures ($1,577, 1.2B, etc.)
+3. **Semantic similarity** - Word overlap of at least 20%
+
+### QA Evaluation
+LLM-based answer evaluation using GPT-4o:
+- Handles numerical rounding and equivalent representations
+- Considers fractions, percentages, and decimal equivalents
+- Evaluates semantic meaning rather than exact text match
+
+## Benchmark Results
+
+### LEANN-RAG Performance (sentence-transformers/all-mpnet-base-v2)
+
+**Retrieval Metrics:**
+- **Question Coverage**: 100.0% (all questions retrieve relevant docs)
+- **Exact Match Rate**: 0.7% (substring overlap with evidence)
+- **Number Match Rate**: 120.7% (key financial figures matched)*
+- **Semantic Match Rate**: 4.7% (word overlap ≥20%)
+- **Average Search Time**: 0.097s
+
+**QA Metrics:**
+- **Accuracy**: 42.7% (LLM-evaluated answer correctness)
+- **Average QA Time**: 4.71s (end-to-end response time)
+
+**System Performance:**
+- **Index Size**: 53,985 chunks from 368 PDFs
+- **Build Time**: ~5-10 minutes with sentence-transformers/all-mpnet-base-v2
+
+*Note: A number match rate above 100% indicates that multiple retrieved documents contain the same financial figures, so a single question can contribute more than one match. This is expected, because financial figures recur across multiple document sections.
+
+### LEANN-RAG Generation Performance (Qwen3-8B)
+
+- **Stage 4 (Index Comparison)**:
+  - Compact Index: 5.0 MB
+  - Non-compact Index: 172.2 MB
+  - **Storage Saving**: 97.1% (1 - 5.0 / 172.2)
+- **Search Performance**:
+  - Non-compact (no recompute): 0.009s avg per query
+  - Compact (with recompute): 2.203s avg per query
+  - Speed ratio: 0.004x (non-compact time / compact time, i.e. compact search is roughly 245x slower)
+
+**Generation Evaluation (20 queries, complexity=64):**
+- **Average Search Time**: 1.638s per query
+- **Average Generation Time**: 45.957s per query
+- **LLM Backend**: HuggingFace transformers
+- **Model**: Qwen/Qwen3-8B (thinking model with `<think></think>` processing)
+- **Total Questions Processed**: 20
+
+## Options
+
+```bash
+# Use a different backend
+python setup_financebench.py --backend diskann
+python evaluate_financebench.py --index data/index/financebench_full_diskann.leann
+
+# Use a different embedding model
+python setup_financebench.py --embedding-model facebook/contriever
+```
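The three retrieval-matching strategies listed under "Retrieval Evaluation" in the README could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `evaluate_financebench.py`: the function names, the number regex, and the choice to measure word overlap relative to the evidence text are all hypothetical. Only the three strategy types and the 20% threshold come from the README.

```python
import re

# Hypothetical pattern for numeric figures such as "1,577" or "1.2".
# The actual script may extract numbers differently.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def exact_overlap(evidence: str, chunk: str) -> bool:
    """Strategy 1: the evidence text appears verbatim in the retrieved chunk."""
    ev = evidence.strip().lower()
    return bool(ev) and ev in chunk.lower()


def extract_numbers(text: str) -> set:
    """Collect numeric tokens, dropping thousands separators ("1,577" -> "1577")."""
    return {m.replace(",", "") for m in NUMBER_RE.findall(text)}


def number_match(evidence: str, chunk: str) -> bool:
    """Strategy 2: evidence and chunk share at least one numeric figure."""
    return bool(extract_numbers(evidence) & extract_numbers(chunk))


def word_overlap(evidence: str, chunk: str) -> float:
    """Fraction of distinct evidence words that also occur in the chunk."""
    ev_words = set(re.findall(r"[a-z0-9]+", evidence.lower()))
    ch_words = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return len(ev_words & ch_words) / len(ev_words) if ev_words else 0.0


def semantic_match(evidence: str, chunk: str, threshold: float = 0.2) -> bool:
    """Strategy 3: word overlap at or above the threshold (20% in the README)."""
    return word_overlap(evidence, chunk) >= threshold


if __name__ == "__main__":
    # Made-up example strings, not taken from the FinanceBench dataset.
    evidence = "Capital expenditure was $1,577 million in FY2018."
    chunk = "Purchases of property, plant and equipment (PP&E): 1,577"
    print("exact:", exact_overlap(evidence, chunk))
    print("number:", number_match(evidence, chunk))
    print("semantic:", semantic_match(evidence, chunk))
```

If each retrieved chunk is checked independently like this and the matches are summed over all chunks, a question with several matching chunks contributes more than once, which is one way a rate above 100% (as in the README's number match figure) can arise.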