# FinanceBench Benchmark for LEANN-RAG

FinanceBench is a benchmark for evaluating retrieval-augmented generation (RAG) systems on financial document question-answering tasks.

## Dataset

- **Source**: [PatronusAI/financebench](https://huggingface.co/datasets/PatronusAI/financebench)
- **Questions**: 150 financial Q&A examples
- **Documents**: 368 PDF files (10-K, 10-Q, 8-K, earnings reports)
- **Companies**: Major public companies (3M, Apple, Microsoft, Amazon, etc.)
- **Paper**: [FinanceBench: A New Benchmark for Financial Question Answering](https://arxiv.org/abs/2311.11944)
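
For a quick look at the data, the Q&A examples can be pulled straight from the Hugging Face Hub. A minimal sketch using the `datasets` library; the field names (`question`, `answer`, `doc_name`) are assumptions based on the dataset card, so adjust if the schema differs:

```python
# Sketch: inspect the FinanceBench Q&A examples.
# Field names below are assumptions from the dataset card.
from datasets import load_dataset

ds = load_dataset("PatronusAI/financebench", split="train")
print(len(ds))  # expected: 150 Q&A examples

for row in ds.select(range(3)):
    print(row["doc_name"], "->", row["question"])
    print("  gold answer:", row["answer"])
```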

## Structure

```
benchmarks/financebench/
├── setup_financebench.py      # Downloads PDFs and builds the index
├── evaluate_financebench.py   # Retrieval and QA evaluation script
├── data/
│   ├── financebench_merged.jsonl   # Q&A dataset
│   ├── pdfs/                       # Downloaded financial documents
│   └── index/                      # LEANN indexes
│       └── financebench_full_hnsw.leann
└── README.md
```

## Usage

### 1. Setup (Download & Build Index)

```bash
cd benchmarks/financebench
python setup_financebench.py
```

This will:

- Download the 150 Q&A examples
- Download all 368 PDF documents (parallel processing)
- Build a LEANN index from 53K+ text chunks
- Verify the setup with a test query (see the sketch below)
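
The final verification step amounts to one search against the fresh index. A minimal sketch, assuming the `LeannSearcher` class and `search(query, top_k=...)` signature shown in the top-level LEANN README; adjust to the installed version:

```python
# Sketch: sanity-check the freshly built index with one query.
# LeannSearcher and its search() signature are assumed from the LEANN README.
from leann import LeannSearcher

searcher = LeannSearcher("data/index/financebench_full_hnsw.leann")
results = searcher.search("What was 3M's total revenue in FY2022?", top_k=3)
for result in results:
    print(result)  # a few relevant 10-K chunks should come back
```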

### 2. Evaluation

```bash
# Basic retrieval evaluation
python evaluate_financebench.py --index data/index/financebench_full_hnsw.leann

# RAG generation evaluation with Qwen3-8B
python evaluate_financebench.py --index data/index/financebench_full_hnsw.leann --stage 4 --complexity 64 --llm-backend hf --model-name Qwen/Qwen3-8B --output results_qwen3.json
```

## Evaluation Methods

### Retrieval Evaluation

Retrieved chunks are matched against the gold evidence with three strategies (a sketch follows the list):

1. **Exact text overlap** - Direct substring matches
2. **Number matching** - Key financial figures ($1,577, 1.2B, etc.)
3. **Semantic similarity** - Word overlap with a 20% threshold
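
A rough sketch of such a matcher; the helper name and the number-extraction regex are illustrative, not the script's actual implementation:

```python
import re

# Pull out financial figures like "$1,577", "1.2B", or "42%".
NUMBER_RE = re.compile(r"\$?\d[\d,]*(?:\.\d+)?\s*[%BMK]?")

def match_strategy(chunk: str, evidence: str) -> str | None:
    """Return which strategy (if any) counts `chunk` as matching `evidence`."""
    chunk_l, evidence_l = chunk.lower(), evidence.lower()

    # 1. Exact text overlap: direct substring match in either direction.
    if evidence_l in chunk_l or chunk_l in evidence_l:
        return "exact"

    # 2. Number matching: a key figure from the evidence appears in the chunk.
    if set(NUMBER_RE.findall(evidence)) & set(NUMBER_RE.findall(chunk)):
        return "number"

    # 3. Semantic similarity: at least 20% of evidence words occur in the chunk.
    evidence_words = set(evidence_l.split())
    overlap = len(evidence_words & set(chunk_l.split()))
    if evidence_words and overlap / len(evidence_words) >= 0.2:
        return "semantic"

    return None
```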

### QA Evaluation

Generated answers are graded by an LLM judge (GPT-4o) rather than by string comparison (a sketch of the judging call follows the list):

- Handles numerical rounding and equivalent representations
- Considers fractions, percentages, and decimal equivalents
- Evaluates semantic meaning rather than exact text match
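
A minimal sketch of such a judging call via the OpenAI client; the prompt wording is illustrative, not the evaluation script's actual prompt:

```python
# Sketch: LLM-as-judge grading of a generated answer against the gold answer.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, gold: str, predicted: str) -> bool:
    prompt = (
        "Judge whether the predicted answer is correct. Treat rounded numbers, "
        "fractions, percentages, and decimal equivalents as matches; compare "
        "meaning, not exact wording.\n"
        f"Question: {question}\nGold answer: {gold}\nPredicted: {predicted}\n"
        "Reply with exactly CORRECT or INCORRECT."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")
```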

## Benchmark Results

### LEANN-RAG Performance (sentence-transformers/all-mpnet-base-v2)

**Retrieval Metrics:**

- **Question Coverage**: 100.0% (all questions retrieve relevant docs)
- **Exact Match Rate**: 0.7% (substring overlap with evidence)
- **Number Match Rate**: 120.7% (key financial figures matched)*
- **Semantic Match Rate**: 4.7% (word overlap ≥20%)
- **Average Search Time**: 0.097s

**QA Metrics:**

- **Accuracy**: 42.7% (LLM-evaluated answer correctness)
- **Average QA Time**: 4.71s (end-to-end response time)

**System Performance:**

- **Index Size**: 53,985 chunks from 368 PDFs
- **Build Time**: ~5-10 minutes with sentence-transformers/all-mpnet-base-v2

*Note: A number match rate above 100% means multiple retrieved chunks contain the same financial figure (e.g., 181 number matches counted across 150 questions gives 120.7%), which is expected for figures that recur across document sections.

### LEANN-RAG Generation Performance (Qwen3-8B)

- **Stage 4 (Index Comparison):**
  - Compact Index: 5.0 MB
  - Non-compact Index: 172.2 MB
  - **Storage Saving**: 97.1% (1 - 5.0/172.2)
  - **Search Performance**:
    - Non-compact (no recompute): 0.009s avg per query
    - Compact (with recompute): 2.203s avg per query
    - Speed ratio: 0.004x (0.009/2.203; compact search pays ~245x latency for the storage saving)

**Generation Evaluation (20 queries, complexity=64):**

- **Average Search Time**: 1.638s per query
- **Average Generation Time**: 45.957s per query
- **LLM Backend**: HuggingFace transformers
- **Model**: Qwen/Qwen3-8B (a thinking model, so its `<think></think>` output needs post-processing; see the sketch below)
- **Total Questions Processed**: 20
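
Post-processing the thinking block can be as small as a regex strip before the answer is scored. A minimal sketch; the evaluation script's actual handling may differ:

```python
import re

# Qwen3-style thinking models wrap their reasoning in <think>...</think>;
# drop that block so only the final answer is evaluated.
def strip_thinking(text: str) -> str:
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print(strip_thinking("<think>Find FY2022 revenue in the 10-K...</think>$34.2B"))  # -> $34.2B
```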
## Options
|
|
|
|
```bash
|
|
# Use different backends
|
|
python setup_financebench.py --backend diskann
|
|
python evaluate_financebench.py --index data/index/financebench_full_diskann.leann
|
|
|
|
# Use different embedding models
|
|
python setup_financebench.py --embedding-model facebook/contriever
|
|
```
|