Files
LEANN/benchmarks/financebench/README.md
2025-08-16 16:48:01 -07:00

3.3 KiB

FinanceBench Benchmark for LEANN-RAG

FinanceBench is a benchmark for evaluating retrieval-augmented generation (RAG) systems on financial document question-answering tasks.

Dataset

Structure

benchmarks/financebench/
├── setup_financebench.py        # Downloads PDFs and builds index
├── evaluate_financebench.py     # Intelligent evaluation script
├── data/
│   ├── financebench_merged.jsonl     # Q&A dataset
│   ├── pdfs/                         # Downloaded financial documents
│   └── index/                        # LEANN indexes
│       └── financebench_full_hnsw.leann
└── README.md

Usage

1. Setup (Download & Build Index)

cd benchmarks/financebench
python setup_financebench.py

This will:

  • Download the 150 Q&A examples
  • Download all 368 PDF documents (parallel processing)
  • Build a LEANN index from 53K+ text chunks
  • Verify setup with test query

2. Evaluation

# Basic retrieval evaluation
python evaluate_financebench.py --index data/index/financebench_full_hnsw.leann

# Include QA evaluation with OpenAI
export OPENAI_API_KEY="your-key"
python evaluate_financebench.py --index data/index/financebench_full_hnsw.leann --qa-samples 20

Evaluation Methods

Retrieval Evaluation

Uses intelligent matching with three strategies:

  1. Exact text overlap - Direct substring matches
  2. Number matching - Key financial figures ($1,577, 1.2B, etc.)
  3. Semantic similarity - Word overlap with 20% threshold

QA Evaluation

LLM-based answer evaluation using GPT-4o:

  • Handles numerical rounding and equivalent representations
  • Considers fractions, percentages, and decimal equivalents
  • Evaluates semantic meaning rather than exact text match

Benchmark Results

LEANN-RAG Performance (sentence-transformers/all-mpnet-base-v2)

Retrieval Metrics:

  • Question Coverage: 100.0% (all questions retrieve relevant docs)
  • Exact Match Rate: 0.7% (substring overlap with evidence)
  • Number Match Rate: 120.7% (key financial figures matched)*
  • Semantic Match Rate: 4.7% (word overlap ≥20%)
  • Average Search Time: 0.097s

QA Metrics:

  • Accuracy: 42.7% (LLM-evaluated answer correctness)
  • Average QA Time: 4.71s (end-to-end response time)

System Performance:

  • Index Size: 53,985 chunks from 368 PDFs
  • Build Time: ~5-10 minutes with sentence-transformers/all-mpnet-base-v2

*Note: Number match rate >100% indicates multiple retrieved documents contain the same financial figures, which is expected behavior for financial data appearing across multiple document sections.

Options

# Use different backends
python setup_financebench.py --backend diskann
python evaluate_financebench.py --index data/index/financebench_full_diskann.leann

# Use different embedding models
python setup_financebench.py --embedding-model facebook/contriever