# FinanceBench Benchmark for LEANN-RAG
FinanceBench is a benchmark for evaluating retrieval-augmented generation (RAG) systems on financial document question-answering tasks.
## Dataset
- Source: PatronusAI/financebench
- Questions: 150 financial Q&A examples
- Documents: 368 PDF files (10-K, 10-Q, 8-K, earnings reports)
- Companies: Major public companies (3M, Apple, Microsoft, Amazon, etc.)
- Paper: *FinanceBench: A New Benchmark for Financial Question Answering*
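To inspect the source Q&A data directly, it can be loaded from the Hugging Face Hub. A quick-look sketch (column names such as `question`/`answer` are assumed from the dataset card, not taken from the benchmark scripts):

```python
from datasets import load_dataset

# Quick look at the source data; setup_financebench.py merges it into
# data/financebench_merged.jsonl. Column names assumed from the dataset card.
ds = load_dataset("PatronusAI/financebench", split="train")
print(len(ds))             # expected: 150 Q&A examples
print(ds[0]["question"])
print(ds[0]["answer"])
```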
## Structure
```
benchmarks/financebench/
├── setup_financebench.py        # Downloads PDFs and builds the index
├── evaluate_financebench.py     # Retrieval and QA evaluation script
├── data/
│   ├── financebench_merged.jsonl   # Q&A dataset
│   ├── pdfs/                       # Downloaded financial documents
│   └── index/                      # LEANN indexes
│       └── financebench_full_hnsw.leann
└── README.md
```
## Usage
### 1. Setup (Download & Build Index)
```bash
cd benchmarks/financebench
python setup_financebench.py
```
This will:
- Download the 150 Q&A examples
- Download all 368 PDF documents (in parallel)
- Build a LEANN index from 53K+ text chunks
- Verify the setup with a test query
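For orientation, a minimal sketch of the build-and-verify flow, assuming the `LeannBuilder`/`LeannSearcher` API shown in the main LEANN README (the placeholder chunks below stand in for the real PDF-derived chunks):

```python
from leann import LeannBuilder, LeannSearcher

INDEX_PATH = "data/index/financebench_full_hnsw.leann"

# Build: add pre-chunked text, then write the index to disk (hnsw backend).
builder = LeannBuilder(backend_name="hnsw")
for chunk in ["Apple's FY2022 net sales were ...", "3M's 10-K reports ..."]:  # placeholders
    builder.add_text(chunk)
builder.build_index(INDEX_PATH)

# Verify: run a test query against the freshly built index.
searcher = LeannSearcher(INDEX_PATH)
for result in searcher.search("What was Apple's FY2022 revenue?", top_k=3):
    print(result)
```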
### 2. Evaluation
```bash
# Basic retrieval evaluation
python evaluate_financebench.py --index data/index/financebench_full_hnsw.leann

# RAG generation evaluation with Qwen3-8B
python evaluate_financebench.py --index data/index/financebench_full_hnsw.leann \
    --stage 4 --complexity 64 --llm-backend hf --model-name Qwen/Qwen3-8B \
    --output results_qwen3.json
```
## Evaluation Methods
### Retrieval Evaluation
Retrieved chunks are scored against the gold evidence with three matching strategies (a sketch follows the list):
- Exact text overlap - direct substring matches against the evidence
- Number matching - key financial figures ($1,577, 1.2B, etc.) shared with the evidence
- Semantic similarity - word overlap with the evidence at a 20% threshold
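A minimal sketch of how these checks could look (illustrative helper names and a simplified number regex; the benchmark's actual logic lives in evaluate_financebench.py):

```python
import re

# Simplified pattern for figures like $1,577, 1.2B, 45,000.
NUM_RE = re.compile(r"\$?\d[\d,]*(?:\.\d+)?[BMK]?")

def exact_match(evidence: str, chunk: str) -> bool:
    # Strategy 1: direct substring overlap with the gold evidence.
    return evidence.strip().lower() in chunk.lower()

def number_match(evidence: str, chunk: str) -> bool:
    # Strategy 2: a key financial figure from the evidence appears in the chunk.
    return any(fig in chunk for fig in NUM_RE.findall(evidence))

def semantic_match(evidence: str, chunk: str, threshold: float = 0.20) -> bool:
    # Strategy 3: share of evidence words found in the chunk (>= 20%).
    words = set(evidence.lower().split())
    if not words:
        return False
    return len(words & set(chunk.lower().split())) / len(words) >= threshold
```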
### QA Evaluation
LLM-based answer evaluation using GPT-4o:
- Handles numerical rounding and equivalent representations
- Considers fractions, percentages, and decimal equivalents
- Evaluates semantic meaning rather than exact text match
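A sketch of such an LLM-judge call using the OpenAI Python client (the prompt below is illustrative, not the script's exact prompt; requires an `OPENAI_API_KEY`):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a financial QA system.
Question: {question}
Gold answer: {gold}
Model answer: {pred}
Treat rounded numbers, fractions/percentages, and paraphrases of the
same value as equivalent. Reply with one word: CORRECT or INCORRECT."""

def judge(question: str, gold: str, pred: str) -> bool:
    # Ask GPT-4o to grade the prediction against the gold answer.
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, gold=gold, pred=pred),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")
```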
## Benchmark Results
### LEANN-RAG Performance (sentence-transformers/all-mpnet-base-v2)

**Retrieval Metrics:**
- Question Coverage: 100.0% (all questions retrieve relevant docs)
- Exact Match Rate: 0.7% (substring overlap with evidence)
- Number Match Rate: 120.7% (key financial figures matched)*
- Semantic Match Rate: 4.7% (word overlap ≥20%)
- Average Search Time: 0.097s
**QA Metrics:**
- Accuracy: 42.7% (LLM-evaluated answer correctness)
- Average QA Time: 4.71s (end-to-end response time)
**System Performance:**
- Index Size: 53,985 chunks from 368 PDFs
- Build Time: ~5-10 minutes with sentence-transformers/all-mpnet-base-v2
*Note: Number match rate >100% indicates multiple retrieved documents contain the same financial figures, which is expected behavior for financial data appearing across multiple document sections.
### LEANN-RAG Generation Performance (Qwen3-8B)
**Stage 4 (Index Comparison):**
- Compact Index: 5.0 MB
- Non-compact Index: 172.2 MB
- Storage Saving: 97.1%

**Search Performance:**
- Non-compact (no recompute): 0.009s avg per query
- Compact (with recompute): 2.203s avg per query
- Speed ratio: 0.004x (the compact index runs at roughly 1/245 the speed of the non-compact one, since it recomputes embeddings at query time)
**Generation Evaluation (20 queries, complexity=64):**
- Average Search Time: 1.638s per query
- Average Generation Time: 45.957s per query
- LLM Backend: HuggingFace transformers
- Model: Qwen/Qwen3-8B (thinking model; generation time includes its reasoning output)
- Total Questions Processed: 20
## Options
```bash
# Use different backends
python setup_financebench.py --backend diskann
python evaluate_financebench.py --index data/index/financebench_full_diskann.leann

# Use different embedding models
python setup_financebench.py --embedding-model facebook/contriever
```