- Fix import order in packages/leann-backend-hnsw/leann_backend_hnsw/hnsw_backend.py (Ruff I001)
- Update benchmarks/run_evaluation.py
- Update apps/base_rag_example.py and leann-core API usage
- Add benchmarks/data/README.md
- Update uv.lock
- Misc cleanup
- Note: paru-bin was added as an embedded git repo; consider making it a submodule, or remove it with `git rm --cached paru-bin` if unintended
| license |
|---|
| mit |

# LEANN-RAG Evaluation Data
This repository contains the necessary data to run the recall evaluation scripts for the LEANN-RAG project.
## Dataset Components
This dataset is structured into three main parts:
- **Pre-built LEANN Indices:**
  - `dpr/`: A pre-built index for the DPR dataset.
  - `rpj_wiki/`: A pre-built index for the RPJ-Wiki dataset.

  These indices were created using the `leann-core` library and are required by the `LeannSearcher`.

- **Ground Truth Data:**
  - `ground_truth/`: Contains the ground truth files (`flat_results_nq_k3.json`) for both the DPR and RPJ-Wiki datasets. These files map queries to the original passage IDs from the Natural Questions benchmark, evaluated using the Contriever model.

- **Queries:**
  - `queries/`: Contains the `nq_open.jsonl` file with the Natural Questions queries used for the evaluation (a loading sketch for the queries and ground truth follows this list).
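
For orientation, here is a minimal sketch of how the queries and ground-truth files could be read once the dataset has been downloaded (see Usage below). The exact record fields and the subdirectory layout under `ground_truth/` are not documented here, so the globbing and access patterns below are assumptions to adapt to the actual files.

```python
import json
from pathlib import Path

data_dir = Path("data")  # local copy of this dataset (see Usage below)

# Queries are stored as JSON Lines: one JSON object per line.
queries = []
with open(data_dir / "queries" / "nq_open.jsonl") as f:
    for line in f:
        queries.append(json.loads(line))
print(f"Loaded {len(queries)} queries")

# One flat_results_nq_k3.json exists for each dataset (DPR and RPJ-Wiki);
# the exact layout under ground_truth/ is assumed here, so glob for the files.
for gt_path in (data_dir / "ground_truth").rglob("flat_results_nq_k3.json"):
    with open(gt_path) as f:
        ground_truth = json.load(f)  # maps queries to Contriever passage IDs
    print(f"{gt_path}: {len(ground_truth)} entries")
```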
## Usage
To use this data, you can download it locally using the `huggingface-hub` library. First, install the library:

```bash
pip install huggingface-hub
```
Then, you can download the entire dataset to a local directory (e.g., `data/`) with the following Python script:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="LEANN-RAG/leann-rag-evaluation-data",
    repo_type="dataset",
    local_dir="data",
)
```
This will download all the necessary files into a local `data/` folder, preserving the repository structure. The evaluation scripts in the main LEANN-RAG Space are configured to work with this data structure.
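
If you only need part of the dataset (for example the queries and ground truth, without the large pre-built indices), `snapshot_download` also accepts an `allow_patterns` argument to restrict what is fetched. The patterns below are a sketch based on the directory names described above.

```python
from huggingface_hub import snapshot_download

# Download only the queries and ground-truth files, skipping the indices.
# The patterns assume the layout described under "Dataset Components".
snapshot_download(
    repo_id="LEANN-RAG/leann-rag-evaluation-data",
    repo_type="dataset",
    local_dir="data",
    allow_patterns=["queries/*", "ground_truth/*"],
)
```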