| license |
|---|
| mit |
LEANN-RAG Evaluation Data
This repository contains the necessary data to run the recall evaluation scripts for the LEANN-RAG project.
Dataset Components
This dataset is structured into three main parts:
- Pre-built LEANN Indices:
  - dpr/: A pre-built index for the DPR dataset.
  - rpj_wiki/: A pre-built index for the RPJ-Wiki dataset.

  These indices were created using the leann-core library and are required by the LeannSearcher.
- Ground Truth Data:
  - ground_truth/: Contains the ground truth files (flat_results_nq_k3.json) for both the DPR and RPJ-Wiki datasets. These files map queries to the original passage IDs from the Natural Questions benchmark, evaluated using the Contriever model.
- Queries:
  - queries/: Contains the nq_open.jsonl file with the Natural Questions queries used for the evaluation.
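Since the ground truth maps each query to passage IDs, a recall score can be computed by comparing those IDs with what a searcher returns. The snippet below is a minimal sketch of that comparison and not the project's evaluation script: the function name recall_at_k, the toy passage IDs, and the dictionary layout are all assumptions made for illustration, and the actual schema of flat_results_nq_k3.json is not described here.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant IDs that appear among the top-k retrieved IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        # No ground truth for this query: define recall as 0.0 to avoid dividing by zero.
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Toy data with hypothetical passage IDs (not taken from the dataset).
ground_truth = {"q1": ["p1", "p2", "p3"], "q2": ["p7", "p8", "p9"]}
retrieved = {"q1": ["p1", "p4", "p3"], "q2": ["p9", "p8", "p7"]}

scores = [recall_at_k(retrieved[q], ground_truth[q], k=3) for q in ground_truth]
mean_recall = sum(scores) / len(scores)
print(f"mean recall@3 = {mean_recall:.3f}")  # prints "mean recall@3 = 0.833"
```

The comparison uses sets, so the order of results within the top k does not affect the score. The real evaluation scripts may define recall differently.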
Usage
To use this data, you can download it locally using the huggingface-hub library. First, install the library:
```bash
pip install huggingface-hub
```
Then, you can download the entire dataset to a local directory (e.g., data/) with the following Python script:
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="LEANN-RAG/leann-rag-evaluation-data",
    repo_type="dataset",
    local_dir="data"
)
```
This will download all the necessary files into a local data folder, preserving the repository structure. The evaluation scripts in the main LEANN-RAG Space are configured to work with this data structure.
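After the download finishes, it can be useful to confirm that the expected layout is in place before running an evaluation. The following is a small sketch of such a check. It assumes only the top-level entries named in this README (dpr/, rpj_wiki/, ground_truth/, and queries/nq_open.jsonl); the helper name missing_entries is hypothetical, and nothing deeper in the directory tree is assumed.

```python
from pathlib import Path

# Top-level entries named in this README; deeper paths are not assumed.
EXPECTED = ["dpr", "rpj_wiki", "ground_truth", "queries/nq_open.jsonl"]


def missing_entries(root="data"):
    """Return the expected entries that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


missing = missing_entries("data")
if missing:
    print("Missing from data/:", ", ".join(missing))
else:
    print("All expected entries are present in data/.")
```

If the dataset was downloaded to a different directory, pass that path to missing_entries instead of "data".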