🧪 LEANN Benchmarks & Testing

This directory contains performance benchmarks and comprehensive tests for the LEANN system, including backend comparisons and sanity checks across different configurations.

📁 Test Files

diskann_vs_hnsw_speed_comparison.py

Performance comparison between DiskANN and HNSW backends:

  • Search latency comparison for both backends with recompute enabled
  • Index size and build time measurements
  • Score validity testing (ensures no -inf scores)
  • Configurable dataset sizes for different scales
# Quick comparison with 500 docs, 10 queries
python benchmarks/diskann_vs_hnsw_speed_comparison.py

# Large-scale comparison with 2000 docs, 20 queries
python benchmarks/diskann_vs_hnsw_speed_comparison.py 2000 20
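The latency numbers in the comparison come down to timing each query and averaging. A minimal, generic sketch of that measurement pattern (not the script's actual code; `search_fn` stands in for whichever backend's search call is being benchmarked):

```python
import time
from statistics import mean

def mean_latency_ms(search_fn, queries, warmup=1):
    """Time a search callable over a list of queries; return mean latency in ms."""
    for q in queries[:warmup]:  # warm caches / lazy initialization first
        search_fn(q)
    timings = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q)
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)
```

Using `time.perf_counter` rather than `time.time` matters here: it is monotonic and has the highest available resolution, so sub-millisecond search calls are measured reliably.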

test_distance_functions.py

Tests all supported distance functions on the DiskANN backend:

  • MIPS (Maximum Inner Product Search)
  • L2 (Euclidean Distance)
  • Cosine (Cosine Similarity)
uv run python tests/sanity_checks/test_distance_functions.py

test_l2_verification.py

Specifically verifies that L2 distance is correctly implemented by:

  • Building indices with L2 vs Cosine metrics
  • Comparing search results and score ranges
  • Validating that different metrics produce expected score patterns
uv run python tests/sanity_checks/test_l2_verification.py
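The relationship being verified has a clean mathematical form: for unit-normalized vectors, squared L2 distance and cosine similarity are linked by ||a − b||² = 2 − 2·cos(a, b), so both metrics must produce the same ranking. A standalone NumPy illustration of that check (not the test's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # unit-normalize rows
query = rng.normal(size=8)
query /= np.linalg.norm(query)

cos = docs @ query                         # cosine similarity, in [-1, 1]
l2 = np.linalg.norm(docs - query, axis=1)  # L2 distance, always >= 0

# For unit vectors: ||a - b||^2 = 2 - 2*cos(a, b), so the best cosine
# match is exactly the smallest L2 distance.
assert np.allclose(l2**2, 2 - 2 * cos)
assert (np.argsort(-cos) == np.argsort(l2)).all()
```

This is also why the test inspects score ranges: L2 scores are non-negative (smaller is better), while cosine scores live in [-1, 1] (larger is better).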

test_sanity_check.py

Comprehensive end-to-end verification including:

  • Distance function testing
  • Embedding model compatibility
  • Search result correctness validation
  • Backend integration testing
uv run python tests/sanity_checks/test_sanity_check.py

🎯 What These Tests Verify

Distance Function Support

  • All three distance metrics (MIPS, L2, Cosine) work correctly
  • Score ranges are appropriate for each metric type
  • Different metrics can produce different rankings (as expected)
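The "different rankings" point is easy to see concretely: MIPS is sensitive to vector norms while cosine is not, so a large-norm document can win under MIPS yet lose under cosine. A small NumPy demonstration (illustrative only, not part of the test suite):

```python
import numpy as np

query = np.array([1.0, 0.0])
docs = np.array([
    [0.9, 0.1],  # well aligned with the query, small norm
    [3.0, 3.0],  # less aligned, but a much larger norm
])

mips = docs @ query                                              # inner product
cosine = mips / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
l2 = np.linalg.norm(docs - query, axis=1)

# MIPS favors the large-norm vector; cosine favors the better-aligned one.
assert np.argmax(mips) == 1
assert np.argmax(cosine) == 0
```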

Backend Integration

  • DiskANN backend properly initializes and builds indices
  • Graph construction completes without errors
  • Search operations return valid results

Embedding Pipeline

  • Real-time embedding computation works
  • Multiple embedding models are supported
  • ZMQ server communication functions correctly

End-to-End Functionality

  • Index building → searching → result retrieval pipeline
  • Metadata preservation through the entire flow
  • Error handling and graceful degradation
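The build → search → retrieve flow with metadata preservation can be sketched as a toy in-memory index. This is not LEANN's API; it is a minimal stand-in showing the invariant the tests check, namely that metadata attached at build time comes back intact with search results:

```python
import numpy as np

class ToyIndex:
    """Minimal build -> search -> retrieve pipeline preserving metadata."""
    def __init__(self):
        self.vectors, self.metadata = [], []

    def add(self, vector, meta):
        self.vectors.append(np.asarray(vector, dtype=float))
        self.metadata.append(meta)

    def search(self, query, k=2):
        mat = np.stack(self.vectors)
        q = np.asarray(query, dtype=float)
        scores = (mat @ q) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))
        top = np.argsort(-scores)[:k]  # top-k by cosine similarity
        return [(float(scores[i]), self.metadata[i]) for i in top]

index = ToyIndex()
index.add([1.0, 0.0], {"doc_id": "a", "source": "test"})
index.add([0.0, 1.0], {"doc_id": "b", "source": "test"})
results = index.search([0.9, 0.1], k=1)
assert results[0][1]["doc_id"] == "a"  # metadata survives the round trip
```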

🔍 Expected Output

When all tests pass, you should see:

📊 Test results summary:
  mips      : ✅ passed
  l2        : ✅ passed
  cosine    : ✅ passed

🎉 All tests complete!

🐛 Troubleshooting

Common Issues

Import Errors: Ensure you're running from the project root:

cd /path/to/leann
uv run python tests/sanity_checks/test_distance_functions.py

Memory Issues: Reduce graph complexity for resource-constrained systems:

from leann import LeannBuilder  # import path may vary by install

builder = LeannBuilder(
    backend_name="diskann",
    graph_degree=8,   # reduced from 16
    complexity=16,    # reduced from 32
)

ZMQ Port Conflicts: The tests use different ports to avoid conflicts, but you may need to kill existing processes:

pkill -f "embedding_server"

📊 Performance Expectations

Typical Timing (3 documents, consumer hardware):

  • Index Building: 2-5 seconds per distance function
  • Search Query: 50-200ms
  • Recompute Mode: 5-15 seconds (higher accuracy)

Memory Usage:

  • Index Storage: ~1-2 MB per distance function
  • Runtime Memory: ~500MB (including model loading)

🔗 Integration with CI/CD

These tests are designed to be run in automated environments:

# GitHub Actions example
- name: Run Sanity Checks
  run: |
    uv run python tests/sanity_checks/test_distance_functions.py
    uv run python tests/sanity_checks/test_l2_verification.py

The tests are deterministic and should produce consistent results across different platforms.