LAION Multimodal Benchmark

A multimodal benchmark that evaluates image retrieval and generation performance on a LAION dataset subset, using LEANN with CLIP embeddings for retrieval and Qwen2.5-VL for multimodal generation.

Overview

This benchmark evaluates:

  • Image retrieval timing using caption-based queries
  • Recall@K performance for image search
  • Complexity analysis across different search parameters
  • Index size and storage efficiency
  • Multimodal generation with Qwen2.5-VL for image understanding and description

Dataset Configuration

  • Dataset: LAION-400M subset (10,000 images)
  • Embeddings: Pre-computed CLIP ViT-B/32 (512 dimensions)
  • Queries: 200 random captions from the dataset
  • Ground Truth: Self-recall (query caption → original image)
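
Each line of evaluation_queries.jsonl pairs a caption with the image it was taken from, so every query carries its own ground truth. A record might look like the following (field names are illustrative, not necessarily the exact schema):

{"query_id": 42, "query": "a red bicycle leaning against a brick wall", "ground_truth_image_id": "laion_000042"}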

Quick Start

1. Setup the benchmark

cd benchmarks/laion
python setup_laion.py --num-samples 10000 --num-queries 200

This will:

  • Create dummy LAION data (10K samples)
  • Generate CLIP embeddings (512-dim)
  • Build LEANN index with HNSW backend
  • Create 200 evaluation queries
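
In outline, the dummy setup amounts to generating random unit vectors and sampling captions as queries. The following is a minimal, self-contained sketch with illustrative names, not the actual setup_laion.py internals (the real script also writes metadata and passages and builds the LEANN index):

import json
import os
import numpy as np

NUM_SAMPLES, NUM_QUERIES, DIM = 10_000, 200, 512
os.makedirs("data", exist_ok=True)

# Dummy CLIP-like embeddings: random, L2-normalized vectors
embeddings = np.random.randn(NUM_SAMPLES, DIM).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
np.save("data/laion_embeddings.npy", embeddings)

# Sample captions as queries; ground truth is the source image (self-recall)
rng = np.random.default_rng(0)
with open("data/evaluation_queries.jsonl", "w") as f:
    for qid in rng.choice(NUM_SAMPLES, size=NUM_QUERIES, replace=False):
        f.write(json.dumps({"query": f"caption for image {qid}",
                            "ground_truth_image_id": int(qid)}) + "\n")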

2. Run evaluation

# Run all evaluation stages
python evaluate_laion.py --index data/laion_index.leann

# Run specific stages
python evaluate_laion.py --index data/laion_index.leann --stage 2  # Recall evaluation
python evaluate_laion.py --index data/laion_index.leann --stage 3  # Complexity analysis
python evaluate_laion.py --index data/laion_index.leann --stage 4  # Index comparison
python evaluate_laion.py --index data/laion_index.leann --stage 5  # Multimodal generation

# Multimodal generation with Qwen2.5-VL
python evaluate_laion.py --index data/laion_index.leann --stage 5 --model-name Qwen/Qwen2.5-VL-7B-Instruct

3. Save results

python evaluate_laion.py --index data/laion_index.leann --output results.json

Configuration Options

Setup Options

python setup_laion.py \
  --num-samples 10000 \
  --num-queries 200 \
  --index-path data/laion_index.leann \
  --backend hnsw

Evaluation Options

python evaluate_laion.py \
  --index data/laion_index.leann \
  --queries data/evaluation_queries.jsonl \
  --complexity 64 \
  --top-k 3 \
  --num-samples 100 \
  --stage all

Evaluation Stages

Stage 2: Recall Evaluation

  • Evaluates Recall@3 for multimodal retrieval
  • Compares LEANN vs FAISS baseline performance
  • Self-recall: a query caption should retrieve its original image (see the sketch below)
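
With self-recall, Recall@K reduces to a membership test: a query counts as a hit if its source image appears in the top-K results. A minimal sketch, where search_fn stands in for the LEANN searcher and is assumed to return result objects with an id field:

def recall_at_k(queries, search_fn, k=3):
    # Fraction of queries whose ground-truth image shows up in the top-k results
    hits = 0
    for q in queries:
        result_ids = {r.id for r in search_fn(q["query"], top_k=k)}
        hits += q["ground_truth_image_id"] in result_ids
    return hits / len(queries)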

Stage 3: Complexity Analysis

  • Binary search for optimal complexity (90% recall target)
  • Tests performance across different complexity levels
  • Analyzes speed vs. accuracy tradeoffs
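
The binary search assumes recall is roughly monotonically non-decreasing in the complexity parameter, so the smallest complexity meeting the target can be bracketed in O(log N) evaluations. A sketch of the idea, where recall_at_complexity is a hypothetical helper that runs the recall evaluation at a given complexity:

def find_optimal_complexity(recall_at_complexity, target=0.90, lo=1, hi=128):
    best = hi
    while lo <= hi:
        mid = (lo + hi) // 2
        if recall_at_complexity(mid) >= target:
            best, hi = mid, mid - 1  # target met: try a cheaper setting
        else:
            lo = mid + 1             # target missed: need more search effort
    return best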

Stage 4: Index Comparison

  • Compares compact vs non-compact index sizes
  • Measures search performance differences
  • Reports storage efficiency and speed ratios
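
Index sizes can be compared by summing the file sizes under each index directory; a small sketch using only the standard library (the compact-index path is illustrative):

import os

def dir_size_mb(path):
    # Total size of all files under path, in megabytes
    total = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _dirs, names in os.walk(path)
        for name in names
    )
    return total / (1024 * 1024)

compact = dir_size_mb("data/laion_index_compact.leann")
full = dir_size_mb("data/laion_index.leann")
print(f"storage savings: {100 * (1 - compact / full):.1f}%")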

Stage 5: Multimodal Generation

  • Uses Qwen2.5-VL for image understanding and description
  • Retrieval-Augmented Generation (RAG) with multimodal context
  • Measures both search and generation timing
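
The generation step can be sketched with Hugging Face transformers (Qwen2_5_VLForConditionalGeneration requires a recent transformers release); this is a minimal approximation rather than the benchmark's exact generation code, with retrieval wiring omitted and the image path illustrative:

import time
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("data/laion_images/laion_000042.jpg")  # a retrieved image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in one sentence."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(f"generation took {time.perf_counter() - start:.3f}s")
print(processor.batch_decode(output[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])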

Output Metrics

Timing Metrics

  • Average/median/min/max search time
  • Standard deviation
  • Searches per second
  • Latency in milliseconds
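
All of these can be derived from per-query wall-clock timings; a small sketch using the standard library (search_fn again stands in for the LEANN searcher):

import statistics
import time

def timing_stats(search_fn, queries, top_k=3):
    times = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q["query"], top_k=top_k)
        times.append(time.perf_counter() - start)
    return {
        "avg_ms": 1000 * statistics.mean(times),
        "median_ms": 1000 * statistics.median(times),
        "min_ms": 1000 * min(times),
        "max_ms": 1000 * max(times),
        "stdev_ms": 1000 * statistics.stdev(times),
        "searches_per_sec": len(times) / sum(times),
    }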

Recall Metrics

  • Recall@3 percentage for image retrieval
  • Number of queries with ground truth

Index Metrics

  • Total index size (MB)
  • Component breakdown (index, passages, metadata)
  • Storage savings (compact vs non-compact)
  • Backend and embedding model info

Generation Metrics (Stage 5)

  • Average search time per query
  • Average generation time per query
  • Time distribution (search vs generation)
  • Sample multimodal responses
  • Model: Qwen2.5-VL performance

Benchmark Results

LEANN-RAG Performance (CLIP ViT-L/14 + Qwen2.5-VL)

Stage 3: Optimal Complexity Analysis

  • Optimal Complexity: 85 (achieving 90% Recall@3)
  • Binary Search Range: 1-128
  • Target Recall: 90%
  • Index Type: Non-compact (for fast binary search)

Stage 5: Multimodal Generation Performance (Qwen2.5-VL)

  • Total Queries: 20
  • Average Search Time: 1.200s per query
  • Average Generation Time: 6.558s per query
  • Time Distribution: Search 15.5%, Generation 84.5%
  • LLM Backend: HuggingFace transformers
  • Model: Qwen/Qwen2.5-VL-7B-Instruct
  • Optimal Complexity: 85

System Performance:

  • Index Contents: ~10,000 image embeddings from the LAION subset
  • Embedding Model: CLIP ViT-L/14 (768 dimensions)
  • Backend: HNSW with cosine distance

Example Results

🎯 LAION MULTIMODAL BENCHMARK RESULTS
============================================================

📊 Multimodal Generation Results:
  Total Queries: 20
  Avg Search Time: 1.200s
  Avg Generation Time: 6.558s
  Time Distribution: Search 15.5%, Generation 84.5%
  LLM Backend: HuggingFace transformers
  Model: Qwen/Qwen2.5-VL-7B-Instruct

⚙️ Optimal Complexity Analysis:
  Target Recall: 90%
  Optimal Complexity: 85
  Binary Search Range: 1-128
  Non-compact Index (fast search, no recompute)

🚀 Performance Summary:
  Multimodal RAG: 7.758s total per query
  Search: 15.5% of total time
  Generation: 84.5% of total time

Directory Structure

benchmarks/laion/
├── setup_laion.py           # Setup script
├── evaluate_laion.py        # Evaluation script
├── README.md               # This file
└── data/                   # Generated data
    ├── laion_images/       # Image files (placeholder)
    ├── laion_metadata.jsonl # Image metadata
    ├── laion_passages.jsonl # LEANN passages
    ├── laion_embeddings.npy # CLIP embeddings
    ├── evaluation_queries.jsonl # Evaluation queries
    └── laion_index.leann/  # LEANN index files

Notes

  • Current implementation uses dummy data for demonstration
  • For real LAION data, implement actual download logic in setup_laion.py
  • CLIP embeddings are randomly generated; replace them with a real CLIP model for production (see the sketch below)
  • Adjust num_samples and num_queries based on available resources
  • Consider using --num-samples during evaluation for faster testing
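
For the production path mentioned above, real CLIP image embeddings can be computed with the Hugging Face CLIPModel; a minimal sketch, assuming openai/clip-vit-base-patch32 to match the 512-dim default (image paths are illustrative):

import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"  # 512-dim, matching the default setup
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

images = [Image.open("data/laion_images/laion_000000.jpg")]
inputs = processor(images=images, return_tensors="pt")
features = model.get_image_features(**inputs)
features = features / features.norm(dim=-1, keepdim=True)  # L2-normalize for cosine search
np.save("data/laion_embeddings.npy", features.detach().numpy().astype(np.float32))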