LAION Multimodal Benchmark

A multimodal benchmark that evaluates image retrieval and generation performance on a LAION dataset subset, using LEANN with CLIP embeddings for retrieval and Qwen2.5-VL for multimodal generation.

Overview

This benchmark evaluates:

  • Image retrieval timing using caption-based queries
  • Recall@K performance for image search
  • Complexity analysis across different search parameters
  • Index size and storage efficiency
  • Multimodal generation with Qwen2.5-VL for image understanding and description

Dataset Configuration

  • Dataset: LAION-400M subset (10,000 images)
  • Embeddings: Pre-computed CLIP ViT-B/32 (512 dimensions)
  • Queries: 200 random captions from the dataset
  • Ground Truth: Self-recall (query caption → original image)
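
Each line of evaluation_queries.jsonl pairs a caption with the image it was taken from, so every query carries its own ground truth. A record might look like the following (field names are illustrative, not necessarily the exact schema):

{"query_id": 42, "query": "a red bicycle leaning against a brick wall", "ground_truth_image_id": "laion_000042"}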

Quick Start

1. Setup the benchmark

cd benchmarks/laion
python setup_laion.py --num-samples 10000 --num-queries 200

This will:

  • Create dummy LAION data (10K samples)
  • Generate CLIP embeddings (512-dim)
  • Build LEANN index with HNSW backend
  • Create 200 evaluation queries
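
In outline, the dummy setup amounts to generating random unit vectors and sampling captions as queries. The following is a minimal, self-contained sketch with illustrative names, not the actual setup_laion.py internals (the real script also writes metadata and passages and builds the LEANN index):

import json
import os
import numpy as np

NUM_SAMPLES, NUM_QUERIES, DIM = 10_000, 200, 512
os.makedirs("data", exist_ok=True)

# Dummy CLIP-like embeddings: random, L2-normalized vectors
embeddings = np.random.randn(NUM_SAMPLES, DIM).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
np.save("data/laion_embeddings.npy", embeddings)

# Sample captions as queries; ground truth is the source image (self-recall)
rng = np.random.default_rng(0)
with open("data/evaluation_queries.jsonl", "w") as f:
    for qid in rng.choice(NUM_SAMPLES, size=NUM_QUERIES, replace=False):
        f.write(json.dumps({"query": f"caption for image {qid}",
                            "ground_truth_image_id": int(qid)}) + "\n")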

2. Run evaluation

# Run all evaluation stages
python evaluate_laion.py --index data/laion_index.leann

# Run specific stages
python evaluate_laion.py --index data/laion_index.leann --stage 2  # Recall evaluation
python evaluate_laion.py --index data/laion_index.leann --stage 3  # Complexity analysis
python evaluate_laion.py --index data/laion_index.leann --stage 4  # Index comparison
python evaluate_laion.py --index data/laion_index.leann --stage 5  # Multimodal generation

# Multimodal generation with Qwen2.5-VL
python evaluate_laion.py --index data/laion_index.leann --stage 5 --model-name Qwen/Qwen2.5-VL-7B-Instruct

3. Save results

python evaluate_laion.py --index data/laion_index.leann --output results.json

Configuration Options

Setup Options

python setup_laion.py \
  --num-samples 10000 \
  --num-queries 200 \
  --index-path data/laion_index.leann \
  --backend hnsw

Evaluation Options

python evaluate_laion.py \
  --index data/laion_index.leann \
  --queries data/evaluation_queries.jsonl \
  --complexity 64 \
  --top-k 3 \
  --num-samples 100 \
  --stage all

Evaluation Stages

Stage 2: Recall Evaluation

  • Evaluates Recall@3 for multimodal retrieval
  • Compares LEANN vs FAISS baseline performance
  • Self-recall: a query caption should retrieve its original image (see the sketch below)
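
With self-recall, Recall@K reduces to a membership test: a query counts as a hit if its source image appears in the top-K results. A minimal sketch, where search_fn stands in for the LEANN searcher and is assumed to return result objects with an id field:

def recall_at_k(queries, search_fn, k=3):
    # Fraction of queries whose ground-truth image shows up in the top-k results
    hits = 0
    for q in queries:
        result_ids = {r.id for r in search_fn(q["query"], top_k=k)}
        hits += q["ground_truth_image_id"] in result_ids
    return hits / len(queries)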

Stage 3: Complexity Analysis

  • Binary search for optimal complexity (90% recall target)
  • Tests performance across different complexity levels
  • Analyzes speed vs. accuracy tradeoffs
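
The binary search assumes recall is roughly monotonically non-decreasing in the complexity parameter, so the smallest complexity meeting the target can be bracketed in O(log N) evaluations. A sketch of the idea, where recall_at_complexity is a hypothetical helper that runs the recall evaluation at a given complexity:

def find_optimal_complexity(recall_at_complexity, target=0.90, lo=1, hi=128):
    best = hi
    while lo <= hi:
        mid = (lo + hi) // 2
        if recall_at_complexity(mid) >= target:
            best, hi = mid, mid - 1  # target met: try a cheaper setting
        else:
            lo = mid + 1             # target missed: need more search effort
    return best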

Stage 4: Index Comparison

  • Compares compact vs non-compact index sizes
  • Measures search performance differences
  • Reports storage efficiency and speed ratios
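
Index sizes can be compared by summing the file sizes under each index directory; a small sketch using only the standard library (the compact-index path is illustrative):

import os

def dir_size_mb(path):
    # Total size of all files under path, in megabytes
    total = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _dirs, names in os.walk(path)
        for name in names
    )
    return total / (1024 * 1024)

compact = dir_size_mb("data/laion_index_compact.leann")
full = dir_size_mb("data/laion_index.leann")
print(f"storage savings: {100 * (1 - compact / full):.1f}%")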

Stage 5: Multimodal Generation

  • Uses Qwen2.5-VL for image understanding and description
  • Retrieval-Augmented Generation (RAG) with multimodal context
  • Measures both search and generation timing
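
The generation step can be sketched with Hugging Face transformers (Qwen2_5_VLForConditionalGeneration requires a recent transformers release); this is a minimal approximation rather than the benchmark's exact generation code, with retrieval wiring omitted and the image path illustrative:

import time
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("data/laion_images/laion_000042.jpg")  # a retrieved image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in one sentence."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(f"generation took {time.perf_counter() - start:.3f}s")
print(processor.batch_decode(output[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])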

Output Metrics

Timing Metrics

  • Average/median/min/max search time
  • Standard deviation
  • Searches per second
  • Latency in milliseconds
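
All of these can be derived from per-query wall-clock timings; a small sketch using the standard library (search_fn again stands in for the LEANN searcher):

import statistics
import time

def timing_stats(search_fn, queries, top_k=3):
    times = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q["query"], top_k=top_k)
        times.append(time.perf_counter() - start)
    return {
        "avg_ms": 1000 * statistics.mean(times),
        "median_ms": 1000 * statistics.median(times),
        "min_ms": 1000 * min(times),
        "max_ms": 1000 * max(times),
        "stdev_ms": 1000 * statistics.stdev(times),
        "searches_per_sec": len(times) / sum(times),
    }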

Recall Metrics

  • Recall@3 percentage for image retrieval
  • Number of queries with ground truth

Index Metrics

  • Total index size (MB)
  • Component breakdown (index, passages, metadata)
  • Storage savings (compact vs non-compact)
  • Backend and embedding model info

Generation Metrics (Stage 5)

  • Average search time per query
  • Average generation time per query
  • Time distribution (search vs generation)
  • Sample multimodal responses
  • Model: Qwen2.5-VL performance

Benchmark Results

LEANN-RAG Performance (CLIP ViT-L/14 + Qwen2.5-VL)

Stage 3: Optimal Complexity Analysis

  • Optimal Complexity: 85 (achieving 90% Recall@3)
  • Binary Search Range: 1-128
  • Target Recall: 90%
  • Index Type: Non-compact (for fast binary search)

Stage 5: Multimodal Generation Performance (Qwen2.5-VL)

  • Total Queries: 20
  • Average Search Time: 1.200s per query
  • Average Generation Time: 6.558s per query
  • Time Distribution: Search 15.5%, Generation 84.5%
  • LLM Backend: HuggingFace transformers
  • Model: Qwen/Qwen2.5-VL-7B-Instruct
  • Optimal Complexity: 85

System Performance:

  • Index Contents: ~10,000 image embeddings from the LAION subset
  • Embedding Model: CLIP ViT-L/14 (768 dimensions)
  • Backend: HNSW with cosine distance

Example Results

🎯 LAION MULTIMODAL BENCHMARK RESULTS
============================================================

📊 Multimodal Generation Results:
  Total Queries: 20
  Avg Search Time: 1.200s
  Avg Generation Time: 6.558s
  Time Distribution: Search 15.5%, Generation 84.5%
  LLM Backend: HuggingFace transformers
  Model: Qwen/Qwen2.5-VL-7B-Instruct

⚙️ Optimal Complexity Analysis:
  Target Recall: 90%
  Optimal Complexity: 85
  Binary Search Range: 1-128
  Non-compact Index (fast search, no recompute)

🚀 Performance Summary:
  Multimodal RAG: 7.758s total per query
  Search: 15.5% of total time
  Generation: 84.5% of total time

Directory Structure

benchmarks/laion/
├── setup_laion.py           # Setup script
├── evaluate_laion.py        # Evaluation script
├── README.md               # This file
└── data/                   # Generated data
    ├── laion_images/       # Image files (placeholder)
    ├── laion_metadata.jsonl # Image metadata
    ├── laion_passages.jsonl # LEANN passages
    ├── laion_embeddings.npy # CLIP embeddings
    ├── evaluation_queries.jsonl # Evaluation queries
    └── laion_index.leann/  # LEANN index files

Notes

  • Current implementation uses dummy data for demonstration
  • For real LAION data, implement actual download logic in setup_laion.py
  • CLIP embeddings are randomly generated; replace them with a real CLIP model for production (see the sketch below)
  • Adjust num_samples and num_queries based on available resources
  • Consider using --num-samples during evaluation for faster testing
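
For the production path mentioned above, real CLIP image embeddings can be computed with the Hugging Face CLIPModel; a minimal sketch, assuming openai/clip-vit-base-patch32 to match the 512-dim default (image paths are illustrative):

import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"  # 512-dim, matching the default setup
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

images = [Image.open("data/laion_images/laion_000000.jpg")]
inputs = processor(images=images, return_tensors="pt")
features = model.get_image_features(**inputs)
features = features / features.norm(dim=-1, keepdim=True)  # L2-normalize for cosine search
np.save("data/laion_embeddings.npy", features.detach().numpy().astype(np.float32))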