LAION Multimodal Benchmark

A multimodal benchmark for evaluating image retrieval performance with LEANN, using CLIP embeddings on a LAION dataset subset.

Overview

This benchmark evaluates:

  • Image retrieval timing using caption-based queries
  • Recall@K performance for image search
  • Complexity analysis across different search parameters
  • Index size and storage efficiency

Dataset Configuration

  • Dataset: LAION-400M subset (10,000 images)
  • Embeddings: Pre-computed CLIP ViT-B/32 (512 dimensions; see the generation sketch below)
  • Queries: 200 random captions from the dataset
  • Ground Truth: Self-recall (query caption → original image)
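
The shipped embeddings are random placeholders (see Notes). For reference, here is a minimal sketch of how real CLIP ViT-B/32 vectors could be produced with the sentence-transformers package; the image_path field is a hypothetical metadata key, and none of this code is part of setup_laion.py:

# Sketch: produce real 512-dim CLIP ViT-B/32 image embeddings.
# Assumes sentence-transformers is installed; "image_path" is a
# hypothetical metadata field, not necessarily what setup_laion.py writes.
import json
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")  # 512-dimensional CLIP model

with open("data/laion_metadata.jsonl") as f:
    records = [json.loads(line) for line in f]

images = [Image.open(r["image_path"]) for r in records]
embeddings = model.encode(images, batch_size=64, show_progress_bar=True)
np.save("data/laion_embeddings.npy", embeddings.astype(np.float32))

Caption queries can be encoded with the same model, since CLIP maps text and images into a shared embedding space.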

Quick Start

1. Set up the benchmark

cd benchmarks/laion
python setup_laion.py --num-samples 10000 --num-queries 200

This will:

  • Create dummy LAION data (10K samples)
  • Generate CLIP embeddings (512-dim)
  • Build a LEANN index with the HNSW backend (sketched below)
  • Create 200 evaluation queries
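
Conceptually, the index-building step boils down to the sketch below. It assumes LEANN's high-level LeannBuilder API and a text field on each passage record; the actual internals of setup_laion.py may differ, for example by registering the pre-computed CLIP vectors directly:

# Sketch of building a LEANN index over the passages (HNSW backend).
# Assumes the high-level LeannBuilder API; setup_laion.py may differ.
import json
from leann import LeannBuilder

builder = LeannBuilder(backend_name="hnsw")
with open("data/laion_passages.jsonl") as f:
    for line in f:
        passage = json.loads(line)
        builder.add_text(passage["text"])  # field name is an assumption
builder.build_index("data/laion_index.leann")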

2. Run evaluation

# Run all evaluation stages
python evaluate_laion.py --index data/laion_index.leann

# Run specific stages
python evaluate_laion.py --index data/laion_index.leann --stage timing
python evaluate_laion.py --index data/laion_index.leann --stage recall
python evaluate_laion.py --index data/laion_index.leann --stage complexity

3. Save results

python evaluate_laion.py --index data/laion_index.leann --output results.json

Configuration Options

Setup Options

python setup_laion.py \
  --num-samples 10000 \
  --num-queries 200 \
  --index-path data/laion_index.leann \
  --backend hnsw

Evaluation Options

python evaluate_laion.py \
  --index data/laion_index.leann \
  --queries data/evaluation_queries.jsonl \
  --complexity 64 \
  --top-k 3 \
  --num-samples 100 \
  --stage all

Evaluation Stages

Stage 1: Index Analysis

  • Analyzes index file sizes and metadata
  • Reports storage efficiency

Stage 2: Search Timing

  • Measures average search latency
  • Tests with configurable complexity and top-k
  • Reports searches per second (see the sketch below)
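
In essence, the timing stage is a loop like the following; searcher.search stands in for LEANN's query call, and its exact signature is an assumption:

# Sketch of the timing stage: run every query once and collect latencies.
import statistics
import time

def time_searches(searcher, queries, top_k=3, complexity=64):
    latencies = []
    for query in queries:
        start = time.perf_counter()
        searcher.search(query, top_k=top_k, complexity=complexity)  # assumed signature
        latencies.append(time.perf_counter() - start)
    return {
        "avg_s": statistics.mean(latencies),
        "median_s": statistics.median(latencies),
        "min_s": min(latencies),
        "max_s": max(latencies),
        "std_s": statistics.stdev(latencies),
        "searches_per_sec": len(latencies) / sum(latencies),
    }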

Stage 3: Recall Evaluation

  • Evaluates Recall@K using ground truth
  • Self-recall: each query caption should retrieve its original image (see the sketch below)
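
Recall@K here reduces to checking whether a query's ground-truth image ID appears among its top-K results. A minimal sketch, where the query fields and the result attribute are assumptions:

# Sketch of self-recall: a hit when the ground-truth image id is in the top-k.
def recall_at_k(searcher, queries, k=3):
    hits = 0
    for query in queries:  # assumed fields: "caption", "ground_truth_id"
        results = searcher.search(query["caption"], top_k=k)
        retrieved = {r.id for r in results}  # result attribute is an assumption
        hits += query["ground_truth_id"] in retrieved
    return 100.0 * hits / len(queries)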

Stage 4: Complexity Analysis

  • Tests performance across different complexity levels [16, 32, 64, 128]
  • Analyzes speed vs. accuracy trade-offs (see the sketch below)
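
The sweep simply repeats the timing loop at each complexity level, reusing the time_searches helper sketched under Stage 2:

# Sketch of the complexity sweep over the tested levels.
for complexity in [16, 32, 64, 128]:
    stats = time_searches(searcher, queries, top_k=3, complexity=complexity)
    print(f"Complexity {complexity:3d}: {stats['avg_s']:.3f}s avg")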

Output Metrics

Timing Metrics

  • Average/median/min/max search time
  • Standard deviation
  • Searches per second
  • Latency in milliseconds

Recall Metrics

  • Recall@K percentage
  • Number of queries with ground truth

Index Metrics

  • Total index size (MB)
  • Component breakdown (index, passages, metadata)
  • Backend and embedding model info (see the sketch below)
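
Because a .leann index is a directory (see Directory Structure), the size metrics can be reproduced with a stdlib-only walk; the path is the default from the setup options:

# Sketch: per-file and total size of the index directory, in MB.
import os

def index_size_report(index_dir="data/laion_index.leann"):
    total = 0
    for root, _, files in os.walk(index_dir):
        for name in files:
            size = os.path.getsize(os.path.join(root, name))
            total += size
            print(f"{name}: {size / (1024 * 1024):.1f} MB")
    print(f"Total: {total / (1024 * 1024):.1f} MB")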

Example Results

🎯 LAION MULTIMODAL BENCHMARK RESULTS
============================================================

📏 Index Information:
  Total size: 145.2 MB
  Backend: hnsw
  Embedding model: clip-vit-b-32
  Total passages: 10000

⚡ Search Performance:
  Total queries: 200
  Average search time: 0.023s
  Median search time: 0.021s
  Min/Max search time: 0.012s / 0.089s
  Std dev: 0.008s
  Complexity: 64
  Top-K: 3

📊 Recall Performance:
  Recall@3: 85.5%
  Queries with ground truth: 200

⚙️ Complexity Analysis:
  Complexity  16: 0.015s avg
  Complexity  32: 0.019s avg
  Complexity  64: 0.023s avg
  Complexity 128: 0.031s avg

🚀 Performance Summary:
  Searches per second: 43.5
  Latency (ms): 23.0ms

Directory Structure

benchmarks/laion/
├── setup_laion.py           # Setup script
├── evaluate_laion.py        # Evaluation script
├── README.md               # This file
└── data/                   # Generated data
    ├── laion_images/       # Image files (placeholder)
    ├── laion_metadata.jsonl # Image metadata
    ├── laion_passages.jsonl # LEANN passages
    ├── laion_embeddings.npy # CLIP embeddings
    ├── evaluation_queries.jsonl # Evaluation queries
    └── laion_index.leann/  # LEANN index files

Notes

  • The current implementation uses dummy data for demonstration
  • For real LAION data, implement the actual download logic in setup_laion.py
  • CLIP embeddings are randomly generated; replace them with a real CLIP model for production use (see the sketch under Dataset Configuration)
  • Adjust --num-samples and --num-queries based on available resources
  • Consider passing --num-samples during evaluation for faster testing