# LAION Multimodal Benchmark
A multimodal benchmark that evaluates image retrieval and generation performance on a LAION dataset subset, using LEANN with CLIP embeddings for retrieval and Qwen2.5-VL for multimodal generation.
## Overview
This benchmark evaluates:
- Image retrieval timing using caption-based queries
- Recall@K performance for image search
- Complexity analysis across different search parameters
- Index size and storage efficiency
- Multimodal generation with Qwen2.5-VL for image understanding and description
## Dataset Configuration
- Dataset: LAION-400M subset (10,000 images)
- Embeddings: Pre-computed CLIP ViT-B/32 (512 dimensions)
- Queries: 200 random captions from the dataset
- Ground Truth: Self-recall (query caption → original image)
## Quick Start

### 1. Set up the benchmark
```bash
cd benchmarks/laion
python setup_laion.py --num-samples 10000 --num-queries 200
```
This will:
- Create dummy LAION data (10K samples)
- Generate CLIP embeddings (512-dim; see the sketch after this list)
- Build LEANN index with HNSW backend
- Create 200 evaluation queries
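The generated embeddings are synthetic stand-ins. A minimal sketch of the idea, assuming NumPy and the file layout shown under Directory Structure (the exact logic lives in `setup_laion.py`):

```python
import numpy as np

NUM_SAMPLES, DIM = 10_000, 512  # matches --num-samples and CLIP ViT-B/32

# Random unit vectors stand in for real CLIP image embeddings.
rng = np.random.default_rng(42)
embeddings = rng.standard_normal((NUM_SAMPLES, DIM)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit norm, so inner product == cosine

np.save("data/laion_embeddings.npy", embeddings)
```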
### 2. Run evaluation
```bash
# Run all evaluation stages
python evaluate_laion.py --index data/laion_index.leann

# Run specific stages
python evaluate_laion.py --index data/laion_index.leann --stage 2  # Recall evaluation
python evaluate_laion.py --index data/laion_index.leann --stage 3  # Complexity analysis
python evaluate_laion.py --index data/laion_index.leann --stage 4  # Index comparison
python evaluate_laion.py --index data/laion_index.leann --stage 5  # Multimodal generation

# Multimodal generation with Qwen2.5-VL
python evaluate_laion.py --index data/laion_index.leann --stage 5 --model-name Qwen/Qwen2.5-VL-7B-Instruct
```
### 3. Save results

```bash
python evaluate_laion.py --index data/laion_index.leann --output results.json
```
## Configuration Options

### Setup Options
```bash
python setup_laion.py \
    --num-samples 10000 \
    --num-queries 200 \
    --index-path data/laion_index.leann \
    --backend hnsw
```
### Evaluation Options
```bash
python evaluate_laion.py \
    --index data/laion_index.leann \
    --queries data/evaluation_queries.jsonl \
    --complexity 64 \
    --top-k 3 \
    --num-samples 100 \
    --stage all
```
## Evaluation Stages

### Stage 2: Recall Evaluation
- Evaluates Recall@3 for multimodal retrieval
- Compares LEANN vs FAISS baseline performance
- Self-recall: the query caption should retrieve its original image (sketched below)
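Since each query caption maps back to exactly one source image, Recall@K reduces to a membership check. A sketch, where `search_fn` is a hypothetical wrapper around the index search (not the exact LEANN API):

```python
def recall_at_k(queries, search_fn, k=3):
    """Fraction of queries whose source image appears in the top-k results.

    `queries` is a list of (caption, ground_truth_image_id) pairs;
    `search_fn(caption, k)` returns the IDs of the top-k retrieved images.
    """
    hits = sum(gt_id in search_fn(caption, k) for caption, gt_id in queries)
    return hits / len(queries)
```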
### Stage 3: Complexity Analysis
- Binary search for the optimal complexity (90% recall target; see the sketch after this list)
- Tests performance across different complexity levels
- Analyzes speed vs. accuracy tradeoffs
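The binary search exploits the fact that recall is (approximately) non-decreasing in the HNSW search complexity. A sketch, assuming a hypothetical `measure_recall(complexity)` helper:

```python
def find_optimal_complexity(measure_recall, target=0.90, lo=1, hi=128):
    """Smallest search complexity whose measured recall reaches the target."""
    best = hi
    while lo <= hi:
        mid = (lo + hi) // 2
        if measure_recall(mid) >= target:
            best, hi = mid, mid - 1  # target met: try a cheaper setting
        else:
            lo = mid + 1             # target missed: enlarge the candidate list
    return best
```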
### Stage 4: Index Comparison
- Compares compact vs. non-compact index sizes (see the sketch after this list)
- Measures search performance differences
- Reports storage efficiency and speed ratios
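Sizes can be compared directly on disk. A small sketch; the non-compact index path is hypothetical:

```python
from pathlib import Path

def dir_size_mb(path: str) -> float:
    """Total size of all files under `path`, in MB."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 2**20

compact = dir_size_mb("data/laion_index.leann")
full = dir_size_mb("data/laion_index_noncompact.leann")  # hypothetical path
print(f"compact: {compact:.1f} MB, non-compact: {full:.1f} MB, savings: {1 - compact / full:.1%}")
```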
### Stage 5: Multimodal Generation
- Uses Qwen2.5-VL for image understanding and description (see the sketch after this list)
- Retrieval-Augmented Generation (RAG) with multimodal context
- Measures both search and generation timing
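A hedged sketch of the generation step with Hugging Face transformers, following the public Qwen2.5-VL model card; the retrieved image path is a placeholder supplied by the retrieval stage:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image_path = "data/laion_images/00042.jpg"  # placeholder: top-1 retrieved image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[Image.open(image_path)], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens before decoding the answer.
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])
```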
## Output Metrics

### Timing Metrics
- Average/median/min/max search time (computed as sketched after this list)
- Standard deviation
- Searches per second
- Latency in milliseconds
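All of these fall out of the raw per-query latency list. A minimal sketch using only the standard library:

```python
import statistics

def timing_stats(latencies_s: list[float]) -> dict:
    """Summarize per-query search latencies (seconds)."""
    avg = statistics.mean(latencies_s)
    return {
        "avg_ms": avg * 1e3,
        "median_ms": statistics.median(latencies_s) * 1e3,
        "min_ms": min(latencies_s) * 1e3,
        "max_ms": max(latencies_s) * 1e3,
        "stdev_ms": statistics.stdev(latencies_s) * 1e3,
        "searches_per_sec": 1.0 / avg,
    }
```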
### Recall Metrics
- Recall@3 percentage for image retrieval
- Number of queries with ground truth
### Index Metrics
- Total index size (MB)
- Component breakdown (index, passages, metadata)
- Storage savings (compact vs non-compact)
- Backend and embedding model info
### Generation Metrics (Stage 5)
- Average search time per query
- Average generation time per query
- Time distribution (search vs generation)
- Sample multimodal responses
- Model used (Qwen2.5-VL)
## Benchmark Results

### LEANN-RAG Performance (CLIP ViT-L/14 + Qwen2.5-VL)

#### Stage 3: Optimal Complexity Analysis
- Optimal Complexity: 85 (achieving 90% Recall@3)
- Binary Search Range: 1-128
- Target Recall: 90%
- Index Type: Non-compact (for fast binary search)
#### Stage 5: Multimodal Generation Performance (Qwen2.5-VL)
- Total Queries: 20
- Average Search Time: 1.200s per query
- Average Generation Time: 6.558s per query
- Time Distribution: Search 15.5%, Generation 84.5%
- LLM Backend: HuggingFace transformers
- Model: Qwen/Qwen2.5-VL-7B-Instruct
- Optimal Complexity: 85
#### System Performance
- Index Contents: ~10,000 image embeddings from the LAION subset
- Embedding Model: CLIP ViT-L/14 (768 dimensions)
- Backend: HNSW with cosine distance
## Example Results
```text
🎯 LAION MULTIMODAL BENCHMARK RESULTS
============================================================
📊 Multimodal Generation Results:
   Total Queries: 20
   Avg Search Time: 1.200s
   Avg Generation Time: 6.558s
   Time Distribution: Search 15.5%, Generation 84.5%
   LLM Backend: HuggingFace transformers
   Model: Qwen/Qwen2.5-VL-7B-Instruct

⚙️ Optimal Complexity Analysis:
   Target Recall: 90%
   Optimal Complexity: 85
   Binary Search Range: 1-128
   Non-compact Index (fast search, no recompute)

🚀 Performance Summary:
   Multimodal RAG: 7.758s total per query
   Search: 15.5% of total time
   Generation: 84.5% of total time
```
## Directory Structure
```text
benchmarks/laion/
├── setup_laion.py               # Setup script
├── evaluate_laion.py            # Evaluation script
├── README.md                    # This file
└── data/                        # Generated data
    ├── laion_images/            # Image files (placeholder)
    ├── laion_metadata.jsonl     # Image metadata
    ├── laion_passages.jsonl     # LEANN passages
    ├── laion_embeddings.npy     # CLIP embeddings
    ├── evaluation_queries.jsonl # Evaluation queries
    └── laion_index.leann/       # LEANN index files
```
## Notes
- The current implementation uses dummy data for demonstration
- For real LAION data, implement actual download logic in `setup_laion.py`
- CLIP embeddings are randomly generated; replace them with a real CLIP model for production (see the sketch below)
- Adjust `num_samples` and `num_queries` based on available resources
- Consider using `--num-samples` during evaluation for faster testing
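To swap the random vectors for real ones, any CLIP implementation works; a hedged sketch with Hugging Face transformers, assuming the 512-dim ViT-B/32 checkpoint used in the setup:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths: list[str]):
    """Return L2-normalized 512-dim CLIP image embeddings as a NumPy array."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return (feats / feats.norm(dim=-1, keepdim=True)).numpy()
```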