# LAION Multimodal Benchmark

A multimodal benchmark for evaluating image retrieval and generation performance, using LEANN with CLIP embeddings for retrieval and Qwen2.5-VL for multimodal generation, on a subset of the LAION dataset.

## Overview

This benchmark evaluates:

- **Image retrieval timing** using caption-based queries
- **Recall@K performance** for image search
- **Complexity analysis** across different search parameters
- **Index size and storage efficiency**
- **Multimodal generation** with Qwen2.5-VL for image understanding and description

## Dataset Configuration

- **Dataset**: LAION-400M subset (10,000 images)
- **Embeddings**: Pre-computed CLIP ViT-B/32 (512 dimensions)
- **Queries**: 200 random captions from the dataset
- **Ground Truth**: Self-recall (query caption → original image)
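
Each query lives on its own line of `data/evaluation_queries.jsonl`. A hypothetical record is sketched below; the actual field names are whatever `setup_laion.py` writes, so treat this purely as an illustration:

```python
import json

# Hypothetical record -- field names are illustrative, not the script's actual schema
query = {
    "query_id": 42,
    "caption": "a photo of a red vintage car",
    "ground_truth_image_id": "laion_000042",
}
with open("data/evaluation_queries.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(query) + "\n")
```
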
## Quick Start

### 1. Setup the benchmark

```bash
cd benchmarks/laion
python setup_laion.py --num-samples 10000 --num-queries 200
```

This will:

- Create dummy LAION data (10K samples)
- Generate CLIP embeddings (512-dim; sketched after this list)
- Build a LEANN index with the HNSW backend
- Create 200 evaluation queries
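
The dummy-embedding step amounts to drawing random unit-norm vectors in place of real CLIP outputs. A minimal sketch of that idea, assuming the data layout above (the actual logic lives in `setup_laion.py`):

```python
import numpy as np

rng = np.random.default_rng(42)
num_samples, dim = 10_000, 512  # matches --num-samples and CLIP ViT-B/32

# Random vectors normalized to unit length, mimicking CLIP's cosine-similarity space
embeddings = rng.standard_normal((num_samples, dim)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

np.save("data/laion_embeddings.npy", embeddings)
```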

### 2. Run evaluation

```bash
# Run all evaluation stages
python evaluate_laion.py --index data/laion_index.leann

# Run specific stages
python evaluate_laion.py --index data/laion_index.leann --stage 2  # Recall evaluation
python evaluate_laion.py --index data/laion_index.leann --stage 3  # Complexity analysis
python evaluate_laion.py --index data/laion_index.leann --stage 4  # Index comparison
python evaluate_laion.py --index data/laion_index.leann --stage 5  # Multimodal generation

# Multimodal generation with Qwen2.5-VL
python evaluate_laion.py --index data/laion_index.leann --stage 5 --model-name Qwen/Qwen2.5-VL-7B-Instruct
```

### 3. Save results

```bash
python evaluate_laion.py --index data/laion_index.leann --output results.json
```

## Configuration Options

### Setup Options

```bash
python setup_laion.py \
  --num-samples 10000 \
  --num-queries 200 \
  --index-path data/laion_index.leann \
  --backend hnsw
```

### Evaluation Options

```bash
python evaluate_laion.py \
  --index data/laion_index.leann \
  --queries data/evaluation_queries.jsonl \
  --complexity 64 \
  --top-k 3 \
  --num-samples 100 \
  --stage all
```

## Evaluation Stages

### Stage 2: Recall Evaluation

- Evaluates Recall@3 for multimodal retrieval (see the sketch after this list)
- Compares LEANN against a FAISS baseline
- Self-recall: a query caption should retrieve its original image
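
Because ground truth is self-recall, Recall@K reduces to a membership test per query. A minimal sketch (helper names are illustrative, not taken from `evaluate_laion.py`):

```python
def recall_at_k(retrieved_ids: list[str], ground_truth_id: str, k: int = 3) -> float:
    """1.0 if the query's original image appears in the top-k results, else 0.0."""
    return float(ground_truth_id in retrieved_ids[:k])


def mean_recall(per_query: list[tuple[list[str], str]], k: int = 3) -> float:
    """Average Recall@k over (retrieved_ids, ground_truth_id) pairs."""
    return sum(recall_at_k(ids, gt, k) for ids, gt in per_query) / len(per_query)
```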

### Stage 3: Complexity Analysis

- Binary search for the optimal complexity (90% recall target; a sketch follows this list)
- Tests performance across different complexity levels
- Analyzes speed vs. accuracy tradeoffs
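
A minimal sketch of the binary search, assuming recall is monotonically non-decreasing in the complexity parameter (`measure_recall` is a hypothetical callback that runs the benchmark at a given complexity):

```python
def find_optimal_complexity(measure_recall, target: float = 0.90, lo: int = 1, hi: int = 128) -> int:
    """Smallest complexity in [lo, hi] whose measured recall meets the target."""
    best = hi
    while lo <= hi:
        mid = (lo + hi) // 2
        if measure_recall(mid) >= target:
            best, hi = mid, mid - 1  # target met; try a cheaper setting
        else:
            lo = mid + 1             # recall too low; search higher complexities
    return best
```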

### Stage 4: Index Comparison

- Compares compact vs. non-compact index sizes (a measurement sketch follows this list)
- Measures search performance differences
- Reports storage efficiency and speed ratios
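
Index size can be measured directly from disk. A small sketch, assuming hypothetical paths for the two index variants:

```python
from pathlib import Path


def index_size_mb(index_dir: str) -> float:
    """Total on-disk size of an index directory, in MB."""
    return sum(f.stat().st_size for f in Path(index_dir).rglob("*") if f.is_file()) / 1e6


compact = index_size_mb("data/laion_index_compact.leann")  # hypothetical path
full = index_size_mb("data/laion_index.leann")
print(f"storage savings: {1 - compact / full:.1%}")
```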

### Stage 5: Multimodal Generation

- Uses Qwen2.5-VL for image understanding and description (a generation sketch follows this list)
- Retrieval-Augmented Generation (RAG) with multimodal context
- Measures both search and generation timing
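
The generation step feeds retrieved images to Qwen2.5-VL. A rough sketch using the standard transformers API for this model (the image path and prompt are illustrative; the benchmark's actual prompting lives in `evaluate_laion.py`):

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

image = Image.open("data/laion_images/retrieved_0.jpg")  # hypothetical retrieved image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image for the query: a photo of a red vintage car"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```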

## Output Metrics

### Timing Metrics

- Average/median/min/max search time
- Standard deviation
- Searches per second
- Latency in milliseconds
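
These statistics follow directly from the per-query latencies. A minimal sketch with the standard library (function name is illustrative):

```python
import statistics


def timing_stats(latencies_s: list[float]) -> dict[str, float]:
    """Summarize per-query search latencies given in seconds."""
    avg = statistics.mean(latencies_s)
    return {
        "avg_ms": avg * 1000,
        "median_ms": statistics.median(latencies_s) * 1000,
        "min_ms": min(latencies_s) * 1000,
        "max_ms": max(latencies_s) * 1000,
        "stdev_ms": statistics.stdev(latencies_s) * 1000,
        "searches_per_sec": 1 / avg,
    }
```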

### Recall Metrics

- Recall@3 percentage for image retrieval
- Number of queries with ground truth

### Index Metrics

- Total index size (MB)
- Component breakdown (index, passages, metadata)
- Storage savings (compact vs. non-compact)
- Backend and embedding model info

### Generation Metrics (Stage 5)

- Average search time per query
- Average generation time per query
- Time distribution (search vs. generation)
- Sample multimodal responses
- Qwen2.5-VL model performance

## Benchmark Results

### LEANN-RAG Performance (CLIP ViT-L/14 + Qwen2.5-VL)

**Stage 3: Optimal Complexity Analysis**

- **Optimal Complexity**: 85 (achieving 90% Recall@3)
- **Binary Search Range**: 1-128
- **Target Recall**: 90%
- **Index Type**: Non-compact (for fast binary search)

**Stage 5: Multimodal Generation Performance (Qwen2.5-VL)**

- **Total Queries**: 20
- **Average Search Time**: 1.200s per query
- **Average Generation Time**: 6.558s per query
- **Time Distribution**: Search 15.5%, Generation 84.5%
- **LLM Backend**: HuggingFace transformers
- **Model**: Qwen/Qwen2.5-VL-7B-Instruct
- **Optimal Complexity**: 85

**System Performance:**

- **Index Contents**: ~10,000 image embeddings from the LAION subset
- **Embedding Model**: CLIP ViT-L/14 (768 dimensions)
- **Backend**: HNSW with cosine distance

### Example Results

```
🎯 LAION MULTIMODAL BENCHMARK RESULTS
============================================================

📊 Multimodal Generation Results:
   Total Queries: 20
   Avg Search Time: 1.200s
   Avg Generation Time: 6.558s
   Time Distribution: Search 15.5%, Generation 84.5%
   LLM Backend: HuggingFace transformers
   Model: Qwen/Qwen2.5-VL-7B-Instruct

⚙️ Optimal Complexity Analysis:
   Target Recall: 90%
   Optimal Complexity: 85
   Binary Search Range: 1-128
   Non-compact Index (fast search, no recompute)

🚀 Performance Summary:
   Multimodal RAG: 7.758s total per query
   Search: 15.5% of total time
   Generation: 84.5% of total time
```

## Directory Structure

```
benchmarks/laion/
├── setup_laion.py               # Setup script
├── evaluate_laion.py            # Evaluation script
├── README.md                    # This file
└── data/                        # Generated data
    ├── laion_images/            # Image files (placeholder)
    ├── laion_metadata.jsonl     # Image metadata
    ├── laion_passages.jsonl     # LEANN passages
    ├── laion_embeddings.npy     # CLIP embeddings
    ├── evaluation_queries.jsonl # Evaluation queries
    └── laion_index.leann/       # LEANN index files
```

## Notes

- The current implementation uses dummy data for demonstration
- For real LAION data, implement the actual download logic in `setup_laion.py`
- CLIP embeddings are randomly generated; replace them with a real CLIP model for production (see the sketch below)
- Adjust `--num-samples` and `--num-queries` based on available resources
- Consider lowering `--num-samples` during evaluation for faster testing
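
For the production path mentioned above, real embeddings can come from an off-the-shelf CLIP checkpoint. A sketch using Hugging Face transformers (model choice and file path are illustrative):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("data/laion_images/000000.jpg")  # hypothetical path
with torch.no_grad():
    inputs = processor(images=image, return_tensors="pt")
    emb = model.get_image_features(**inputs)     # shape (1, 512) for ViT-B/32
    emb = emb / emb.norm(dim=-1, keepdim=True)   # unit-normalize for cosine search
```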