docs: data updated

2025-09-15 19:50:02 -07:00
parent d7011bbea0
commit a0d6857faa
9 changed files with 749 additions and 133 deletions
@@ -1,6 +1,6 @@
 # LAION Multimodal Benchmark

-A multimodal benchmark for evaluating image retrieval performance using LEANN with CLIP embeddings on LAION dataset subset.
+A multimodal benchmark for evaluating image retrieval and generation on a subset of the LAION dataset, using LEANN with CLIP embeddings for retrieval and Qwen2.5-VL for generation.

 ## Overview

@@ -9,6 +9,7 @@ This benchmark evaluates:
 - **Recall@K performance** for image search
 - **Complexity analysis** across different search parameters
 - **Index size and storage efficiency**
+- **Multimodal generation** with Qwen2.5-VL for image understanding and description

 ## Dataset Configuration

@@ -39,9 +40,13 @@ This will:
 python evaluate_laion.py --index data/laion_index.leann

 # Run specific stages
-python evaluate_laion.py --index data/laion_index.leann --stage timing
-python evaluate_laion.py --index data/laion_index.leann --stage recall
-python evaluate_laion.py --index data/laion_index.leann --stage complexity
+python evaluate_laion.py --index data/laion_index.leann --stage 2  # Recall evaluation
+python evaluate_laion.py --index data/laion_index.leann --stage 3  # Complexity analysis
+python evaluate_laion.py --index data/laion_index.leann --stage 4  # Index comparison
+python evaluate_laion.py --index data/laion_index.leann --stage 5  # Multimodal generation
+
+# Multimodal generation with Qwen2.5-VL
+python evaluate_laion.py --index data/laion_index.leann --stage 5 --model-name Qwen/Qwen2.5-VL-7B-Instruct
 ```

 ### 3. Save results
@@ -74,23 +79,26 @@ python evaluate_laion.py \

 ## Evaluation Stages

-### Stage 1: Index Analysis
- Analyzes index file sizes and metadata
- Reports storage efficiency
-
-### Stage 2: Search Timing
- Measures average search latency
- Tests with configurable complexity and top-k
- Reports searches per second
-
-### Stage 3: Recall Evaluation
- Evaluates Recall@K using ground truth
+### Stage 2: Recall Evaluation
+- Evaluates Recall@3 for multimodal retrieval
+- Compares LEANN vs FAISS baseline performance
 - Self-recall: query caption should retrieve original image

-### Stage 4: Complexity Analysis
- Tests performance across different complexity levels [16, 32, 64, 128]
+### Stage 3: Complexity Analysis
+- Binary search for optimal complexity (90% recall target)
+- Tests performance across different complexity levels
 - Analyzes speed vs. accuracy tradeoffs

+### Stage 4: Index Comparison
+- Compares compact vs non-compact index sizes
+- Measures search performance differences
+- Reports storage efficiency and speed ratios
+
+### Stage 5: Multimodal Generation
+- Uses Qwen2.5-VL for image understanding and description
+- Retrieval-Augmented Generation (RAG) with multimodal context
+- Measures both search and generation timing
+
 ## Output Metrics

 ### Timing Metrics
@@ -100,48 +108,70 @@ python evaluate_laion.py \
 - Latency in milliseconds

 ### Recall Metrics
- Recall@K percentage
+- Recall@3 percentage for image retrieval
 - Number of queries with ground truth

 ### Index Metrics
 - Total index size (MB)
 - Component breakdown (index, passages, metadata)
+- Storage savings (compact vs non-compact)
 - Backend and embedding model info

-## Example Results
+### Generation Metrics (Stage 5)
+- Average search time per query
+- Average generation time per query
+- Time distribution (search vs generation)
+- Sample multimodal responses
+- Model name (Qwen2.5-VL)
+
+## Benchmark Results
+
+### LEANN-RAG Performance (CLIP ViT-L/14 + Qwen2.5-VL)
+
+**Stage 3: Optimal Complexity Analysis**
+- **Optimal Complexity**: 85 (achieving 90% Recall@3)
+- **Binary Search Range**: 1-128
+- **Target Recall**: 90%
+- **Index Type**: Non-compact (for fast binary search)
+
+**Stage 5: Multimodal Generation Performance (Qwen2.5-VL)**
+- **Total Queries**: 20
+- **Average Search Time**: 1.200s per query
+- **Average Generation Time**: 6.558s per query
+- **Time Distribution**: Search 15.5%, Generation 84.5%
+- **LLM Backend**: HuggingFace transformers
+- **Model**: Qwen/Qwen2.5-VL-7B-Instruct
+- **Optimal Complexity**: 85
+
+**System Performance:**
+- **Index Size**: ~10,000 image embeddings from LAION subset
+- **Embedding Model**: CLIP ViT-L/14 (768 dimensions)
+- **Backend**: HNSW with cosine distance
+
+### Example Results

 ```
 🎯 LAION MULTIMODAL BENCHMARK RESULTS
 ============================================================

-📏 Index Information:
-  Total size: 145.2 MB
-  Backend: hnsw
-  Embedding model: clip-vit-b-32
-  Total passages: 10000
+📊 Multimodal Generation Results:
+  Total Queries: 20
+  Avg Search Time: 1.200s
+  Avg Generation Time: 6.558s
+  Time Distribution: Search 15.5%, Generation 84.5%
+  LLM Backend: HuggingFace transformers
+  Model: Qwen/Qwen2.5-VL-7B-Instruct

-⚡ Search Performance:
-  Total queries: 200
-  Average search time: 0.023s
-  Median search time: 0.021s
-  Min/Max search time: 0.012s / 0.089s
-  Std dev: 0.008s
-  Complexity: 64
-  Top-K: 3
-
-📊 Recall Performance:
-  Recall@3: 85.5%
-  Queries with ground truth: 200
-
-⚙️ Complexity Analysis:
-  Complexity  16: 0.015s avg
-  Complexity  32: 0.019s avg
-  Complexity  64: 0.023s avg
-  Complexity 128: 0.031s avg
+⚙️ Optimal Complexity Analysis:
+  Target Recall: 90%
+  Optimal Complexity: 85
+  Binary Search Range: 1-128
+  Non-compact Index (fast search, no recompute)

 🚀 Performance Summary:
-  Searches per second: 43.5
-  Latency (ms): 23.0ms
+  Multimodal RAG: 7.758s total per query
+  Search: 15.5% of total time
+  Generation: 84.5% of total time
 ```

 ## Directory Structure
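
Note on the Stage 3 description in the README hunks above: it mentions a binary search for the complexity that reaches the 90% recall target over the range 1-128. The script's actual Stage 3 code is not part of this diff, so the following is only a minimal sketch of how such a search could look. The function name `find_min_complexity` and the `recall_at` callback are hypothetical, and the sketch assumes that recall does not decrease as complexity grows.

```python
from typing import Callable, Optional


def find_min_complexity(
    recall_at: Callable[[int], float],
    low: int = 1,
    high: int = 128,
    target: float = 0.90,
) -> Optional[int]:
    """Return the smallest complexity in [low, high] whose recall meets target.

    Assumes recall is non-decreasing in complexity. Returns None when even
    the largest complexity in the range misses the target.
    """
    best = None
    while low <= high:
        mid = (low + high) // 2
        if recall_at(mid) >= target:
            best = mid       # mid is good enough, so try a cheaper setting
            high = mid - 1
        else:
            low = mid + 1    # mid misses the target, so search the upper half
    return best


if __name__ == "__main__":
    # Synthetic step function standing in for a real recall measurement.
    def synthetic_recall(complexity: int) -> float:
        return 0.92 if complexity >= 40 else 0.80

    print(find_min_complexity(synthetic_recall))  # prints 40
```

In a real run, `recall_at` would wrap a Recall@3 measurement at the given complexity. Each probe is a full evaluation pass, which is presumably why the README pairs this stage with the non-compact index.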
@@ -4,6 +4,7 @@ LAION Multimodal Benchmark Evaluation Script - Modular Recall-based Evaluation

 import argparse
 import json
+import logging
 import os
 import pickle
 import time
@@ -14,6 +15,13 @@ from leann import LeannSearcher
 from leann_backend_hnsw import faiss
 from sentence_transformers import SentenceTransformer

+from ..llm_utils import evaluate_multimodal_rag, load_qwen_vl_model
+
+# Set up logging to reduce verbose output
+logging.basicConfig(level=logging.WARNING)
+logging.getLogger("leann.api").setLevel(logging.WARNING)
+logging.getLogger("leann_backend_hnsw").setLevel(logging.WARNING)
+

 class RecallEvaluator:
    """Stage 2: Evaluate Recall@3 (LEANN vs FAISS baseline for multimodal retrieval)"""
@@ -388,13 +396,22 @@ def main():
    )
    parser.add_argument(
        "--stage",
-        choices=["2", "3", "4", "all"],
+        choices=["2", "3", "4", "5", "all"],
        default="all",
-        help="Which stage to run (2=recall, 3=complexity, 4=index comparison)",
+        help="Which stage to run (2=recall, 3=complexity, 4=index comparison, 5=generation)",
    )
    parser.add_argument("--complexity", type=int, default=None, help="Complexity for search")
    parser.add_argument("--baseline-dir", default="baseline", help="Baseline output directory")
    parser.add_argument("--output", help="Save results to JSON file")
+    parser.add_argument(
+        "--llm-backend",
+        choices=["hf"],
+        default="hf",
+        help="LLM backend (Qwen2.5-VL only supports HF)",
+    )
+    parser.add_argument(
+        "--model-name", default="Qwen/Qwen2.5-VL-7B-Instruct", help="Multimodal model name"
+    )

    args = parser.parse_args()

@@ -615,12 +632,69 @@ def main():
            evaluator.cleanup()
            print("✅ Stage 4 completed!\n")

+        if args.stage in ("5", "all"):
+            print("🚀 Starting Stage 5: Multimodal generation with Qwen2.5-VL")
+            evaluator = LAIONEvaluator(args.index)
+            captions = evaluator.load_queries(args.queries)
+            test_captions = captions[: min(20, len(captions))]  # Use subset for generation
+
+            print(f"🧪 Testing multimodal generation with {len(test_captions)} queries")
+
+            # Load Qwen2.5-VL model
+            try:
+                print("Loading Qwen2.5-VL model...")
+                processor, model = load_qwen_vl_model(args.model_name)
+
+                # Run multimodal generation evaluation
+                complexity = args.complexity or 64
+                gen_results = evaluate_multimodal_rag(
+                    evaluator.searcher,
+                    test_captions,
+                    processor=processor,
+                    model=model,
+                    complexity=complexity,
+                )
+
+                print("\n📊 Multimodal Generation Results:")
+                print(f"  Total Queries: {len(test_captions)}")
+                print(f"  Avg Search Time: {gen_results['avg_search_time']:.3f}s")
+                print(f"  Avg Generation Time: {gen_results['avg_generation_time']:.3f}s")
+                total_time = gen_results["avg_search_time"] + gen_results["avg_generation_time"]
+                search_pct = (gen_results["avg_search_time"] / total_time) * 100
+                gen_pct = (gen_results["avg_generation_time"] / total_time) * 100
+                print(f"  Time Distribution: Search {search_pct:.1f}%, Generation {gen_pct:.1f}%")
+                print("  LLM Backend: HuggingFace transformers")
+                print(f"  Model: {args.model_name}")
+
+                # Show sample results
+                print("\n📝 Sample Multimodal Generations:")
+                for i, response in enumerate(gen_results["results"][:3]):
+                    # Handle both string and dict formats for captions
+                    if isinstance(test_captions[i], dict):
+                        caption_text = test_captions[i].get("query", str(test_captions[i]))
+                    else:
+                        caption_text = str(test_captions[i])
+                    print(f"  Query {i + 1}: {caption_text[:60]}...")
+                    print(f"  Response {i + 1}: {response[:100]}...")
+                    print()
+
+            except Exception as e:
+                print(f"❌ Multimodal generation evaluation failed: {e}")
+                print("💡 Make sure transformers and Qwen2.5-VL are installed")
+                import traceback
+
+                traceback.print_exc()
+
+            evaluator.cleanup()
+            print("✅ Stage 5 completed!\n")
+
        if args.stage == "all":
            print("🎉 All evaluation stages completed successfully!")
            print("\n📋 Summary:")
            print("  Stage 2: ✅ Multimodal Recall@3 evaluation completed")
            print("  Stage 3: ✅ Optimal complexity found")
            print("  Stage 4: ✅ Index comparison analysis completed")
+            print("  Stage 5: ✅ Multimodal generation evaluation completed")
            print("\n🔧 Recommended next steps:")
            print("  - Use optimal complexity for best speed/accuracy balance")
            print("  - Review index comparison for storage vs performance tradeoffs")
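
Note on the Stage 5 hunk above: the time distribution is computed by dividing each average by their sum, which would raise `ZeroDivisionError` if both averages were ever zero (for example, if no queries were timed). The sketch below isolates that calculation in a small helper with a guard. The helper name `time_distribution` is hypothetical and not part of the commit.

```python
from typing import Tuple


def time_distribution(
    avg_search_time: float, avg_generation_time: float
) -> Tuple[float, float]:
    """Return (search_pct, generation_pct) of the combined per-query time."""
    total = avg_search_time + avg_generation_time
    if total <= 0:
        # Nothing was timed, so report zeros instead of dividing by zero.
        return 0.0, 0.0
    return avg_search_time / total * 100, avg_generation_time / total * 100


if __name__ == "__main__":
    # Averages taken from the README results in this commit.
    search_pct, gen_pct = time_distribution(1.200, 6.558)
    print(f"Search {search_pct:.1f}%, Generation {gen_pct:.1f}%")
    # prints: Search 15.5%, Generation 84.5%
```

With the README's averages this reproduces the reported 15.5% / 84.5% split.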