feat: Add ColQwen multimodal PDF retrieval integration

2025-12-19 13:54:38 -08:00
parent 0175bc9c20
commit 360fdf575c
3 changed files with 0 additions and 376 deletions
@@ -1,200 +0,0 @@
-# ColQwen Integration Guide
-
-Easy-to-use multimodal PDF retrieval with ColQwen2/ColPali models.
-
-## Quick Start
-
-> **🍎 Mac Users**: On Apple Silicon, ColQwen uses MPS acceleration when available, which is faster than the CPU fallback.
-
-### 1. Install Dependencies
-```bash
-uv pip install colpali_engine pdf2image pillow matplotlib qwen_vl_utils einops seaborn
-brew install poppler  # macOS only, for PDF processing
-```
-
-### 2. Basic Usage
-```bash
-# Build index from PDFs
-python -m apps.colqwen_rag build --pdfs ./my_papers/ --index research_papers
-
-# Search with text queries
-python -m apps.colqwen_rag search research_papers "How does attention mechanism work?"
-
-# Interactive Q&A
-python -m apps.colqwen_rag ask research_papers --interactive
-```
-
-## Commands
-
-### Build Index
-```bash
-python -m apps.colqwen_rag build \
-  --pdfs ./pdf_directory/ \
-  --index my_index \
-  --model colqwen2 \
-  --pages-dir ./page_images/  # Optional: save page images
-```
-
-**Options:**
- `--pdfs`: Directory containing PDF files (or single PDF path)
- `--index`: Name for the index (required)
- `--model`: `colqwen2` (default) or `colpali`
- `--pages-dir`: Directory to save page images (optional)
-
-### Search Index
-```bash
-python -m apps.colqwen_rag search my_index "your question here" --top-k 5
-```
-
-**Options:**
- `--top-k`: Number of results to return (default: 5)
- `--model`: Model used for search (should match build model)
-
-### Interactive Q&A
-```bash
-python -m apps.colqwen_rag ask my_index --interactive
-```
-
-**Commands in interactive mode:**
- Type your questions naturally
- `help`: Show available commands
- `quit`/`exit`/`q`: Exit interactive mode
-
-## 🧪 Test & Reproduce Results
-
-Run the reproduction test for issue #119:
-```bash
-python test_colqwen_reproduction.py
-```
-
-This will:
-1. ✅ Check dependencies
-2. 📥 Download sample PDF (Attention Is All You Need paper)
-3. 🏗️ Build test index
-4. 🔍 Run sample queries
-5. 📊 Show how to generate similarity maps
-
-## 🎨 Advanced: Similarity Maps
-
-For visual similarity analysis, use the existing advanced script:
-```bash
-cd apps/multimodal/vision-based-pdf-multi-vector/
-python multi-vector-leann-similarity-map.py
-```
-
-Edit the script to customize:
- `QUERY`: Your question
- `MODEL`: "colqwen2" or "colpali"
- `USE_HF_DATASET`: Use HuggingFace dataset or local PDFs
- `SIMILARITY_MAP`: Generate heatmaps
- `ANSWER`: Enable Qwen-VL answer generation
-
-## 🔧 How It Works
-
-### ColQwen2 vs ColPali
- **ColQwen2** (`vidore/colqwen2-v1.0`): ColPali-style multi-vector retriever built on the Qwen2-VL vision-language model
- **ColPali** (`vidore/colpali-v1.2`): The original multi-vector document retriever, built on PaliGemma
-
-### Architecture
-1. **PDF → Images**: Convert PDF pages to images (150 DPI)
-2. **Vision Encoding**: Process images with ColQwen2/ColPali
-3. **Multi-Vector Index**: Build LEANN HNSW index with multiple embeddings per page
-4. **Query Processing**: Encode text queries with same model
-5. **Similarity Search**: Find most relevant pages/regions
-6. **Visual Maps**: Generate query-to-page similarity heatmaps (optional)
-
-### Device Support
- **CUDA**: Best performance with GPU acceleration
- **MPS**: Apple Silicon Mac support
- **CPU**: Fallback for any system (slower)
-
-Auto-detection: CUDA > MPS > CPU
-
-## 📊 Performance Tips
-
-### For Best Performance:
-```bash
-# Use ColQwen2 for latest features
--model colqwen2
-
-# Save page images for reuse
--pages-dir ./cached_pages/
-
-# Adjust batch size based on GPU memory
-# (automatically handled)
-```
-
-### For Large Document Sets:
- Process PDFs in batches
- Use SSD storage for index files
- Consider using CUDA if available
-
-## 🔗 Related Resources
-
- **Fast-PLAID**: https://github.com/lightonai/fast-plaid
- **Pylate**: https://github.com/lightonai/pylate
- **ColBERT**: https://github.com/stanford-futuredata/ColBERT
- **ColPali Paper**: "ColPali: Efficient Document Retrieval with Vision Language Models"
- **Issue #119**: https://github.com/yichuan-w/LEANN/issues/119
-
-## 🐛 Troubleshooting
-
-### PDF Conversion Issues (macOS)
-```bash
-# Install poppler
-brew install poppler
-which pdfinfo && pdfinfo -v
-```
-
-### Memory Issues
- Reduce batch size (automatically handled)
- Use CPU instead of GPU: `export CUDA_VISIBLE_DEVICES=""`
- Process fewer PDFs at once
-
-### Model Download Issues
- Ensure internet connection for first run
- Models are cached after first download
- Use HuggingFace mirrors if needed
-
-### Import Errors
-```bash
-# Ensure all dependencies installed
-uv pip install colpali_engine pdf2image pillow matplotlib qwen_vl_utils einops seaborn
-
-# Check PyTorch installation
-python -c "import torch; print(torch.__version__)"
-```
-
-## 💡 Examples
-
-### Research Paper Analysis
-```bash
-# Index your research papers
-python -m apps.colqwen_rag build --pdfs ~/Papers/AI/ --index ai_papers
-
-# Ask research questions
-python -m apps.colqwen_rag search ai_papers "What are the limitations of transformer models?"
-python -m apps.colqwen_rag search ai_papers "How does BERT compare to GPT?"
-```
-
-### Document Q&A
-```bash
-# Index business documents
-python -m apps.colqwen_rag build --pdfs ~/Documents/Reports/ --index reports
-
-# Interactive analysis
-python -m apps.colqwen_rag ask reports --interactive
-```
-
-### Visual Analysis
-```bash
-# Generate similarity maps for specific queries
-cd apps/multimodal/vision-based-pdf-multi-vector/
-# Edit multi-vector-leann-similarity-map.py with your query
-python multi-vector-leann-similarity-map.py
-# Check ./figures/ for generated heatmaps
-```
-
---
-
-**🎯 This integration makes ColQwen as easy to use as other LEANN features while maintaining the full power of multimodal document understanding!**
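The removed guide states the device auto-detection order as CUDA > MPS > CPU. A minimal sketch of that priority follows. The function names `pick_device` and `detect_device` are hypothetical and are not taken from `apps/colqwen_rag.py`; the actual implementation may differ.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name, following CUDA > MPS > CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch if it is installed; otherwise fall back to CPU."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch, report CPU.
        return "cpu"
    # Older PyTorch builds have no torch.backends.mps attribute.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print(f"Selected device: {detect_device()}")
```

Keeping the priority rule in a pure function separates it from the hardware probe, so the ordering can be checked on any machine.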
@@ -60,20 +60,6 @@ python -m apps.colqwen_rag ask my_index --interactive
 - `help`: Show available commands
 - `quit`/`exit`/`q`: Exit interactive mode

-## 🧪 Test & Reproduce Results
-
-Run the reproduction test for issue #119:
-```bash
-python test_colqwen_reproduction.py
-```
-
-This will:
-1. ✅ Check dependencies
-2. 📥 Download sample PDF (Attention Is All You Need paper)
-3. 🏗️ Build test index
-4. 🔍 Run sample queries
-5. 📊 Show how to generate similarity maps
-
 ## 🎨 Advanced: Similarity Maps

 For visual similarity analysis, use the existing advanced script:
@@ -1,162 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to reproduce ColQwen results from issue #119
-https://github.com/yichuan-w/LEANN/issues/119
-
-This script demonstrates the ColQwen workflow:
-1. Download sample PDF
-2. Convert to images
-3. Build multimodal index
-4. Run test queries
-5. Generate similarity maps
-"""
-
-import importlib.util
-import os
-from pathlib import Path
-
-
-def main():
-    print("🧪 ColQwen Reproduction Test - Issue #119")
-    print("=" * 50)
-
-    # Check if we're in the right directory
-    repo_root = Path.cwd()
-    if not (repo_root / "apps" / "colqwen_rag.py").exists():
-        print("❌ Please run this script from the LEANN repository root")
-        print("   cd /path/to/LEANN && python test_colqwen_reproduction.py")
-        return
-
-    print("✅ Repository structure looks good")
-
-    # Step 1: Check dependencies
-    print("\n📦 Checking dependencies...")
-    try:
-        import torch
-
-        # Check if pdf2image is available
-        if importlib.util.find_spec("pdf2image") is None:
-            raise ImportError("pdf2image not found")
-        # Check if colpali_engine is available
-        if importlib.util.find_spec("colpali_engine") is None:
-            raise ImportError("colpali_engine not found")
-
-        print("✅ Core dependencies available")
-        print(f"   - PyTorch: {torch.__version__}")
-        print(f"   - CUDA available: {torch.cuda.is_available()}")
-        print(
-            f"   - MPS available: {hasattr(torch.backends, 'mps') and torch.backends.mps.is_available()}"
-        )
-    except ImportError as e:
-        print(f"❌ Missing dependency: {e}")
-        print("\n📥 Install missing dependencies:")
-        print(
-            "   uv pip install colpali_engine pdf2image pillow matplotlib qwen_vl_utils einops seaborn"
-        )
-        return
-
-    # Step 2: Download sample PDF
-    print("\n📄 Setting up sample PDF...")
-    pdf_dir = repo_root / "test_pdfs"
-    pdf_dir.mkdir(exist_ok=True)
-    sample_pdf = pdf_dir / "attention_paper.pdf"
-
-    if not sample_pdf.exists():
-        print("📥 Downloading sample paper (Attention Is All You Need)...")
-        import urllib.request
-
-        try:
-            urllib.request.urlretrieve("https://arxiv.org/pdf/1706.03762.pdf", sample_pdf)
-            print(f"✅ Downloaded: {sample_pdf}")
-        except Exception as e:
-            print(f"❌ Download failed: {e}")
-            print("   Please manually download a PDF to test_pdfs/attention_paper.pdf")
-            return
-    else:
-        print(f"✅ Using existing PDF: {sample_pdf}")
-
-    # Step 3: Test ColQwen RAG
-    print("\n🚀 Testing ColQwen RAG...")
-
-    # Build index
-    print("\n1️⃣ Building multimodal index...")
-    build_cmd = f'python -m apps.colqwen_rag build --pdfs "{pdf_dir}" --index test_attention --model colqwen2 --pages-dir test_pages'
-    print(f"   Command: {build_cmd}")
-
-    try:
-        result = os.system(build_cmd)
-        if result == 0:
-            print("✅ Index built successfully!")
-        else:
-            print("❌ Index building failed")
-            return
-    except Exception as e:
-        print(f"❌ Error building index: {e}")
-        return
-
-    # Test search
-    print("\n2️⃣ Testing search...")
-    test_queries = [
-        "How does attention mechanism work?",
-        "What is the transformer architecture?",
-        "How do you compute self-attention?",
-    ]
-
-    for query in test_queries:
-        print(f"\n🔍 Query: '{query}'")
-        search_cmd = f'python -m apps.colqwen_rag search test_attention "{query}" --top-k 3'
-        print(f"   Command: {search_cmd}")
-
-        try:
-            result = os.system(search_cmd)
-            if result == 0:
-                print("✅ Search completed")
-            else:
-                print("❌ Search failed")
-        except Exception as e:
-            print(f"❌ Search error: {e}")
-
-    # Interactive mode is not exercised here; print the command to try it manually
-    print("\n3️⃣ Interactive mode (manual step)...")
-    print("   You can test interactive mode with:")
-    print("   python -m apps.colqwen_rag ask test_attention --interactive")
-
-    # Step 4: Test similarity maps (using existing script)
-    print("\n4️⃣ Testing similarity maps...")
-    similarity_script = (
-        repo_root
-        / "apps"
-        / "multimodal"
-        / "vision-based-pdf-multi-vector"
-        / "multi-vector-leann-similarity-map.py"
-    )
-
-    if similarity_script.exists():
-        print("   You can generate similarity maps with:")
-        print(f"   cd {similarity_script.parent}")
-        print("   python multi-vector-leann-similarity-map.py")
-        print("   (Edit the script to use your local PDF)")
-
-    print("\n🎉 ColQwen reproduction test completed!")
-    print("\n📋 Summary:")
-    print("   ✅ Dependencies checked")
-    print("   ✅ Sample PDF prepared")
-    print("   ✅ Index building tested")
-    print("   ✅ Search functionality tested")
-    print("   ✅ Interactive mode available")
-    print("   ✅ Similarity maps available")
-
-    print("\n🔗 Related repositories to check:")
-    print("   - https://github.com/lightonai/fast-plaid")
-    print("   - https://github.com/lightonai/pylate")
-    print("   - https://github.com/stanford-futuredata/ColBERT")
-
-    print("\n📝 Next steps:")
-    print("   1. Test with your own PDFs")
-    print("   2. Experiment with different queries")
-    print("   3. Generate similarity maps for visual analysis")
-    print("   4. Compare ColQwen2 vs ColPali performance")
-
-
-if __name__ == "__main__":
-    main()
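The removed script builds its commands as shell strings and runs them with `os.system`. One possible alternative, sketched below, passes an argument list to `subprocess.run`, so paths and queries containing spaces or quotes need no escaping. The helper names `build_argv`, `search_argv`, and `run` are hypothetical. The subcommands and flags are the ones shown in the guide, and `sys.executable` is an assumption meant to keep the child process on the same interpreter.

```python
import subprocess
import sys
from pathlib import Path


def build_argv(pdf_dir: Path, index: str, model: str = "colqwen2",
               pages_dir: str = "test_pages") -> list[str]:
    """Argument list for the `build` subcommand shown in the guide."""
    return [
        sys.executable, "-m", "apps.colqwen_rag", "build",
        "--pdfs", str(pdf_dir),
        "--index", index,
        "--model", model,
        "--pages-dir", pages_dir,
    ]


def search_argv(index: str, query: str, top_k: int = 3) -> list[str]:
    """Argument list for the `search` subcommand shown in the guide."""
    return [
        sys.executable, "-m", "apps.colqwen_rag", "search",
        index, query, "--top-k", str(top_k),
    ]


def run(argv: list[str]) -> bool:
    """Run a command without a shell; True means exit status 0."""
    return subprocess.run(argv, check=False).returncode == 0


if __name__ == "__main__":
    # Only prints the commands; call run(...) from the repository root to execute.
    print(build_argv(Path("test_pdfs"), "test_attention"))
    print(search_argv("test_attention", "How does attention mechanism work?"))
```

Because no shell is involved, each list element reaches the CLI as exactly one argument, and `run` returns a plain boolean that a summary step could use to report failures.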