Compare commits
1 Commits
main...feature/co
| Author | SHA1 | Date | |
|---|---|---|---|
| | 360fdf575c | | |
200
COLQWEN_GUIDE.md
@@ -1,200 +0,0 @@
# ColQwen Integration Guide

Easy-to-use multimodal PDF retrieval with ColQwen2/ColPali models.

## Quick Start

> **🍎 Mac Users**: ColQwen supports MPS acceleration on Apple Silicon for faster inference than CPU-only runs.

### 1. Install Dependencies
```bash
uv pip install colpali_engine pdf2image pillow matplotlib qwen_vl_utils einops seaborn
brew install poppler  # macOS only, for PDF processing
```

### 2. Basic Usage
```bash
# Build index from PDFs
python -m apps.colqwen_rag build --pdfs ./my_papers/ --index research_papers

# Search with text queries
python -m apps.colqwen_rag search research_papers "How does attention mechanism work?"

# Interactive Q&A
python -m apps.colqwen_rag ask research_papers --interactive
```

## Commands

### Build Index
```bash
python -m apps.colqwen_rag build \
    --pdfs ./pdf_directory/ \
    --index my_index \
    --model colqwen2 \
    --pages-dir ./page_images/  # Optional: save page images
```

**Options:**
- `--pdfs`: Directory containing PDF files (or single PDF path)
- `--index`: Name for the index (required)
- `--model`: `colqwen2` (default) or `colpali`
- `--pages-dir`: Directory to save page images (optional)
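
Since `--pdfs` accepts either a directory or a single file, the argument has to be resolved to a list of PDFs before any page is converted. The following is a minimal sketch of one way to do that. The helper name `collect_pdfs`, the non-recursive directory scan, and the case-insensitive `.pdf` check are assumptions for illustration, not the actual code in `apps.colqwen_rag`.

```python
from __future__ import annotations

from pathlib import Path


def collect_pdfs(pdfs_arg: str) -> list[Path]:
    """Resolve a directory-or-file argument to a sorted list of PDF paths.

    Hypothetical helper: the real CLI may scan recursively or filter differently.
    """
    path = Path(pdfs_arg).expanduser()
    if path.is_file():
        if path.suffix.lower() != ".pdf":
            raise ValueError(f"Not a PDF file: {path}")
        return [path]
    if path.is_dir():
        # Non-recursive scan; sorting keeps the document order reproducible.
        return sorted(
            p for p in path.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
        )
    raise FileNotFoundError(f"No such file or directory: {path}")
```

Raising early on a missing path or a non-PDF file keeps the failure close to the command line instead of surfacing later during page conversion.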

### Search Index
```bash
python -m apps.colqwen_rag search my_index "your question here" --top-k 5
```

**Options:**
- `--top-k`: Number of results to return (default: 5)
- `--model`: Model used for search (should match build model)

### Interactive Q&A
```bash
python -m apps.colqwen_rag ask my_index --interactive
```

**Commands in interactive mode:**
- Type your questions naturally
- `help`: Show available commands
- `quit`/`exit`/`q`: Exit interactive mode
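
The interactive commands above amount to a small read-eval loop. The sketch below shows one way such a loop could dispatch input. The names `classify_input` and `run_repl`, the help text, and the case-insensitive matching are assumptions for illustration. The `answer` callable stands in for whatever retrieval and generation the real `ask` command performs.

```python
from __future__ import annotations

from typing import Callable

QUIT_WORDS = {"quit", "exit", "q"}


def classify_input(line: str) -> str:
    """Map a raw input line to 'empty', 'quit', 'help', or 'question'."""
    text = line.strip().lower()
    if not text:
        return "empty"
    if text in QUIT_WORDS:
        return "quit"
    if text == "help":
        return "help"
    return "question"


def run_repl(
    answer: Callable[[str], str],
    read: Callable[[str], str] = input,
    write: Callable[[str], None] = print,
) -> None:
    """Read lines until a quit word or end of input; answer everything else."""
    while True:
        try:
            line = read("> ")
        except (EOFError, KeyboardInterrupt):
            break
        kind = classify_input(line)
        if kind == "quit":
            break
        if kind == "help":
            write("Type a question, or quit/exit/q to leave.")
        elif kind == "question":
            write(answer(line.strip()))
```

Passing `read` and `write` as parameters keeps the loop testable without a terminal.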

## 🧪 Test & Reproduce Results

Run the reproduction test for issue #119:
```bash
python test_colqwen_reproduction.py
```

This will:
1. ✅ Check dependencies
2. 📥 Download sample PDF (Attention Is All You Need paper)
3. 🏗️ Build test index
4. 🔍 Run sample queries
5. 📊 Show how to generate similarity maps

## 🎨 Advanced: Similarity Maps

For visual similarity analysis, use the existing advanced script:
```bash
cd apps/multimodal/vision-based-pdf-multi-vector/
python multi-vector-leann-similarity-map.py
```

Edit the script to customize:
- `QUERY`: Your question
- `MODEL`: "colqwen2" or "colpali"
- `USE_HF_DATASET`: Use HuggingFace dataset or local PDFs
- `SIMILARITY_MAP`: Generate heatmaps
- `ANSWER`: Enable Qwen-VL answer generation

## 🔧 How It Works

### ColQwen2 vs ColPali
- **ColQwen2** (`vidore/colqwen2-v1.0`): Latest vision-language model
- **ColPali** (`vidore/colpali-v1.2`): Proven multimodal retriever

### Architecture
1. **PDF → Images**: Convert PDF pages to images (150 DPI)
2. **Vision Encoding**: Process images with ColQwen2/ColPali
3. **Multi-Vector Index**: Build LEANN HNSW index with multiple embeddings per page
4. **Query Processing**: Encode text queries with the same model
5. **Similarity Search**: Find most relevant pages/regions
6. **Visual Maps**: Generate similarity heatmaps (optional)
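
Because each page is stored as multiple embeddings, a page's relevance is usually scored with ColBERT-style late interaction (MaxSim): every query vector is matched to its best page vector and the maxima are summed. The sketch below shows that scoring rule on plain Python lists. It is an illustration under assumptions, not LEANN's search path: the function names are hypothetical, the vectors are toy values, and the real pipeline retrieves candidates through the HNSW index instead of scanning every page.

```python
from __future__ import annotations


def dot(u: list[float], v: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: list[list[float]], page_vecs: list[list[float]]) -> float:
    """Late-interaction score: sum over query vectors of the best page match.

    Assumes page_vecs is non-empty; max() raises ValueError otherwise.
    """
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(
    query_vecs: list[list[float]],
    pages: dict[str, list[list[float]]],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Score every page exhaustively and return the top_k (name, score) pairs."""
    scored = [(maxsim_score(query_vecs, vecs), name) for name, vecs in pages.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]
    pages = {
        "page_1": [[1.0, 0.0], [0.0, 0.5]],
        "page_2": [[0.2, 0.2]],
    }
    print(rank_pages(query, pages, top_k=2))
```

The per-vector maxima are also what a similarity map visualizes: for a given query token, the page patches with the highest dot products are the ones highlighted.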

### Device Support
- **CUDA**: Best performance with GPU acceleration
- **MPS**: Apple Silicon Mac support
- **CPU**: Fallback for any system (slower)

Auto-detection: CUDA > MPS > CPU
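
The priority order above can be expressed in a few lines. This is a sketch of the selection logic only; `pick_device` and `detect_device` are hypothetical names, and the guide does not show how `apps.colqwen_rag` implements its own detection. The PyTorch calls used are `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Apply the CUDA > MPS > CPU priority to two availability flags."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; fall back to CPU if it is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not expose torch.backends.mps at all.
    mps = getattr(torch.backends, "mps", None)
    mps_available = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_available)


if __name__ == "__main__":
    print(f"Selected device: {detect_device()}")
```

Separating the pure priority rule from the PyTorch queries makes the rule easy to check on machines without a GPU.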

## 📊 Performance Tips

### For Best Performance:
```bash
# Use ColQwen2 for latest features
--model colqwen2

# Save page images for reuse
--pages-dir ./cached_pages/

# Adjust batch size based on GPU memory
# (automatically handled)
```

### For Large Document Sets:
- Process PDFs in batches
- Use SSD storage for index files
- Consider using CUDA if available
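
For the first tip, a small generator is enough to split a long list of PDF paths into fixed-size groups. This is a generic sketch: the helper name `batched` and the batch size are assumptions, and how each batch is then indexed depends on the CLI, which this example does not assume anything about.

```python
from __future__ import annotations

from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls the next batch_size items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


if __name__ == "__main__":
    paths = [f"paper_{i}.pdf" for i in range(5)]
    for group in batched(paths, 2):
        print(group)
```

Working from an iterator means the full file list never has to be held in more than one batch-sized chunk at a time.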

## 🔗 Related Resources

- **Fast-PLAID**: https://github.com/lightonai/fast-plaid
- **Pylate**: https://github.com/lightonai/pylate
- **ColBERT**: https://github.com/stanford-futuredata/ColBERT
- **ColPali Paper**: ColPali: Efficient Document Retrieval with Vision Language Models
- **Issue #119**: https://github.com/yichuan-w/LEANN/issues/119

## 🐛 Troubleshooting

### PDF Conversion Issues (macOS)
```bash
# Install poppler
brew install poppler
which pdfinfo && pdfinfo -v
```

### Memory Issues
- Reduce batch size (automatically handled)
- Use CPU instead of GPU: `export CUDA_VISIBLE_DEVICES=""`
- Process fewer PDFs at once

### Model Download Issues
- Ensure internet connection for first run
- Models are cached after first download
- Use HuggingFace mirrors if needed

### Import Errors
```bash
# Ensure all dependencies installed
uv pip install colpali_engine pdf2image pillow matplotlib qwen_vl_utils einops seaborn

# Check PyTorch installation
python -c "import torch; print(torch.__version__)"
```

## 💡 Examples

### Research Paper Analysis
```bash
# Index your research papers
python -m apps.colqwen_rag build --pdfs ~/Papers/AI/ --index ai_papers

# Ask research questions
python -m apps.colqwen_rag search ai_papers "What are the limitations of transformer models?"
python -m apps.colqwen_rag search ai_papers "How does BERT compare to GPT?"
```

### Document Q&A
```bash
# Index business documents
python -m apps.colqwen_rag build --pdfs ~/Documents/Reports/ --index reports

# Interactive analysis
python -m apps.colqwen_rag ask reports --interactive
```

### Visual Analysis
```bash
# Generate similarity maps for specific queries
cd apps/multimodal/vision-based-pdf-multi-vector/
# Edit multi-vector-leann-similarity-map.py with your query
python multi-vector-leann-similarity-map.py
# Check ./figures/ for generated heatmaps
```

---

**🎯 This integration makes ColQwen as easy to use as other LEANN features while maintaining the full power of multimodal document understanding!**

@@ -60,20 +60,6 @@ python -m apps.colqwen_rag ask my_index --interactive
- `help`: Show available commands
- `quit`/`exit`/`q`: Exit interactive mode

## 🧪 Test & Reproduce Results

Run the reproduction test for issue #119:
```bash
python test_colqwen_reproduction.py
```

This will:
1. ✅ Check dependencies
2. 📥 Download sample PDF (Attention Is All You Need paper)
3. 🏗️ Build test index
4. 🔍 Run sample queries
5. 📊 Show how to generate similarity maps

## 🎨 Advanced: Similarity Maps

For visual similarity analysis, use the existing advanced script:

@@ -1,162 +0,0 @@
#!/usr/bin/env python3
"""
Test script to reproduce ColQwen results from issue #119
https://github.com/yichuan-w/LEANN/issues/119

This script demonstrates the ColQwen workflow:
1. Download sample PDF
2. Convert to images
3. Build multimodal index
4. Run test queries
5. Generate similarity maps
"""

import importlib.util
import shlex
import subprocess
import sys
from pathlib import Path


def main():
    print("🧪 ColQwen Reproduction Test - Issue #119")
    print("=" * 50)

    # Check if we're in the right directory
    repo_root = Path.cwd()
    if not (repo_root / "apps" / "colqwen_rag.py").exists():
        print("❌ Please run this script from the LEANN repository root")
        print(" cd /path/to/LEANN && python test_colqwen_reproduction.py")
        return

    print("✅ Repository structure looks good")

    # Step 1: Check dependencies
    print("\n📦 Checking dependencies...")
    try:
        import torch

        # Check if pdf2image is available
        if importlib.util.find_spec("pdf2image") is None:
            raise ImportError("pdf2image not found")
        # Check if colpali_engine is available
        if importlib.util.find_spec("colpali_engine") is None:
            raise ImportError("colpali_engine not found")

        print("✅ Core dependencies available")
        print(f" - PyTorch: {torch.__version__}")
        print(f" - CUDA available: {torch.cuda.is_available()}")
        print(
            f" - MPS available: {hasattr(torch.backends, 'mps') and torch.backends.mps.is_available()}"
        )
    except ImportError as e:
        print(f"❌ Missing dependency: {e}")
        print("\n📥 Install missing dependencies:")
        print(
            " uv pip install colpali_engine pdf2image pillow matplotlib qwen_vl_utils einops seaborn"
        )
        return

    # Step 2: Download sample PDF
    print("\n📄 Setting up sample PDF...")
    pdf_dir = repo_root / "test_pdfs"
    pdf_dir.mkdir(exist_ok=True)
    sample_pdf = pdf_dir / "attention_paper.pdf"

    if not sample_pdf.exists():
        print("📥 Downloading sample paper (Attention Is All You Need)...")
        import urllib.request

        try:
            urllib.request.urlretrieve("https://arxiv.org/pdf/1706.03762.pdf", sample_pdf)
            print(f"✅ Downloaded: {sample_pdf}")
        except Exception as e:
            print(f"❌ Download failed: {e}")
            print(" Please manually download a PDF to test_pdfs/attention_paper.pdf")
            return
    else:
        print(f"✅ Using existing PDF: {sample_pdf}")

    # Step 3: Test ColQwen RAG
    print("\n🚀 Testing ColQwen RAG...")

    # Build index. Commands are passed as argument lists (no shell), so paths
    # and queries containing spaces or quotes are handled correctly.
    print("\n1️⃣ Building multimodal index...")
    build_cmd = [
        sys.executable, "-m", "apps.colqwen_rag", "build",
        "--pdfs", str(pdf_dir),
        "--index", "test_attention",
        "--model", "colqwen2",
        "--pages-dir", "test_pages",
    ]
    print(f" Command: {shlex.join(build_cmd)}")

    try:
        result = subprocess.run(build_cmd).returncode
        if result == 0:
            print("✅ Index built successfully!")
        else:
            print("❌ Index building failed")
            return
    except Exception as e:
        print(f"❌ Error building index: {e}")
        return

    # Test search
    print("\n2️⃣ Testing search...")
    test_queries = [
        "How does attention mechanism work?",
        "What is the transformer architecture?",
        "How do you compute self-attention?",
    ]

    for query in test_queries:
        print(f"\n🔍 Query: '{query}'")
        search_cmd = [
            sys.executable, "-m", "apps.colqwen_rag", "search",
            "test_attention", query, "--top-k", "3",
        ]
        print(f" Command: {shlex.join(search_cmd)}")

        try:
            result = subprocess.run(search_cmd).returncode
            if result == 0:
                print("✅ Search completed")
            else:
                print("❌ Search failed")
        except Exception as e:
            print(f"❌ Search error: {e}")

    # Test interactive mode (briefly)
    print("\n3️⃣ Testing interactive mode...")
    print(" You can test interactive mode with:")
    print(" python -m apps.colqwen_rag ask test_attention --interactive")

    # Step 4: Test similarity maps (using existing script)
    print("\n4️⃣ Testing similarity maps...")
    similarity_script = (
        repo_root
        / "apps"
        / "multimodal"
        / "vision-based-pdf-multi-vector"
        / "multi-vector-leann-similarity-map.py"
    )

    if similarity_script.exists():
        print(" You can generate similarity maps with:")
        print(f" cd {similarity_script.parent}")
        print(" python multi-vector-leann-similarity-map.py")
        print(" (Edit the script to use your local PDF)")

    print("\n🎉 ColQwen reproduction test completed!")
    print("\n📋 Summary:")
    print(" ✅ Dependencies checked")
    print(" ✅ Sample PDF prepared")
    print(" ✅ Index building tested")
    print(" ✅ Search functionality tested")
    print(" ✅ Interactive mode available")
    print(" ✅ Similarity maps available")

    print("\n🔗 Related repositories to check:")
    print(" - https://github.com/lightonai/fast-plaid")
    print(" - https://github.com/lightonai/pylate")
    print(" - https://github.com/stanford-futuredata/ColBERT")

    print("\n📝 Next steps:")
    print(" 1. Test with your own PDFs")
    print(" 2. Experiment with different queries")
    print(" 3. Generate similarity maps for visual analysis")
    print(" 4. Compare ColQwen2 vs ColPali performance")


if __name__ == "__main__":
    main()