From a1c21adbced2ec957bc124d5351b07b45ea09ecc Mon Sep 17 00:00:00 2001
From: aakash
Date: Fri, 19 Dec 2025 13:57:47 -0800
Subject: [PATCH] Move COLQWEN_GUIDE.md to docs and remove test_colqwen_reproduction.py

---
 COLQWEN_GUIDE.md             | 200 -----------------------------------
 test_colqwen_reproduction.py | 162 ----------------------------
 2 files changed, 362 deletions(-)
 delete mode 100644 COLQWEN_GUIDE.md
 delete mode 100644 test_colqwen_reproduction.py

diff --git a/COLQWEN_GUIDE.md b/COLQWEN_GUIDE.md
deleted file mode 100644
index 42772f6..0000000
--- a/COLQWEN_GUIDE.md
+++ /dev/null
@@ -1,200 +0,0 @@
-# ColQwen Integration Guide
-
-Easy-to-use multimodal PDF retrieval with ColQwen2/ColPali models.
-
-## Quick Start
-
-> **🍎 Mac Users**: ColQwen is optimized for Apple Silicon with MPS acceleration for faster inference!
-
-### 1. Install Dependencies
-```bash
-uv pip install colpali_engine pdf2image pillow matplotlib qwen_vl_utils einops seaborn
-brew install poppler # macOS only, for PDF processing
-```
-
-### 2. Basic Usage
-```bash
-# Build index from PDFs
-python -m apps.colqwen_rag build --pdfs ./my_papers/ --index research_papers
-
-# Search with text queries
-python -m apps.colqwen_rag search research_papers "How does attention mechanism work?"
-
-# Interactive Q&A
-python -m apps.colqwen_rag ask research_papers --interactive
-```
-
-## Commands
-
-### Build Index
-```bash
-python -m apps.colqwen_rag build \
-    --pdfs ./pdf_directory/ \
-    --index my_index \
-    --model colqwen2 \
-    --pages-dir ./page_images/ # Optional: save page images
-```
-
-**Options:**
-- `--pdfs`: Directory containing PDF files (or single PDF path)
-- `--index`: Name for the index (required)
-- `--model`: `colqwen2` (default) or `colpali`
-- `--pages-dir`: Directory to save page images (optional)
-
-### Search Index
-```bash
-python -m apps.colqwen_rag search my_index "your question here" --top-k 5
-```
-
-**Options:**
-- `--top-k`: Number of results to return (default: 5)
-- `--model`: Model used for search (should match build model)
-
-### Interactive Q&A
-```bash
-python -m apps.colqwen_rag ask my_index --interactive
-```
-
-**Commands in interactive mode:**
-- Type your questions naturally
-- `help`: Show available commands
-- `quit`/`exit`/`q`: Exit interactive mode
-
-## 🧪 Test & Reproduce Results
-
-Run the reproduction test for issue #119:
-```bash
-python test_colqwen_reproduction.py
-```
-
-This will:
-1. ✅ Check dependencies
-2. 📥 Download sample PDF (Attention Is All You Need paper)
-3. 🏗️ Build test index
-4. 🔍 Run sample queries
-5. 📊 Show how to generate similarity maps
-
-## 🎨 Advanced: Similarity Maps
-
-For visual similarity analysis, use the existing advanced script:
-```bash
-cd apps/multimodal/vision-based-pdf-multi-vector/
-python multi-vector-leann-similarity-map.py
-```
-
-Edit the script to customize:
-- `QUERY`: Your question
-- `MODEL`: "colqwen2" or "colpali"
-- `USE_HF_DATASET`: Use HuggingFace dataset or local PDFs
-- `SIMILARITY_MAP`: Generate heatmaps
-- `ANSWER`: Enable Qwen-VL answer generation
-
-## 🔧 How It Works
-
-### ColQwen2 vs ColPali
-- **ColQwen2** (`vidore/colqwen2-v1.0`): Latest vision-language model
-- **ColPali** (`vidore/colpali-v1.2`): Proven multimodal retriever
-
-### Architecture
-1. **PDF → Images**: Convert PDF pages to images (150 DPI)
-2. **Vision Encoding**: Process images with ColQwen2/ColPali
-3. **Multi-Vector Index**: Build LEANN HNSW index with multiple embeddings per page
-4. **Query Processing**: Encode text queries with same model
-5. **Similarity Search**: Find most relevant pages/regions
-6. **Visual Maps**: Generate attention heatmaps (optional)
-
-### Device Support
-- **CUDA**: Best performance with GPU acceleration
-- **MPS**: Apple Silicon Mac support
-- **CPU**: Fallback for any system (slower)
-
-Auto-detection: CUDA > MPS > CPU
-
-## 📊 Performance Tips
-
-### For Best Performance:
-```bash
-# Use ColQwen2 for latest features
---model colqwen2
-
-# Save page images for reuse
---pages-dir ./cached_pages/
-
-# Adjust batch size based on GPU memory
-# (automatically handled)
-```
-
-### For Large Document Sets:
-- Process PDFs in batches
-- Use SSD storage for index files
-- Consider using CUDA if available
-
-## 🔗 Related Resources
-
-- **Fast-PLAID**: https://github.com/lightonai/fast-plaid
-- **Pylate**: https://github.com/lightonai/pylate
-- **ColBERT**: https://github.com/stanford-futuredata/ColBERT
-- **ColPali Paper**: Vision-Language Models for Document Retrieval
-- **Issue #119**: https://github.com/yichuan-w/LEANN/issues/119
-
-## 🐛 Troubleshooting
-
-### PDF Conversion Issues (macOS)
-```bash
-# Install poppler
-brew install poppler
-which pdfinfo && pdfinfo -v
-```
-
-### Memory Issues
-- Reduce batch size (automatically handled)
-- Use CPU instead of GPU: `export CUDA_VISIBLE_DEVICES=""`
-- Process fewer PDFs at once
-
-### Model Download Issues
-- Ensure internet connection for first run
-- Models are cached after first download
-- Use HuggingFace mirrors if needed
-
-### Import Errors
-```bash
-# Ensure all dependencies installed
-uv pip install colpali_engine pdf2image pillow matplotlib qwen_vl_utils einops seaborn
-
-# Check PyTorch installation
-python -c "import torch; print(torch.__version__)"
-```
-
-## 💡 Examples
-
-### Research Paper Analysis
-```bash
-# Index your research papers
-python -m apps.colqwen_rag build --pdfs ~/Papers/AI/ --index ai_papers
-
-# Ask research questions
-python -m apps.colqwen_rag search ai_papers "What are the limitations of transformer models?"
-python -m apps.colqwen_rag search ai_papers "How does BERT compare to GPT?"
-```
-
-### Document Q&A
-```bash
-# Index business documents
-python -m apps.colqwen_rag build --pdfs ~/Documents/Reports/ --index reports
-
-# Interactive analysis
-python -m apps.colqwen_rag ask reports --interactive
-```
-
-### Visual Analysis
-```bash
-# Generate similarity maps for specific queries
-cd apps/multimodal/vision-based-pdf-multi-vector/
-# Edit multi-vector-leann-similarity-map.py with your query
-python multi-vector-leann-similarity-map.py
-# Check ./figures/ for generated heatmaps
-```
-
----
-
-**🎯 This integration makes ColQwen as easy to use as other LEANN features while maintaining the full power of multimodal document understanding!**
diff --git a/test_colqwen_reproduction.py b/test_colqwen_reproduction.py
deleted file mode 100644
index 1e38d30..0000000
--- a/test_colqwen_reproduction.py
+++ /dev/null
@@ -1,162 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to reproduce ColQwen results from issue #119
-https://github.com/yichuan-w/LEANN/issues/119
-
-This script demonstrates the ColQwen workflow:
-1. Download sample PDF
-2. Convert to images
-3. Build multimodal index
-4. Run test queries
-5. Generate similarity maps
-"""
-
-import importlib.util
-import os
-from pathlib import Path
-
-
-def main():
-    print("🧪 ColQwen Reproduction Test - Issue #119")
-    print("=" * 50)
-
-    # Check if we're in the right directory
-    repo_root = Path.cwd()
-    if not (repo_root / "apps" / "colqwen_rag.py").exists():
-        print("❌ Please run this script from the LEANN repository root")
-        print(" cd /path/to/LEANN && python test_colqwen_reproduction.py")
-        return
-
-    print("✅ Repository structure looks good")
-
-    # Step 1: Check dependencies
-    print("\n📦 Checking dependencies...")
-    try:
-        import torch
-
-        # Check if pdf2image is available
-        if importlib.util.find_spec("pdf2image") is None:
-            raise ImportError("pdf2image not found")
-        # Check if colpali_engine is available
-        if importlib.util.find_spec("colpali_engine") is None:
-            raise ImportError("colpali_engine not found")
-
-        print("✅ Core dependencies available")
-        print(f" - PyTorch: {torch.__version__}")
-        print(f" - CUDA available: {torch.cuda.is_available()}")
-        print(
-            f" - MPS available: {hasattr(torch.backends, 'mps') and torch.backends.mps.is_available()}"
-        )
-    except ImportError as e:
-        print(f"❌ Missing dependency: {e}")
-        print("\n📥 Install missing dependencies:")
-        print(
-            " uv pip install colpali_engine pdf2image pillow matplotlib qwen_vl_utils einops seaborn"
-        )
-        return
-
-    # Step 2: Download sample PDF
-    print("\n📄 Setting up sample PDF...")
-    pdf_dir = repo_root / "test_pdfs"
-    pdf_dir.mkdir(exist_ok=True)
-    sample_pdf = pdf_dir / "attention_paper.pdf"
-
-    if not sample_pdf.exists():
-        print("📥 Downloading sample paper (Attention Is All You Need)...")
-        import urllib.request
-
-        try:
-            urllib.request.urlretrieve("https://arxiv.org/pdf/1706.03762.pdf", sample_pdf)
-            print(f"✅ Downloaded: {sample_pdf}")
-        except Exception as e:
-            print(f"❌ Download failed: {e}")
-            print(" Please manually download a PDF to test_pdfs/attention_paper.pdf")
-            return
-    else:
-        print(f"✅ Using existing PDF: {sample_pdf}")
-
-    # Step 3: Test ColQwen RAG
-    print("\n🚀 Testing ColQwen RAG...")
-
-    # Build index
-    print("\n1️⃣ Building multimodal index...")
-    build_cmd = f"python -m apps.colqwen_rag build --pdfs {pdf_dir} --index test_attention --model colqwen2 --pages-dir test_pages"
-    print(f" Command: {build_cmd}")
-
-    try:
-        result = os.system(build_cmd)
-        if result == 0:
-            print("✅ Index built successfully!")
-        else:
-            print("❌ Index building failed")
-            return
-    except Exception as e:
-        print(f"❌ Error building index: {e}")
-        return
-
-    # Test search
-    print("\n2️⃣ Testing search...")
-    test_queries = [
-        "How does attention mechanism work?",
-        "What is the transformer architecture?",
-        "How do you compute self-attention?",
-    ]
-
-    for query in test_queries:
-        print(f"\n🔍 Query: '{query}'")
-        search_cmd = f'python -m apps.colqwen_rag search test_attention "{query}" --top-k 3'
-        print(f" Command: {search_cmd}")
-
-        try:
-            result = os.system(search_cmd)
-            if result == 0:
-                print("✅ Search completed")
-            else:
-                print("❌ Search failed")
-        except Exception as e:
-            print(f"❌ Search error: {e}")
-
-    # Test interactive mode (briefly)
-    print("\n3️⃣ Testing interactive mode...")
-    print(" You can test interactive mode with:")
-    print(" python -m apps.colqwen_rag ask test_attention --interactive")
-
-    # Step 4: Test similarity maps (using existing script)
-    print("\n4️⃣ Testing similarity maps...")
-    similarity_script = (
-        repo_root
-        / "apps"
-        / "multimodal"
-        / "vision-based-pdf-multi-vector"
-        / "multi-vector-leann-similarity-map.py"
-    )
-
-    if similarity_script.exists():
-        print(" You can generate similarity maps with:")
-        print(f" cd {similarity_script.parent}")
-        print(" python multi-vector-leann-similarity-map.py")
-        print(" (Edit the script to use your local PDF)")
-
-    print("\n🎉 ColQwen reproduction test completed!")
-    print("\n📋 Summary:")
-    print(" ✅ Dependencies checked")
-    print(" ✅ Sample PDF prepared")
-    print(" ✅ Index building tested")
-    print(" ✅ Search functionality tested")
-    print(" ✅ Interactive mode available")
-    print(" ✅ Similarity maps available")
-
-    print("\n🔗 Related repositories to check:")
-    print(" - https://github.com/lightonai/fast-plaid")
-    print(" - https://github.com/lightonai/pylate")
-    print(" - https://github.com/stanford-futuredata/ColBERT")
-
-    print("\n📝 Next steps:")
-    print(" 1. Test with your own PDFs")
-    print(" 2. Experiment with different queries")
-    print(" 3. Generate similarity maps for visual analysis")
-    print(" 4. Compare ColQwen2 vs ColPali performance")
-
-
-if __name__ == "__main__":
-    main()