From 0d448c4a4187a35f084638641eb50861cbdfbd74 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Mon, 4 Aug 2025 22:50:32 -0700 Subject: [PATCH] docs: config guidance (#17) * docs: config guidance * feat: add comprehensive configuration guide and update README - Create docs/configuration-guide.md with detailed guidance on: - Embedding model selection (small/medium/large) - Index selection (HNSW vs DiskANN) - LLM engine and model comparison - Parameter tuning (build/search complexity, top-k) - Performance optimization tips - Deep dive into LEANN's recomputation feature - Update README.md to link to the configuration guide - Include latest 2025 model recommendations (Qwen3, DeepSeek-R1, O3-mini) * chore: move evaluation data .gitattributes to correct location * docs: Weaken DiskANN emphasis in README - Change backend description to emphasize HNSW as default - DiskANN positioned as optional for billion-scale datasets - Simplify evaluation commands to be more generic * docs: Adjust DiskANN positioning in features and roadmap - features.md: Put HNSW/FAISS first as default, DiskANN as optional - roadmap.md: Reorder to show HNSW integration before DiskANN - Consistent with positioning DiskANN as advanced option for large-scale use * docs: Improve configuration guide based on feedback - List specific files in default data/ directory (2 AI papers, literature, tech report) - Update examples to use English and better RAG-suitable queries - Change full dataset reference to use --max-items -1 - Adjust small model guidance about upgrading to larger models when time allows - Update top-k defaults to reflect actual default of 20 - Ensure consistent use of full model name Qwen/Qwen3-Embedding-0.6B - Reorder optimization steps, move MLX to third position - Remove incorrect chunk size tuning guidance - Change README from 'Having trouble' to 'Need best practices' * docs: Address all configuration guide feedback - Fix grammar: 'If time is not a constraint' instead of 'time expense is not large' - Highlight Qwen3-Embedding-0.6B performance (nearly OpenAI API level) - Add OpenAI quick start section with configuration example - Fold Cloud vs Local trade-offs into collapsible section - Update HNSW as 'default and recommended for extreme low storage' - Add DiskANN beta warning and explain PQ+rerank architecture - Expand Ollama models: add qwen3:0.6b, 4b, 7b variants - Note OpenAI as current default but recommend Ollama switch - Add 'need to install extra software' warning for Ollama - Remove incorrect latency numbers from search-complexity recommendations * docs: add a link --- README.md | 7 +- {data => benchmarks/data}/.gitattributes | 0 docs/configuration-guide.md | 236 ++++++++++++++++++ docs/features.md | 2 +- docs/roadmap.md | 2 +- .../leann-backend-diskann/third_party/DiskANN | 2 +- 6 files changed, 243 insertions(+), 6 deletions(-) rename {data => benchmarks/data}/.gitattributes (100%) create mode 100644 docs/configuration-guide.md diff --git a/README.md b/README.md index 332c11c..5fa5248 100755 --- a/README.md +++ b/README.md @@ -170,6 +170,8 @@ ollama pull llama3.2:1b LEANN provides flexible parameters for embedding models, search strategies, and data processing to fit your specific needs. +📚 **Need configuration best practices?** Check our [Configuration Guide](docs/configuration-guide.md) for detailed optimization tips, model selection advice, and solutions to common issues like slow embeddings or poor search quality. +
📋 Click to expand: Common Parameters (Available in All Examples) @@ -514,7 +516,7 @@ Options: - **Dynamic batching:** Efficiently batch embedding computations for GPU utilization - **Two-level search:** Smart graph traversal that prioritizes promising nodes -**Backends:** DiskANN or HNSW - pick what works for your data size. +**Backends:** HNSW (default) for most use cases, with optional DiskANN support for billion-scale datasets. ## Benchmarks @@ -534,8 +536,7 @@ Options: ```bash uv pip install -e ".[dev]" # Install dev dependencies -python benchmarks/run_evaluation.py data/indices/dpr/dpr_diskann # DPR dataset -python benchmarks/run_evaluation.py data/indices/rpj_wiki/rpj_wiki.index # Wikipedia +python benchmarks/run_evaluation.py # Will auto-download evaluation data and run benchmarks ``` The evaluation script downloads data automatically on first run. The last three results were tested with partial personal data, and you can reproduce them with your own data! diff --git a/data/.gitattributes b/benchmarks/data/.gitattributes similarity index 100% rename from data/.gitattributes rename to benchmarks/data/.gitattributes diff --git a/docs/configuration-guide.md b/docs/configuration-guide.md new file mode 100644 index 0000000..1546440 --- /dev/null +++ b/docs/configuration-guide.md @@ -0,0 +1,236 @@ +# LEANN Configuration Guide + +This guide helps you optimize LEANN for different use cases and understand the trade-offs between various configuration options. + +## Getting Started: Simple is Better + +When first trying LEANN, start with a small dataset to quickly validate your approach: + +**For document RAG**: The default `data/` directory works perfectly - includes 2 AI research papers, Pride and Prejudice literature, and a technical report +```bash +python -m apps.document_rag --query "What techniques does LEANN use?" +``` + +**For other data sources**: Limit the dataset size for quick testing +```bash +# WeChat: Test with recent messages only +python -m apps.wechat_rag --max-items 100 --query "What did we discuss about the project timeline?" + +# Browser history: Last few days +python -m apps.browser_rag --max-items 500 --query "Find documentation about vector databases" + +# Email: Recent inbox +python -m apps.email_rag --max-items 200 --query "Who sent updates about the deployment status?" +``` + +Once validated, scale up gradually: +- 100 documents → 1,000 → 10,000 → full dataset (`--max-items -1`) +- This helps identify issues early before committing to long processing times + +## Embedding Model Selection: Understanding the Trade-offs + +Based on our experience developing LEANN, embedding models fall into three categories: + +### Small Models (< 100M parameters) +**Example**: `sentence-transformers/all-MiniLM-L6-v2` (22M params) +- **Pros**: Lightweight, fast for both indexing and inference +- **Cons**: Lower semantic understanding, may miss nuanced relationships +- **Use when**: Speed is critical, handling simple queries, interactive mode, or just experimenting with LEANN. 
If time is not a constraint, consider upgrading to a larger, higher-quality embedding model.
+
+### Medium Models (100M-500M parameters)
+**Examples**: `facebook/contriever` (110M params), `BAAI/bge-base-en-v1.5` (110M params)
+- **Pros**: Balanced performance, good multilingual support, reasonable speed
+- **Cons**: Requires more compute than small models
+- **Use when**: You need quality results without extreme compute requirements; a good fit for general-purpose RAG applications
+
+### Large Models (500M+ parameters)
+**Examples**: `Qwen/Qwen3-Embedding-0.6B` (600M params), `intfloat/multilingual-e5-large` (560M params)
+- **Pros**: Best semantic understanding, captures complex relationships, excellent multilingual support. **Qwen3-Embedding-0.6B achieves nearly OpenAI-API-level performance!**
+- **Cons**: Slower inference, longer index build times
+- **Use when**: Quality is paramount and you have sufficient compute resources. **Highly recommended** for production use
+
+### Quick Start: OpenAI Embeddings (Fastest Setup)
+
+For immediate testing without local model downloads:
+```bash
+# Set OpenAI embeddings (requires OPENAI_API_KEY)
+--embedding-mode openai --embedding-model text-embedding-3-small
+```
+
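+For example, a quick end-to-end test with OpenAI embeddings could look like the sketch below. The entry point and query reuse the document-RAG example from the Getting Started section; treat the exact combination of flags as illustrative rather than the only supported invocation.
+
+```bash
+# Assumes OPENAI_API_KEY is already set in your environment
+python -m apps.document_rag \
+  --embedding-mode openai \
+  --embedding-model text-embedding-3-small \
+  --query "What techniques does LEANN use?"
+```
+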
+<details>
+<summary>Cloud vs Local Trade-offs</summary>
+
+**OpenAI Embeddings** (`text-embedding-3-small/large`)
+- **Pros**: No local compute needed, consistently fast, high quality
+- **Cons**: Requires an API key, costs money, data leaves your system, [known limitations with certain languages](https://yichuan-w.github.io/blog/lessons_learned_in_dev_leann/)
+- **When to use**: Prototyping, non-sensitive data, or when you need immediate results
+
+**Local Embeddings**
+- **Pros**: Complete privacy, no ongoing costs, full control, can sometimes outperform OpenAI embeddings
+- **Cons**: Slower than cloud APIs, requires local compute resources
+- **When to use**: Production systems, sensitive data, cost-sensitive applications
+
+</details>
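+
+As a fully local counterpart to the OpenAI quick start above, the sketch below combines a local embedding model with an Ollama LLM. The flags are the ones documented in this guide; the specific models are illustrative, and how the Ollama chat model is selected may depend on the app's own options.
+
+```bash
+# Everything stays on your machine: local embeddings + local LLM
+ollama pull qwen3:1.7b   # pre-download the chat model once
+python -m apps.document_rag \
+  --embedding-model facebook/contriever \
+  --llm ollama \
+  --query "What techniques does LEANN use?"
+```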
+
+## Index Selection: Matching Your Scale
+
+### HNSW (Hierarchical Navigable Small World)
+**Best for**: Small to medium datasets (< 10M vectors) - **Default and recommended for extremely low storage**
+- Full recomputation required
+- High memory usage during build phase
+- Excellent recall (95%+)
+
+```bash
+# Optimal for most use cases
+--backend-name hnsw --graph-degree 32 --build-complexity 64
+```
+
+### DiskANN
+**Best for**: Large datasets (> 10M vectors, 10GB+ index size) - **⚠️ Beta version, still in active development**
+- Uses Product Quantization (PQ) for coarse filtering during graph traversal
+- Novel approach: stores only PQ codes and reranks with exact computation in the final step
+- Implements a corner case of the double-queue design: prunes all neighbors and recomputes distances at the end
+
+```bash
+# For billion-scale deployments
+--backend-name diskann --graph-degree 64 --build-complexity 128
+```
+
+## LLM Selection: Engine and Model Comparison
+
+### LLM Engines
+
+**OpenAI** (`--llm openai`)
+- **Pros**: Best quality, consistent performance, no local resources needed
+- **Cons**: Costs money ($0.15-2.5 per million tokens), requires internet, data privacy concerns
+- **Models**: `gpt-4o-mini` (fast, cheap), `gpt-4o` (best quality), `o3-mini` (reasoning, relatively inexpensive)
+- **Note**: Our current default, but we recommend switching to Ollama for most use cases
+
+**Ollama** (`--llm ollama`)
+- **Pros**: Fully local, free, privacy-preserving, good model variety
+- **Cons**: Requires local GPU/CPU resources, slower than cloud APIs, and needs the separate [Ollama app](https://github.com/ollama/ollama?tab=readme-ov-file#ollama) installed plus models pre-downloaded with `ollama pull`
+- **Models**: `qwen3:0.6b` (ultra-fast), `qwen3:1.7b` (balanced), `qwen3:4b` (good quality), `qwen3:8b` (high quality), `deepseek-r1:1.5b` (reasoning)
+
+**HuggingFace** (`--llm hf`)
+- **Pros**: Free tier available, huge model selection, direct model loading (vs Ollama's server-based approach)
+- **Cons**: More complex initial setup
+- **Models**: `Qwen/Qwen3-1.7B-FP8`
+
+## Parameter Tuning Guide
+
+### Search Complexity Parameters
+
+**`--build-complexity`** (index building)
+- Controls thoroughness during index construction
+- Higher = better recall but slower build
+- Recommendations:
+  - 32: Quick prototyping
+  - 64: Balanced (default)
+  - 128: Production systems
+  - 256: Maximum quality
+
+**`--search-complexity`** (query time)
+- Controls search thoroughness
+- Higher = better results but slower
+- Recommendations:
+  - 16: Fast/interactive search
+  - 32: High quality with diversity
+  - 64+: Maximum accuracy
+
+### Top-K Selection
+
+**`--top-k`** (number of retrieved chunks)
+- More chunks = better context but slower LLM processing
+- Should always be smaller than `--search-complexity`
+- Guidelines:
+  - 10-20: General questions (default: 20)
+  - 30+: Complex multi-hop reasoning requiring comprehensive context
+
+**Trade-off formula**:
+- Retrieval time ∝ log(n) × search_complexity
+- LLM processing time ∝ top_k × chunk_size
+- Total context = top_k × chunk_size tokens
+
+### Graph Degree (HNSW/DiskANN)
+
+**`--graph-degree`**
+- Number of connections per node in the graph
+- Higher = better recall but more memory
+- HNSW: 16-32 (default: 32)
+- DiskANN: 32-128 (default: 64)
+
+
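+Putting the tuning parameters together, a balanced starting point might look like the sketch below. The flag values follow the recommendations above; the document-RAG entry point and query are just placeholders for your own workload.
+
+```bash
+# Balanced starting configuration: HNSW backend, moderate effort, default top-k
+python -m apps.document_rag \
+  --backend-name hnsw \
+  --graph-degree 32 \
+  --build-complexity 64 \
+  --search-complexity 32 \
+  --top-k 20 \
+  --query "What techniques does LEANN use?"
+```
+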
+## Performance Optimization Checklist
+
+### If Embedding is Too Slow
+
+1. **Switch to a smaller model**:
+   ```bash
+   # From large model
+   --embedding-model Qwen/Qwen3-Embedding-0.6B
+   # To small model
+   --embedding-model sentence-transformers/all-MiniLM-L6-v2
+   ```
+
+2. **Limit dataset size for testing**:
+   ```bash
+   --max-items 1000 # Process first 1k items only
+   ```
+
+3. **Use MLX on Apple Silicon** (optional optimization):
+   ```bash
+   --embedding-mode mlx --embedding-model mlx-community/multilingual-e5-base-mlx
+   ```
+
+### If Search Quality is Poor
+
+1. **Increase retrieval count**:
+   ```bash
+   --top-k 30 # Retrieve more candidates
+   ```
+
+2. **Upgrade the embedding model**:
+   ```bash
+   # For English
+   --embedding-model BAAI/bge-base-en-v1.5
+   # For multilingual
+   --embedding-model intfloat/multilingual-e5-large
+   ```
+
+## Understanding the Trade-offs
+
+Every configuration choice involves trade-offs:
+
+| Factor | Small/Fast | Large/Quality |
+|--------|------------|---------------|
+| Embedding Model | `all-MiniLM-L6-v2` | `Qwen/Qwen3-Embedding-0.6B` |
+| Chunk Size | 512 tokens | 128 tokens |
+| Index Type | HNSW | DiskANN |
+| LLM | `qwen3:1.7b` | `gpt-4o` |
+
+The key is finding the right balance for your specific use case. Start small and simple, measure performance, then scale up only where needed.
+
+## Deep Dive: Critical Configuration Decisions
+
+### When to Disable Recomputation
+
+LEANN's recomputation feature provides exact distance calculations but can be disabled for extreme QPS requirements:
+
+```bash
+--no-recompute # Disable selective recomputation
+```
+
+**Trade-offs**:
+- **With recomputation** (default): Exact distances, best quality, higher latency, minimal storage (only stores metadata, recomputes embeddings on demand)
+- **Without recomputation**: Must store full embeddings, significantly higher memory and storage usage (10-100x more), but faster search
+
+**Disable when**:
+- You have abundant storage and memory
+- You need extremely low latency (< 100ms)
+- You are running a read-heavy workload where storage cost is acceptable
+
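+For a latency-critical, read-heavy deployment with storage to spare, the flag is simply added to the usual invocation (a sketch; the entry point and backend flag mirror the earlier examples):
+
+```bash
+# Trade storage for speed: keep full embeddings and skip selective recomputation
+python -m apps.document_rag \
+  --backend-name hnsw \
+  --no-recompute \
+  --query "What techniques does LEANN use?"
+```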
+
+## Further Reading
+
+- [Lessons Learned Developing LEANN](https://yichuan-w.github.io/blog/lessons_learned_in_dev_leann/)
+- [LEANN Technical Paper](https://arxiv.org/abs/2506.08276)
+- [DiskANN Original Paper](https://papers.nips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf) diff --git a/docs/features.md b/docs/features.md index 51c0c4f..da4e495 100644 --- a/docs/features.md +++ b/docs/features.md @@ -5,7 +5,7 @@ - **🔄 Real-time Embeddings** - Eliminate heavy embedding storage with dynamic computation using optimized ZMQ servers and highly optimized search paradigm (overlapping and batching) with highly optimized embedding engine - **📈 Scalable Architecture** - Handles millions of documents on consumer hardware; the larger your dataset, the more LEANN can save - **🎯 Graph Pruning** - Advanced techniques to minimize the storage overhead of vector search to a limited footprint -- **🏗️ Pluggable Backends** - DiskANN, HNSW/FAISS with unified API +- **🏗️ Pluggable Backends** - HNSW/FAISS (default), with optional DiskANN for large-scale deployments ## 🛠️ Technical Highlights - **🔄 Recompute Mode** - Highest accuracy scenarios while eliminating vector storage overhead diff --git a/docs/roadmap.md b/docs/roadmap.md index c9446df..fa04b5c 100644 --- a/docs/roadmap.md +++ b/docs/roadmap.md @@ -2,8 +2,8 @@ ## 🎯 Q2 2025 -- [X] DiskANN backend with MIPS/L2/Cosine support - [X] HNSW backend integration +- [X] DiskANN backend with MIPS/L2/Cosine support - [X] Real-time embedding pipeline - [X] Memory-efficient graph pruning diff --git a/packages/leann-backend-diskann/third_party/DiskANN b/packages/leann-backend-diskann/third_party/DiskANN index af2a264..67a2611 160000 --- a/packages/leann-backend-diskann/third_party/DiskANN +++ b/packages/leann-backend-diskann/third_party/DiskANN @@ -1 +1 @@ -Subproject commit af2a26481e65232b57b82d96e68833cdee9f7635 +Subproject commit 67a2611ad14bc11d84dfdb554c5567cfb78a2656