diff --git a/README.md b/README.md
index 332c11c..67ed35a 100755
--- a/README.md
+++ b/README.md
@@ -170,6 +170,13 @@ ollama pull llama3.2:1b
 
 LEANN provides flexible parameters for embedding models, search strategies, and data processing to fit your specific needs.
 
+📚 **Having trouble with configuration?** Check our [Configuration Guide](docs/configuration-guide.md) for:
+- Quick start configurations for each use case
+- Solutions for "embedding too slow" issues
+- How to choose the right chat model
+- Fixing poor search quality
+- Performance optimization tips
+
 📋 Click to expand: Common Parameters (Available in All Examples)
diff --git a/docs/configuration-guide.md b/docs/configuration-guide.md
new file mode 100644
index 0000000..d949a16
--- /dev/null
+++ b/docs/configuration-guide.md
@@ -0,0 +1,275 @@
+# LEANN Configuration Guide
+
+This guide helps you optimize LEANN for different use cases and understand the trade-offs between the various configuration options.
+
+## Getting Started: Simple is Better
+
+When first trying LEANN, start with a small dataset to quickly validate your approach. Use the default `data/` directory, which contains just a few files; this lets you test the full pipeline in minutes rather than hours.
+
+```bash
+# Quick test with minimal data
+python -m apps.document_rag --max-items 100 --query "What techniques does LEANN use?"
+```
+
+Once validated, scale up gradually (see the sketch below):
+- 100 documents → 1,000 → 10,000 → full dataset
+- This helps identify issues early, before committing to long processing times
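+
+If you want to make the staged scale-up concrete, a small loop works well. The sketch below assumes each run rebuilds its index for the given `--max-items` value, and uses the shell's `time` as a rough cost measure:
+
+```bash
+# Sketch: validate the pipeline at increasing scales before a full run.
+for n in 100 1000 10000; do
+  echo "=== max-items: $n ==="
+  time python -m apps.document_rag --max-items "$n" \
+    --query "What techniques does LEANN use?"
+done
+```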
+
+## Embedding Model Selection: Understanding the Trade-offs
+
+Based on our experience developing LEANN, embedding models fall into three categories:
+
+### Small Models (384-768 dims)
+**Example**: `sentence-transformers/all-MiniLM-L6-v2`
+- **Pros**: Fast inference (10-50ms at 384 dims), good for real-time applications
+- **Cons**: Lower semantic understanding, may miss nuanced relationships
+- **Use when**: Speed is critical or queries are simple
+
+### Medium Models (768-1024 dims)
+**Example**: `facebook/contriever`
+- **Pros**: Balanced performance, good multilingual support, reasonable speed
+- **Cons**: Requires more compute than small models
+- **Use when**: You need quality results without extreme compute requirements
+
+### Large Models (1024+ dims)
+**Example**: `Qwen/Qwen3-Embedding`
+- **Pros**: Best semantic understanding, captures complex relationships, excellent multilingual support
+- **Cons**: Slow inference, high memory usage, can be overkill for small datasets
+- **Use when**: Quality is paramount and you have sufficient compute
+
+### Cloud vs Local Trade-offs
+
+**OpenAI Embeddings** (`text-embedding-3-small/large`)
+- **Pros**: No local compute needed, consistently fast, high quality
+- **Cons**: Requires an API key, costs money, data leaves your system, [known limitations with certain languages](https://yichuan-w.github.io/blog/lessons_learned_in_dev_leann/)
+- **When to use**: Prototyping, non-sensitive data, or when you need results immediately
+
+**Local Embeddings**
+- **Pros**: Complete privacy, no ongoing costs, full control
+- **Cons**: Requires a GPU for good performance, more setup complexity
+- **When to use**: Production systems, sensitive data, cost-sensitive applications
+
+## Index Selection: Matching Your Scale
+
+### HNSW (Hierarchical Navigable Small World)
+**Best for**: Small to medium datasets (< 10M vectors)
+- Fast search (1-10ms latency)
+- Full recomputation required (no double-queue optimization)
+- High memory usage during the build phase
+- Excellent recall (95%+)
+
+```bash
+# Optimal for most use cases
+--backend-name hnsw --graph-degree 32 --build-complexity 64
+```
+
+### DiskANN
+**Best for**: Large datasets (> 10M vectors, 10GB+ index size)
+- Uses Product Quantization (PQ) for coarse filtering in its double-queue architecture
+- Extremely fast search through selective recomputation
+
+```bash
+# For billion-scale deployments
+--backend-name diskann --graph-degree 64 --build-complexity 128
+```
+
+## LLM Selection: Engine and Model Comparison
+
+### LLM Engines
+
+**OpenAI** (`--llm openai`)
+- **Pros**: Best quality, consistent performance, no local resources needed
+- **Cons**: Costs money ($0.15-2.50 per million tokens), requires internet access, data privacy concerns
+- **Models**: `gpt-4o-mini` (fast, cheap), `gpt-4o` (best quality), `o3-mini` (reasoning, relatively inexpensive)
+
+**Ollama** (`--llm ollama`)
+- **Pros**: Fully local, free, privacy-preserving, good model variety
+- **Cons**: Requires local GPU/CPU resources, slower than cloud
+- **Models**: `qwen3:1.7b` (best general quality), `deepseek-r1:1.5b` (reasoning)
+
+**HuggingFace** (`--llm hf`)
+- **Pros**: Free tier available, huge model selection, direct model loading (vs Ollama's server-based approach)
+- **Cons**: API rate limits, local mode needs significant resources, more complex setup
+- **Models**: `Qwen/Qwen3-1.7B-FP8`
+
+### Model Size Trade-offs
+
+| Model Size | Speed | Quality | Memory | Use Case |
+|------------|-------|---------|---------|----------|
+| 1B params | 50-100 tok/s | Basic | 2-4GB | Quick answers, simple queries |
+| 3B params | 20-50 tok/s | Good | 4-8GB | General-purpose RAG |
+| 7B params | 10-20 tok/s | Excellent | 8-16GB | Complex reasoning |
+| 13B+ params | 5-10 tok/s | Best | 16-32GB+ | Research, detailed analysis |
+
+## Parameter Tuning Guide
+
+### Search Complexity Parameters
+
+**`--build-complexity`** (index building)
+- Controls thoroughness during index construction
+- Higher = better recall but slower builds
+- Recommendations:
+  - 32: Quick prototyping
+  - 64: Balanced (default)
+  - 128: Production systems
+  - 256: Maximum quality
+
+**`--search-complexity`** (query time)
+- Controls search thoroughness
+- Higher = better results but slower queries
+- Recommendations:
+  - 16: Fast, interactive search (500-1000ms on consumer hardware)
+  - 32: High quality with diversity (1000-2000ms)
+  - 64+: Maximum accuracy (2000ms+)
+
+### Top-K Selection
+
+**`--top-k`** (number of retrieved chunks)
+- More chunks = better context but slower LLM processing
+- Should always be smaller than `--search-complexity`
+- Guidelines:
+  - 3-5: Simple factual queries
+  - 5-10: General questions (default)
+  - 10+: Complex multi-hop reasoning
+
+**Trade-off formula** (see the timing sketch below):
+- Retrieval time ∝ log(n) × search_complexity
+- LLM processing time ∝ top_k × chunk_size
+- Total context = top_k × chunk_size tokens
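+
+To see where these numbers land on your own hardware, sweep `--top-k` and `--search-complexity` and time each run. This is a rough sketch rather than a benchmark harness; it assumes the shell's `time` output is precise enough to compare settings:
+
+```bash
+# Sketch: measure the latency/quality trade-off on your data.
+# Keep top-k below search-complexity, as recommended above.
+for complexity in 16 32 64; do
+  for k in 3 5 10; do
+    echo "=== search-complexity=$complexity top-k=$k ==="
+    time python -m apps.document_rag \
+      --search-complexity "$complexity" --top-k "$k" \
+      --query "What techniques does LEANN use?"
+  done
+done
+```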
+
+### Graph Degree (HNSW/DiskANN)
+
+**`--graph-degree`**
+- Number of connections per node in the graph
+- Higher = better recall but more memory
+- HNSW: 16-32 (default: 32)
+- DiskANN: 32-128 (default: 64)
+
+## Common Configurations by Use Case
+
+### 1. Quick Experimentation
+```bash
+python -m apps.document_rag \
+  --max-items 1000 \
+  --embedding-model sentence-transformers/all-MiniLM-L6-v2 \
+  --backend-name hnsw \
+  --llm ollama --llm-model llama3.2:1b
+```
+
+### 2. Personal Knowledge Base
+```bash
+python -m apps.document_rag \
+  --embedding-model facebook/contriever \
+  --chunk-size 512 --chunk-overlap 128 \
+  --backend-name hnsw \
+  --llm ollama --llm-model llama3.2:3b
+```
+
+### 3. Production RAG System
+```bash
+python -m apps.document_rag \
+  --embedding-model BAAI/bge-base-en-v1.5 \
+  --chunk-size 256 --chunk-overlap 64 \
+  --backend-name diskann \
+  --llm openai --llm-model gpt-4o-mini \
+  --top-k 20 --search-complexity 64
+```
+
+### 4. Multilingual Support (e.g., WeChat)
+```bash
+python -m apps.wechat_rag \
+  --embedding-model intfloat/multilingual-e5-base \
+  --chunk-size 192 --chunk-overlap 48 \
+  --backend-name hnsw \
+  --llm ollama --llm-model qwen3:8b
+```
+
+## Performance Optimization Checklist
+
+### If Embedding is Too Slow
+
+1. **Switch to a smaller model**:
+   ```bash
+   # From a large model
+   --embedding-model Qwen/Qwen3-Embedding
+   # to a small one
+   --embedding-model sentence-transformers/all-MiniLM-L6-v2
+   ```
+
+2. **Use MLX on Apple Silicon**:
+   ```bash
+   --embedding-mode mlx --embedding-model mlx-community/multilingual-e5-base-mlx
+   ```
+
+3. **Process in batches**:
+   ```bash
+   --max-items 10000  # Process incrementally
+   ```
+
+### If Search Quality is Poor
+
+1. **Increase the retrieval count**:
+   ```bash
+   --top-k 30  # Retrieve more candidates
+   ```
+
+2. **Tune chunk size for your content**:
+   - Technical docs: `--chunk-size 512`
+   - Chat messages: `--chunk-size 128`
+   - Mixed content: `--chunk-size 256`
+
+3. **Upgrade the embedding model**:
+   ```bash
+   # For English
+   --embedding-model BAAI/bge-base-en-v1.5
+   # For multilingual content
+   --embedding-model intfloat/multilingual-e5-large
+   ```
+
+## Understanding the Trade-offs
+
+Every configuration choice involves trade-offs:
+
+| Factor | Small/Fast | Large/Quality |
+|--------|------------|---------------|
+| Embedding Model | all-MiniLM-L6-v2 | BAAI/bge-large |
+| Chunk Size | 128 tokens | 512 tokens |
+| Index Type | HNSW | DiskANN |
+| LLM | llama3.2:1b | gpt-4o |
+
+The key is finding the right balance for your specific use case. Start small and simple, measure performance, then scale up only where needed.
+
+## Deep Dive: Critical Configuration Decisions
+
+### When to Disable Recomputation
+
+LEANN's recomputation feature provides exact distance calculations, but it can be disabled for extreme QPS requirements:
+
+```bash
+--no-recompute  # Disable selective recomputation
+```
+
+**Trade-offs**:
+- **With recomputation** (default): Exact distances, best quality, higher latency
+- **Without recomputation**: Approximate distances via PQ, 2-5x faster, significantly lower memory and storage usage
+
+**Disable when** (see the comparison sketch below):
+- QPS requirements > 1000/sec
+- A slight accuracy loss is acceptable
+- Running on resource-constrained hardware
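+
+To quantify the recomputation trade-off on your own index, time the same query in both modes. This is a minimal sketch; it assumes `apps.document_rag` accepts the `--no-recompute` flag shown above and that end-to-end wall-clock time is a reasonable proxy for search latency:
+
+```bash
+# Sketch: exact distances (default) vs. PQ approximation (--no-recompute).
+QUERY="What techniques does LEANN use?"
+echo "=== with recomputation (exact) ==="
+time python -m apps.document_rag --query "$QUERY"
+echo "=== without recomputation (approximate) ==="
+time python -m apps.document_rag --no-recompute --query "$QUERY"
+```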
+
+## Performance Monitoring
+
+Key metrics to watch:
+- Index build time
+- Query latency (p50, p95, p99)
+- Memory usage during build and search
+- Disk I/O patterns (for DiskANN)
+- Recomputation ratio (% of candidates recomputed)
+
+## Further Reading
+
+- [Lessons Learned Developing LEANN](https://yichuan-w.github.io/blog/lessons_learned_in_dev_leann/)
+- [LEANN Technical Paper](https://arxiv.org/abs/2506.08276)
+- [DiskANN Original Paper](https://papers.nips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf)
diff --git a/packages/leann-backend-diskann/third_party/DiskANN b/packages/leann-backend-diskann/third_party/DiskANN
index af2a264..67a2611 160000
--- a/packages/leann-backend-diskann/third_party/DiskANN
+++ b/packages/leann-backend-diskann/third_party/DiskANN
@@ -1 +1 @@
-Subproject commit af2a26481e65232b57b82d96e68833cdee9f7635
+Subproject commit 67a2611ad14bc11d84dfdb554c5567cfb78a2656