# LEANN Configuration Guide
This guide helps you optimize LEANN for different use cases and understand the trade-offs between various configuration options.
## Getting Started: Simple is Better
When first trying LEANN, start with a small dataset to quickly validate your approach. Use the default `data/` directory which contains just a few files - this lets you test the full pipeline in minutes rather than hours.
```bash
# Quick test with minimal data
python -m apps.document_rag --max-items 100 --query "What techniques does LEANN use?"
```
Once validated, scale up gradually:
- 100 documents → 1,000 → 10,000 → full dataset
- This helps identify issues early before committing to long processing times (see the sketch below)
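For example, a minimal scaling pass might look like the following sketch (same command as the quick test above; swap in your own query and data):

```bash
# Validate on a small slice, then grow the corpus step by step
python -m apps.document_rag --max-items 100   --query "What techniques does LEANN use?"
python -m apps.document_rag --max-items 1000  --query "What techniques does LEANN use?"
python -m apps.document_rag --max-items 10000 --query "What techniques does LEANN use?"
```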
## Embedding Model Selection: Understanding the Trade-offs
Based on our experience developing LEANN, embedding models fall into three categories:
### Small Models (384-768 dims)
**Example**: `sentence-transformers/all-MiniLM-L6-v2`
- **Pros**: Fast inference (10-50ms, 384 dims), good for real-time applications
- **Cons**: Lower semantic understanding, may miss nuanced relationships
- **Use when**: Speed is critical, handling simple queries
### Medium Models (768-1024 dims)
**Example**: `facebook/contriever`
- **Pros**: Balanced performance, good multilingual support, reasonable speed
- **Cons**: Requires more compute than small models
- **Use when**: Need quality results without extreme compute requirements
### Large Models (1024+ dims)
**Example**: `Qwen/Qwen3-Embedding`
- **Pros**: Best semantic understanding, captures complex relationships, excellent multilingual support
- **Cons**: Slow inference, high memory usage, often overkill for small datasets
- **Use when**: Quality is paramount and you have sufficient compute
### Cloud vs Local Trade-offs
**OpenAI Embeddings** (`text-embedding-3-small/large`)
- **Pros**: No local compute needed, consistently fast, high quality
- **Cons**: Requires API key, costs money, data leaves your system, [known limitations with certain languages](https://yichuan-w.github.io/blog/lessons_learned_in_dev_leann/)
- **When to use**: Prototyping, non-sensitive data, need immediate results
**Local Embeddings**
- **Pros**: Complete privacy, no ongoing costs, full control
- **Cons**: Requires GPU for good performance, setup complexity
- **When to use**: Production systems, sensitive data, cost-sensitive applications
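As a rough sketch of the two paths, assuming your LEANN build exposes an OpenAI mode through the `--embedding-mode` flag shown later in this guide (the `openai` value here is an assumption; check `--help` for the exact name):

```bash
# Cloud embeddings: requires an API key, data leaves your machine
#   ("openai" mode name is an assumption -- verify against your LEANN version)
python -m apps.document_rag --embedding-mode openai --embedding-model text-embedding-3-small

# Local embeddings: private and free to run, GPU recommended for speed
python -m apps.document_rag --embedding-model facebook/contriever
```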
## Index Selection: Matching Your Scale
### HNSW (Hierarchical Navigable Small World)
**Best for**: Small to medium datasets (< 10M vectors)
- Fast search (1-10ms latency)
- Full recomputation required (no double queue optimization)
- High memory usage during build phase
- Excellent recall (95%+)
```bash
# Optimal for most use cases
--backend-name hnsw --graph-degree 32 --build-complexity 64
```
### DiskANN
**Best for**: Large datasets (> 10M vectors, 10GB+ index size)
- Uses Product Quantization (PQ) for coarse filtering in double queue architecture
- Extremely fast search through selective recomputation
```bash
# For billion-scale deployments
--backend-name diskann --graph-degree 64 --build-complexity 128
```
## LLM Selection: Engine and Model Comparison
### LLM Engines
**OpenAI** (`--llm openai`)
- **Pros**: Best quality, consistent performance, no local resources needed
- **Cons**: Costs money ($0.15-2.5 per million tokens), requires internet, data privacy concerns
- **Models**: `gpt-4o-mini` (fast, cheap), `gpt-4o` (best quality), `o3-mini` (reasoning, moderately priced)
**Ollama** (`--llm ollama`)
- **Pros**: Fully local, free, privacy-preserving, good model variety
- **Cons**: Requires local GPU/CPU resources, slower than cloud
- **Models**: `qwen3:1.7b` (best general quality), `deepseek-r1:1.5b` (reasoning)
**HuggingFace** (`--llm hf`)
- **Pros**: Free tier available, huge model selection, direct model loading (vs Ollama's server-based approach)
- **Cons**: API rate limits, local mode needs significant resources, more complex setup
- **Models**: `Qwen/Qwen3-1.7B-FP8`
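None of the example configurations later in this guide use the HuggingFace engine, so here is a minimal sketch combining the flags above with the listed model:

```bash
# Direct model loading through HuggingFace (no Ollama server required)
python -m apps.document_rag --llm hf --llm-model Qwen/Qwen3-1.7B-FP8
```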
### Model Size Trade-offs
| Model Size | Speed | Quality | Memory | Use Case |
|------------|-------|---------|---------|----------|
| 1B params | 50-100 tok/s | Basic | 2-4GB | Quick answers, simple queries |
| 3B params | 20-50 tok/s | Good | 4-8GB | General purpose RAG |
| 7B params | 10-20 tok/s | Excellent | 8-16GB | Complex reasoning |
| 13B+ params | 5-10 tok/s | Best | 16-32GB+ | Research, detailed analysis |
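In practice, moving along this table usually means swapping `--llm-model`; a sketch using models that appear elsewhere in this guide:

```bash
# ~1B parameters: fast answers for simple queries
python -m apps.document_rag --llm ollama --llm-model llama3.2:1b

# ~8B parameters: slower, but better suited to complex reasoning
python -m apps.document_rag --llm ollama --llm-model qwen3:8b
```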
## Parameter Tuning Guide
### Search Complexity Parameters
**`--build-complexity`** (index building)
- Controls thoroughness during index construction
- Higher = better recall but slower build
- Recommendations:
- 32: Quick prototyping
- 64: Balanced (default)
- 128: Production systems
- 256: Maximum quality
**`--search-complexity`** (query time)
- Controls search thoroughness
- Higher = better results but slower
- Recommendations:
- 16: Fast/Interactive search (500-1000ms on consumer hardware)
- 32: High quality with diversity (1000-2000ms)
- 64+: Maximum accuracy (2000ms+)
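Putting the two together, a quick-prototyping run versus a production-leaning run might look like this sketch (values taken from the recommendations above):

```bash
# Prototyping: quick build, interactive search
python -m apps.document_rag --build-complexity 32 --search-complexity 16

# Production-leaning: slower build, higher recall
python -m apps.document_rag --build-complexity 128 --search-complexity 64
```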
### Top-K Selection
**`--top-k`** (number of retrieved chunks)
- More chunks = better context but slower LLM processing
- Should always be smaller than `--search-complexity`
- Guidelines:
- 3-5: Simple factual queries
- 5-10: General questions (default)
- 10+: Complex multi-hop reasoning
**Trade-off formula**:
- Retrieval time ∝ log(n) × search_complexity
- LLM processing time ∝ top_k × chunk_size
- Total context = top_k × chunk_size tokens
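For example, with `--top-k 10` and 256-token chunks, the LLM receives roughly 10 × 256 ≈ 2,560 tokens of context; doubling `--top-k` doubles the LLM's work, while retrieval time grows only with `search_complexity` and the logarithm of the corpus size.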
### Graph Degree (HNSW/DiskANN)
**`--graph-degree`**
- Number of connections per node in the graph
- Higher = better recall but more memory
- HNSW: 16-32 (default: 32)
- DiskANN: 32-128 (default: 64)
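For example, at the two ends of these ranges (a sketch; combine with the other flags from the backend sections above):

```bash
# Memory-lean HNSW graph
python -m apps.document_rag --backend-name hnsw --graph-degree 16

# Recall-heavy DiskANN graph (more memory per node)
python -m apps.document_rag --backend-name diskann --graph-degree 128
```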
## Common Configurations by Use Case
### 1. Quick Experimentation
```bash
python -m apps.document_rag \
--max-items 1000 \
--embedding-model sentence-transformers/all-MiniLM-L6-v2 \
--backend-name hnsw \
--llm ollama --llm-model llama3.2:1b
```
### 2. Personal Knowledge Base
```bash
python -m apps.document_rag \
--embedding-model facebook/contriever \
--chunk-size 512 --chunk-overlap 128 \
--backend-name hnsw \
--llm ollama --llm-model llama3.2:3b
```
### 3. Production RAG System
```bash
python -m apps.document_rag \
--embedding-model BAAI/bge-base-en-v1.5 \
--chunk-size 256 --chunk-overlap 64 \
--backend-name diskann \
--llm openai --llm-model gpt-4o-mini \
--top-k 20 --search-complexity 64
```
### 4. Multilingual Support (e.g., WeChat)
```bash
python -m apps.wechat_rag \
--embedding-model intfloat/multilingual-e5-base \
--chunk-size 192 --chunk-overlap 48 \
--backend-name hnsw \
--llm ollama --llm-model qwen3:8b
```
## Performance Optimization Checklist
### If Embedding is Too Slow
1. **Switch to a smaller model**:
```bash
# From large model
--embedding-model Qwen/Qwen3-Embedding
# To small model
--embedding-model sentence-transformers/all-MiniLM-L6-v2
```
2. **Use MLX on Apple Silicon**:
```bash
--embedding-mode mlx --embedding-model mlx-community/multilingual-e5-base-mlx
```
3. **Process in batches**:
```bash
--max-items 10000 # Process incrementally
```
### If Search Quality is Poor
1. **Increase retrieval count**:
```bash
--top-k 30 # Retrieve more candidates
```
2. **Tune chunk size for your content**:
- Technical docs: `--chunk-size 512`
- Chat messages: `--chunk-size 128`
- Mixed content: `--chunk-size 256`
3. **Upgrade your embedding model**:
```bash
# For English
--embedding-model BAAI/bge-base-en-v1.5
# For multilingual
--embedding-model intfloat/multilingual-e5-large
```
## Understanding the Trade-offs
Every configuration choice involves trade-offs:
| Factor | Small/Fast | Large/Quality |
|--------|------------|---------------|
| Embedding Model | all-MiniLM-L6-v2 | BAAI/bge-large |
| Chunk Size | 128 tokens | 512 tokens |
| Index Type | HNSW | DiskANN |
| LLM | llama3.2:1b | gpt-4o |
The key is finding the right balance for your specific use case. Start small and simple, measure performance, then scale up only where needed.
## Deep Dive: Critical Configuration Decisions
### When to Disable Recomputation
LEANN's recomputation feature provides exact distance calculations but can be disabled for extreme QPS requirements:
```bash
--no-recompute # Disable selective recomputation
```
**Trade-offs**:
- **With recomputation** (default): Exact distances, best quality, higher latency
- **Without recomputation**: Approximate distances via PQ, 2-5x faster, significantly lower memory and storage usage
**Disable when**:
- QPS requirements > 1000/sec
- Slight accuracy loss is acceptable
- Running on resource-constrained hardware
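A full command for such a high-QPS, resource-constrained setup might look like the following sketch (flags reused from earlier sections; DiskANN is shown because the PQ approximation described above comes from its double queue design):

```bash
# Approximate PQ distances only: 2-5x faster, some accuracy loss
python -m apps.document_rag \
  --backend-name diskann \
  --no-recompute \
  --search-complexity 16 \
  --top-k 5
```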
## Performance Monitoring
Key metrics to watch:
- Index build time
- Query latency (p50, p95, p99)
- Memory usage during build and search
- Disk I/O patterns (for DiskANN)
- Recomputation ratio (% of candidates recomputed)
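For a coarse first pass at these numbers, standard OS utilities are enough (shown for Linux with GNU time; on macOS, `/usr/bin/time -l` reports similar fields):

```bash
# Wall-clock time and peak resident memory for one end-to-end query.
# Note: this includes embedding, search, and LLM generation, so it is an
# upper bound on pure query latency.
/usr/bin/time -v python -m apps.document_rag \
  --query "What techniques does LEANN use?" \
  --top-k 5 --search-complexity 16
```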
## Further Reading
- [Lessons Learned Developing LEANN](https://yichuan-w.github.io/blog/lessons_learned_in_dev_leann/)
- [LEANN Technical Paper](https://arxiv.org/abs/2506.08276)
- [DiskANN Original Paper](https://papers.nips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf)