docs: config guidance
@@ -170,6 +170,13 @@ ollama pull llama3.2:1b
LEANN provides flexible parameters for embedding models, search strategies, and data processing to fit your specific needs.

📚 **Having trouble with configuration?** Check our [Configuration Guide](docs/configuration-guide.md) for:

- Quick start configurations for each use case
- Solutions for "embedding too slow" issues
- How to choose the right chat model
- Fixing poor search quality
- Performance optimization tips

<details>
<summary><strong>📋 Click to expand: Common Parameters (Available in All Examples)</strong></summary>
docs/configuration-guide.md (new file, 275 lines)
@@ -0,0 +1,275 @@
# LEANN Configuration Guide

This guide helps you optimize LEANN for different use cases and understand the trade-offs between various configuration options.

## Getting Started: Simple is Better

When first trying LEANN, start with a small dataset to quickly validate your approach. Use the default `data/` directory, which contains just a few files, so you can test the full pipeline in minutes rather than hours.

```bash
# Quick test with minimal data
python -m apps.document_rag --max-items 100 --query "What techniques does LEANN use?"
```

Once validated, scale up gradually:
- 100 documents → 1,000 → 10,000 → full dataset
- This helps identify issues early before committing to long processing times
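
As an illustrative sketch of that ladder (the step sizes are arbitrary, and it assumes each run indexes at most `--max-items` documents), you can simply rerun the same command with a growing limit:

```bash
# Illustrative scaling ladder; each run caps the number of items indexed
for n in 100 1000 10000; do
    python -m apps.document_rag --max-items "$n" \
        --query "What techniques does LEANN use?"
done
```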

## Embedding Model Selection: Understanding the Trade-offs

Based on our experience developing LEANN, embedding models fall into three categories:

### Small Models (384-768 dims)
**Example**: `sentence-transformers/all-MiniLM-L6-v2`
- **Pros**: Fast inference (10-50ms, 384 dims), good for real-time applications
- **Cons**: Lower semantic understanding, may miss nuanced relationships
- **Use when**: Speed is critical, handling simple queries

### Medium Models (768-1024 dims)
**Example**: `facebook/contriever`
- **Pros**: Balanced performance, good multilingual support, reasonable speed
- **Cons**: Requires more compute than small models
- **Use when**: You need quality results without extreme compute requirements

### Large Models (1024+ dims)
**Example**: `Qwen/Qwen3-Embedding`
- **Pros**: Best semantic understanding, captures complex relationships, excellent multilingual support
- **Cons**: Slow inference, high memory usage, may overfit on small datasets
- **Use when**: Quality is paramount and you have sufficient compute
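
In practice the tier you pick is just a different value for `--embedding-model`. A minimal sketch using the example models above (everything else left at defaults):

```bash
# Small: fastest, 384-dim vectors
python -m apps.document_rag --embedding-model sentence-transformers/all-MiniLM-L6-v2

# Medium: balanced quality and speed
python -m apps.document_rag --embedding-model facebook/contriever

# Large: best semantic quality, slowest inference
python -m apps.document_rag --embedding-model Qwen/Qwen3-Embedding
```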

### Cloud vs Local Trade-offs

**OpenAI Embeddings** (`text-embedding-3-small/large`)
- **Pros**: No local compute needed, consistently fast, high quality
- **Cons**: Requires API key, costs money, data leaves your system, [known limitations with certain languages](https://yichuan-w.github.io/blog/lessons_learned_in_dev_leann/)
- **When to use**: Prototyping, non-sensitive data, need immediate results

**Local Embeddings**
- **Pros**: Complete privacy, no ongoing costs, full control
- **Cons**: Requires GPU for good performance, setup complexity
- **When to use**: Production systems, sensitive data, cost-sensitive applications
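
A minimal sketch of the two options. The `OPENAI_API_KEY` variable is the standard OpenAI convention and an assumption about how the key is picked up here, so check the main README if your setup differs:

```bash
# Cloud: OpenAI embeddings (data leaves your machine; needs an API key)
export OPENAI_API_KEY=sk-...   # assumption: standard OpenAI environment variable
python -m apps.document_rag --embedding-model text-embedding-3-small

# Local: everything stays on your machine
python -m apps.document_rag --embedding-model facebook/contriever
```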

## Index Selection: Matching Your Scale

### HNSW (Hierarchical Navigable Small World)
**Best for**: Small to medium datasets (< 10M vectors)
- Fast search (1-10ms latency)
- Full recomputation required (no double-queue optimization)
- High memory usage during build phase
- Excellent recall (95%+)

```bash
# Optimal for most use cases
--backend-name hnsw --graph-degree 32 --build-complexity 64
```

### DiskANN
**Best for**: Large datasets (> 10M vectors, 10GB+ index size)
- Uses Product Quantization (PQ) for coarse filtering in the double-queue architecture
- Extremely fast search through selective recomputation

```bash
# For billion-scale deployments
--backend-name diskann --graph-degree 64 --build-complexity 128
```

## LLM Selection: Engine and Model Comparison

### LLM Engines

**OpenAI** (`--llm openai`)
- **Pros**: Best quality, consistent performance, no local resources needed
- **Cons**: Costs money ($0.15-2.5 per million tokens), requires internet, data privacy concerns
- **Models**: `gpt-4o-mini` (fast, cheap), `gpt-4o` (best quality), `o3-mini` (reasoning at moderate cost)

**Ollama** (`--llm ollama`)
- **Pros**: Fully local, free, privacy-preserving, good model variety
- **Cons**: Requires local GPU/CPU resources, slower than cloud
- **Models**: `qwen3:1.7b` (best general quality), `deepseek-r1:1.5b` (reasoning)

**HuggingFace** (`--llm hf`)
- **Pros**: Free tier available, huge model selection, direct model loading (vs Ollama's server-based approach)
- **Cons**: API rate limits, local mode needs significant resources, more complex setup
- **Models**: `Qwen/Qwen3-1.7B-FP8`
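
Each engine is selected with the same pair of flags. A quick sketch using the models listed above (replace the query with your own):

```bash
# OpenAI (cloud)
python -m apps.document_rag --llm openai --llm-model gpt-4o-mini --query "..."

# Ollama (local server)
python -m apps.document_rag --llm ollama --llm-model qwen3:1.7b --query "..."

# HuggingFace (direct model loading)
python -m apps.document_rag --llm hf --llm-model Qwen/Qwen3-1.7B-FP8 --query "..."
```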

### Model Size Trade-offs

| Model Size | Speed | Quality | Memory | Use Case |
|------------|-------|---------|---------|----------|
| 1B params | 50-100 tok/s | Basic | 2-4GB | Quick answers, simple queries |
| 3B params | 20-50 tok/s | Good | 4-8GB | General purpose RAG |
| 7B params | 10-20 tok/s | Excellent | 8-16GB | Complex reasoning |
| 13B+ params | 5-10 tok/s | Best | 16-32GB+ | Research, detailed analysis |

## Parameter Tuning Guide

### Search Complexity Parameters

**`--build-complexity`** (index building)
- Controls thoroughness during index construction
- Higher = better recall but slower build
- Recommendations:
  - 32: Quick prototyping
  - 64: Balanced (default)
  - 128: Production systems
  - 256: Maximum quality

**`--search-complexity`** (query time)
- Controls search thoroughness
- Higher = better results but slower
- Recommendations:
  - 16: Fast/interactive search (500-1000ms on consumer hardware)
  - 32: High quality with diversity (1000-2000ms)
  - 64+: Maximum accuracy (2000ms+)
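
Putting the two knobs together; these pairings are only illustrative starting points, not measured optima:

```bash
# Quick prototyping: cheap build, fast queries
--build-complexity 32 --search-complexity 16

# Production: thorough build, higher-quality search
--build-complexity 128 --search-complexity 64
```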

### Top-K Selection

**`--top-k`** (number of retrieved chunks)
- More chunks = better context but slower LLM processing
- Should always be smaller than `--search-complexity`
- Guidelines:
  - 3-5: Simple factual queries
  - 5-10: General questions (default)
  - 10+: Complex multi-hop reasoning

**Trade-off formula**:
- Retrieval time ∝ log(n) × search_complexity
- LLM processing time ∝ top_k × chunk_size
- Total context = top_k × chunk_size tokens
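
As a worked example of the context-size formula (token counts are approximate, since chunking is never exact):

```bash
# --top-k 10 with --chunk-size 256 feeds roughly
#   10 × 256 = 2,560 tokens of retrieved context to the LLM,
# and --search-complexity must stay above top-k, e.g.:
--top-k 10 --chunk-size 256 --search-complexity 32
```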

### Graph Degree (HNSW/DiskANN)

**`--graph-degree`**
- Number of connections per node in the graph
- Higher = better recall but more memory
- HNSW: 16-32 (default: 32)
- DiskANN: 32-128 (default: 64)

## Common Configurations by Use Case

### 1. Quick Experimentation
```bash
python -m apps.document_rag \
  --max-items 1000 \
  --embedding-model sentence-transformers/all-MiniLM-L6-v2 \
  --backend-name hnsw \
  --llm ollama --llm-model llama3.2:1b
```

### 2. Personal Knowledge Base
```bash
python -m apps.document_rag \
  --embedding-model facebook/contriever \
  --chunk-size 512 --chunk-overlap 128 \
  --backend-name hnsw \
  --llm ollama --llm-model llama3.2:3b
```

### 3. Production RAG System
```bash
python -m apps.document_rag \
  --embedding-model BAAI/bge-base-en-v1.5 \
  --chunk-size 256 --chunk-overlap 64 \
  --backend-name diskann \
  --llm openai --llm-model gpt-4o-mini \
  --top-k 20 --search-complexity 64
```

### 4. Multilingual Support (e.g., WeChat)
```bash
python -m apps.wechat_rag \
  --embedding-model intfloat/multilingual-e5-base \
  --chunk-size 192 --chunk-overlap 48 \
  --backend-name hnsw \
  --llm ollama --llm-model qwen3:8b
```

## Performance Optimization Checklist

### If Embedding is Too Slow

1. **Switch to a smaller model**:
   ```bash
   # From a large model
   --embedding-model Qwen/Qwen3-Embedding
   # To a small model
   --embedding-model sentence-transformers/all-MiniLM-L6-v2
   ```

2. **Use MLX on Apple Silicon**:
   ```bash
   --embedding-mode mlx --embedding-model mlx-community/multilingual-e5-base-mlx
   ```

3. **Process in batches**:
   ```bash
   --max-items 10000  # Process incrementally
   ```

### If Search Quality is Poor

1. **Increase retrieval count**:
   ```bash
   --top-k 30  # Retrieve more candidates
   ```

2. **Tune chunk size for your content** (see the sketch after this list):
   - Technical docs: `--chunk-size 512`
   - Chat messages: `--chunk-size 128`
   - Mixed content: `--chunk-size 256`

3. **Upgrade the embedding model**:
   ```bash
   # For English
   --embedding-model BAAI/bge-base-en-v1.5
   # For multilingual
   --embedding-model intfloat/multilingual-e5-large
   ```
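
A hedged pairing of chunk size and overlap per content type; the overlap values simply follow the roughly 25% ratio used in the use-case examples above and are only starting points:

```bash
# Technical docs: long, self-contained passages
--chunk-size 512 --chunk-overlap 128

# Chat messages: short, fragmented turns
--chunk-size 128 --chunk-overlap 32

# Mixed content: middle ground
--chunk-size 256 --chunk-overlap 64
```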

## Understanding the Trade-offs

Every configuration choice involves trade-offs:

| Factor | Small/Fast | Large/Quality |
|--------|------------|---------------|
| Embedding Model | all-MiniLM-L6-v2 | BAAI/bge-large |
| Chunk Size | 128 tokens | 512 tokens |
| Index Type | HNSW | DiskANN |
| LLM | llama3.2:1b | gpt-4o |

The key is finding the right balance for your specific use case. Start small and simple, measure performance, then scale up only where needed.

## Deep Dive: Critical Configuration Decisions

### When to Disable Recomputation

LEANN's recomputation feature provides exact distance calculations but can be disabled for extreme QPS requirements:

```bash
--no-recompute  # Disable selective recomputation
```

**Trade-offs**:
- **With recomputation** (default): Exact distances, best quality, higher latency
- **Without recomputation**: Approximate distances via PQ, 2-5x faster, significantly lower memory and storage usage

**Disable when**:
- QPS requirements > 1000/sec
- Slight accuracy loss is acceptable
- Running on resource-constrained hardware

## Performance Monitoring

Key metrics to watch:
- Index build time
- Query latency (p50, p95, p99)
- Memory usage during build and search
- Disk I/O patterns (for DiskANN)
- Recomputation ratio (% of candidates recomputed)
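
For a quick, coarse end-to-end check of build time and query latency, plain shell timing works; note it measures the whole process (model loading included), not just the index search:

```bash
# Wall-clock time for a full run; on Linux, /usr/bin/time -v also reports peak memory
time python -m apps.document_rag --max-items 1000 --query "What techniques does LEANN use?"
```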

## Further Reading

- [Lessons Learned Developing LEANN](https://yichuan-w.github.io/blog/lessons_learned_in_dev_leann/)
- [LEANN Technical Paper](https://arxiv.org/abs/2506.08276)
- [DiskANN Original Paper](https://papers.nips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf)
Submodule packages/leann-backend-diskann/third_party/DiskANN updated: af2a26481e...67a2611ad1