# LEANN Configuration Guide
This guide helps you optimize LEANN for different use cases and understand the trade-offs between various configuration options.
## Getting Started: Simple is Better
When first trying LEANN, start with a small dataset to quickly validate your approach. Use the default data/ directory which contains just a few files - this lets you test the full pipeline in minutes rather than hours.
```bash
# Quick test with minimal data
python -m apps.document_rag --max-items 100 --query "What techniques does LEANN use?"
```
Once validated, scale up gradually:
- 100 documents → 1,000 → 10,000 → full dataset
- This helps identify issues early before committing to long processing times
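One way to script this ramp-up, as a sketch that reuses only the flags shown above:

```bash
# Rebuild and query with progressively larger slices of the corpus
for n in 100 1000 10000; do
  python -m apps.document_rag --max-items "$n" \
    --query "What techniques does LEANN use?"
done
```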
## Embedding Model Selection: Understanding the Trade-offs
Based on our experience developing LEANN, embedding models fall into three categories:
### Small Models (384-768 dims)
Example: `sentence-transformers/all-MiniLM-L6-v2`
- Pros: Fast inference (10-50ms, 384 dims), good for real-time applications
- Cons: Lower semantic understanding, may miss nuanced relationships
- Use when: Speed is critical, handling simple queries
### Medium Models (768-1024 dims)
Example: `facebook/contriever`
- Pros: Balanced performance, good multilingual support, reasonable speed
- Cons: Requires more compute than small models
- Use when: Need quality results without extreme compute requirements
### Large Models (1024+ dims)
Example: `Qwen/Qwen3-Embedding`
- Pros: Best semantic understanding, captures complex relationships, excellent multilingual support
- Cons: Slow inference, high memory usage, may overfit on small datasets
- Use when: Quality is paramount and you have sufficient compute
### Cloud vs Local Trade-offs
#### OpenAI Embeddings (text-embedding-3-small/large)
- Pros: No local compute needed, consistently fast, high quality
- Cons: Requires API key, costs money, data leaves your system, known limitations with certain languages
- When to use: Prototyping, non-sensitive data, need immediate results
#### Local Embeddings
- Pros: Complete privacy, no ongoing costs, full control
- Cons: Requires GPU for good performance, setup complexity
- When to use: Production systems, sensitive data, cost-sensitive applications
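As a command-line sketch of the two options (the `--embedding-mode openai` value is an assumption by analogy with the `--embedding-mode mlx` flag shown later in this guide; check your LEANN version for the exact spelling):

```bash
# Cloud embeddings (assumed mode value; requires an OpenAI API key)
python -m apps.document_rag \
  --embedding-mode openai --embedding-model text-embedding-3-small

# Local embeddings: data never leaves your machine
python -m apps.document_rag \
  --embedding-model facebook/contriever
```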
## Index Selection: Matching Your Scale
### HNSW (Hierarchical Navigable Small World)
Best for: Small to medium datasets (< 10M vectors)
- Fast search (1-10ms latency)
- Full recomputation required (no double queue optimization)
- High memory usage during build phase
- Excellent recall (95%+)
```bash
# Optimal for most use cases
--backend-name hnsw --graph-degree 32 --build-complexity 64
```
### DiskANN
Best for: Large datasets (> 10M vectors, 10GB+ index size)
- Uses Product Quantization (PQ) for coarse filtering in double queue architecture
- Extremely fast search through selective recomputation
```bash
# For billion-scale deployments
--backend-name diskann --graph-degree 64 --build-complexity 128
```
## LLM Selection: Engine and Model Comparison
### LLM Engines
#### OpenAI (`--llm openai`)
- Pros: Best quality, consistent performance, no local resources needed
- Cons: Costs money ($0.15-$2.50 per million tokens), requires internet, data privacy concerns
- Models: `gpt-4o-mini` (fast, cheap), `gpt-4o` (best quality), `o3-mini` (reasoning, moderately priced)
#### Ollama (`--llm ollama`)
- Pros: Fully local, free, privacy-preserving, good model variety
- Cons: Requires local GPU/CPU resources, slower than cloud
- Models: `qwen3:1.7b` (best general quality), `deepseek-r1:1.5b` (reasoning)
#### HuggingFace (`--llm hf`)
- Pros: Free tier available, huge model selection, direct model loading (vs Ollama's server-based approach)
- Cons: API rate limits, local mode needs significant resources, more complex setup
- Models: `Qwen/Qwen3-1.7B-FP8`
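The same app can be pointed at any of the three engines; a sketch using model names from the lists above:

```bash
# Cloud: highest quality, needs an OpenAI API key
python -m apps.document_rag --llm openai --llm-model gpt-4o-mini

# Local via Ollama: free and private (pull the model first)
python -m apps.document_rag --llm ollama --llm-model qwen3:1.7b

# Local via HuggingFace: direct model loading
python -m apps.document_rag --llm hf --llm-model Qwen/Qwen3-1.7B-FP8
```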
### Model Size Trade-offs
| Model Size | Speed | Quality | Memory | Use Case |
|---|---|---|---|---|
| 1B params | 50-100 tok/s | Basic | 2-4GB | Quick answers, simple queries |
| 3B params | 20-50 tok/s | Good | 4-8GB | General purpose RAG |
| 7B params | 10-20 tok/s | Excellent | 8-16GB | Complex reasoning |
| 13B+ params | 5-10 tok/s | Best | 16-32GB+ | Research, detailed analysis |
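With Ollama, the size tier is simply part of the model tag; for example, moving from the 1B to the 3B tier (model tags as used in the configurations below):

```bash
# Fetch the ~3B model, then point the RAG app at it
ollama pull llama3.2:3b
python -m apps.document_rag --llm ollama --llm-model llama3.2:3b
```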
## Parameter Tuning Guide
### Search Complexity Parameters
#### `--build-complexity` (index building)
- Controls thoroughness during index construction
- Higher = better recall but slower build
- Recommendations:
- 32: Quick prototyping
- 64: Balanced (default)
- 128: Production systems
- 256: Maximum quality
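For instance, a production-oriented build trades a slower one-time construction for better recall on every later query:

```bash
# Slower build, higher-quality graph
python -m apps.document_rag --backend-name hnsw --build-complexity 128
```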
#### `--search-complexity` (query time)
- Controls search thoroughness
- Higher = better results but slower
- Recommendations:
- 16: Fast/Interactive search (500-1000ms on consumer hardware)
- 32: High quality with diversity (1000-2000ms)
- 64+: Maximum accuracy (2000ms+)
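For example, an interactive setup that favors latency:

```bash
# Fast search; keep --top-k below --search-complexity (see next section)
python -m apps.document_rag --search-complexity 16 --top-k 5
```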
### Top-K Selection
#### `--top-k` (number of retrieved chunks)
- More chunks = better context but slower LLM processing
- Should always be smaller than `--search-complexity`
- Guidelines:
- 3-5: Simple factual queries
- 5-10: General questions (default)
- 10+: Complex multi-hop reasoning
Trade-off formula:
- Retrieval time ∝ log(n) × search_complexity
- LLM processing time ∝ top_k × chunk_size
- Total context = top_k × chunk_size tokens
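For example, with `--top-k 10` and `--chunk-size 256`, the retrieved context alone is roughly 10 × 256 = 2,560 tokens per query, before the prompt template and the question itself are added.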
### Graph Degree (HNSW/DiskANN)
#### `--graph-degree`
- Number of connections per node in the graph
- Higher = better recall but more memory
- HNSW: 16-32 (default: 32)
- DiskANN: 32-128 (default: 64)
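For example:

```bash
# Denser graph: better recall at the cost of memory
python -m apps.document_rag --backend-name diskann --graph-degree 64
```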
## Common Configurations by Use Case
### 1. Quick Experimentation

```bash
python -m apps.document_rag \
  --max-items 1000 \
  --embedding-model sentence-transformers/all-MiniLM-L6-v2 \
  --backend-name hnsw \
  --llm ollama --llm-model llama3.2:1b
```
### 2. Personal Knowledge Base

```bash
python -m apps.document_rag \
  --embedding-model facebook/contriever \
  --chunk-size 512 --chunk-overlap 128 \
  --backend-name hnsw \
  --llm ollama --llm-model llama3.2:3b
```
### 3. Production RAG System

```bash
python -m apps.document_rag \
  --embedding-model BAAI/bge-base-en-v1.5 \
  --chunk-size 256 --chunk-overlap 64 \
  --backend-name diskann \
  --llm openai --llm-model gpt-4o-mini \
  --top-k 20 --search-complexity 64
```
### 4. Multi-lingual Support (e.g., WeChat)

```bash
python -m apps.wechat_rag \
  --embedding-model intfloat/multilingual-e5-base \
  --chunk-size 192 --chunk-overlap 48 \
  --backend-name hnsw \
  --llm ollama --llm-model qwen3:8b
```
## Performance Optimization Checklist
### If Embedding is Too Slow
- Switch to a smaller model:

  ```bash
  # From large model
  --embedding-model Qwen/Qwen3-Embedding
  # To small model
  --embedding-model sentence-transformers/all-MiniLM-L6-v2
  ```

- Use MLX on Apple Silicon:

  ```bash
  --embedding-mode mlx --embedding-model mlx-community/multilingual-e5-base-mlx
  ```

- Process in batches:

  ```bash
  --max-items 10000  # Process incrementally
  ```
### If Search Quality is Poor
- Increase retrieval count:

  ```bash
  --top-k 30  # Retrieve more candidates
  ```

- Tune chunk size for your content:
  - Technical docs: `--chunk-size 512`
  - Chat messages: `--chunk-size 128`
  - Mixed content: `--chunk-size 256`

- Upgrade embedding model:

  ```bash
  # For English
  --embedding-model BAAI/bge-base-en-v1.5
  # For multilingual
  --embedding-model intfloat/multilingual-e5-large
  ```
## Understanding the Trade-offs
Every configuration choice involves trade-offs:
| Factor | Small/Fast | Large/Quality |
|---|---|---|
| Embedding Model | all-MiniLM-L6-v2 | BAAI/bge-large |
| Chunk Size | 128 tokens | 512 tokens |
| Index Type | HNSW | DiskANN |
| LLM | llama3.2:1b | gpt-4o |
The key is finding the right balance for your specific use case. Start small and simple, measure performance, then scale up only where needed.
## Deep Dive: Critical Configuration Decisions
### When to Disable Recomputation
LEANN's recomputation feature provides exact distance calculations but can be disabled for extreme QPS requirements:
```bash
--no-recompute  # Disable selective recomputation
```
Trade-offs:
- With recomputation (default): Exact distances, best quality, higher latency
- Without recomputation: Approximate distances via PQ, 2-5x faster, significantly lower memory and storage usage
Disable when:
- QPS requirements > 1000/sec
- Slight accuracy loss is acceptable
- Running on resource-constrained hardware
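Putting it together, a throughput-oriented invocation might look like this sketch (all flags appear earlier in this guide):

```bash
# PQ-approximate distances: 2-5x faster, slightly lower accuracy
python -m apps.document_rag \
  --backend-name diskann \
  --no-recompute \
  --search-complexity 16
```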
### Performance Monitoring
Key metrics to watch:
- Index build time
- Query latency (p50, p95, p99)
- Memory usage during build and search
- Disk I/O patterns (for DiskANN)
- Recomputation ratio (% of candidates recomputed)
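A crude way to sample query-latency percentiles, assuming GNU `time` is installed (this measures whole-process wall time, including model and index loading, so treat the numbers as an upper bound on search latency):

```bash
# Run the same query 20 times, appending elapsed seconds to a file
for i in $(seq 1 20); do
  /usr/bin/time -f "%e" -a -o latencies.txt \
    python -m apps.document_rag --query "What techniques does LEANN use?" \
    > /dev/null
done

# Rough p50/p95 from the sorted samples
sort -n latencies.txt | awk '{a[NR]=$1} END {print "p50:", a[int(NR*0.5)]; print "p95:", a[int(NR*0.95)]}'
```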