LEANN Configuration Guide

This guide helps you optimize LEANN for different use cases and understand the trade-offs between various configuration options.

Getting Started: Simple is Better

When first trying LEANN, start with a small dataset to quickly validate your approach. Use the default data/ directory, which contains just a few files, so you can test the full pipeline in minutes rather than hours.

# Quick test with minimal data
python -m apps.document_rag --max-items 100 --query "What techniques does LEANN use?"

Once validated, scale up gradually:

  • 100 documents → 1,000 → 10,000 → full dataset
  • This helps identify issues early before committing to long processing times

Embedding Model Selection: Understanding the Trade-offs

Based on our experience developing LEANN, embedding models fall into three categories:

Small Models (384-768 dims)

Example: sentence-transformers/all-MiniLM-L6-v2

  • Pros: Fast inference (10-50ms, 384 dims), good for real-time applications
  • Cons: Lower semantic understanding, may miss nuanced relationships
  • Use when: Speed is critical or queries are simple

Medium Models (768-1024 dims)

Example: facebook/contriever

  • Pros: Balanced performance, good multilingual support, reasonable speed
  • Cons: Requires more compute than small models
  • Use when: Need quality results without extreme compute requirements

Large Models (1024+ dims)

Example: Qwen/Qwen3-Embedding

  • Pros: Best semantic understanding, captures complex relationships, excellent multilingual support
  • Cons: Slow inference, high memory usage, may overfit on small datasets
  • Use when: Quality is paramount and you have sufficient compute
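
If you are unsure which category fits your hardware, a quick local benchmark settles it. The sketch below assumes the sentence-transformers package is installed; it prints the embedding dimension and per-text latency for the small model above, and you can swap in facebook/contriever or a larger model to compare.

```python
# Minimal embedding benchmark sketch (assumes `pip install sentence-transformers`).
from time import perf_counter

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = ["What techniques does LEANN use?"] * 64  # small batch of identical queries

start = perf_counter()
embeddings = model.encode(texts, batch_size=32)
elapsed = perf_counter() - start

print(f"dims: {embeddings.shape[1]}")                   # 384 for all-MiniLM-L6-v2
print(f"latency: {elapsed / len(texts) * 1000:.1f} ms per text")
```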

Cloud vs Local Trade-offs

OpenAI Embeddings (text-embedding-3-small/large)

  • Pros: No local compute needed, consistently fast, high quality
  • Cons: Requires API key, costs money, data leaves your system, known limitations with certain languages
  • When to use: Prototyping, non-sensitive data, or when you need results immediately

Local Embeddings

  • Pros: Complete privacy, no ongoing costs, full control
  • Cons: Requires a GPU for good performance, more complex setup
  • When to use: Production systems, sensitive data, cost-sensitive applications

Index Selection: Matching Your Scale

HNSW (Hierarchical Navigable Small World)

Best for: Small to medium datasets (< 10M vectors)

  • Fast search (1-10ms latency)
  • Full recomputation required (no double queue optimization)
  • High memory usage during build phase
  • Excellent recall (95%+)
# Optimal for most use cases
--backend-name hnsw --graph-degree 32 --build-complexity 64

DiskANN

Best for: Large datasets (> 10M vectors, 10GB+ index size)

  • Uses Product Quantization (PQ) for coarse filtering in its double-queue architecture
  • Extremely fast search through selective recomputation
# For billion-scale deployments
--backend-name diskann --graph-degree 64 --build-complexity 128
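
The idea behind the double queue is easiest to see in miniature. The sketch below is a conceptual illustration, not LEANN's actual implementation: approximate distances against a compressed representation (here, a rounded copy standing in for real PQ codes) select a shortlist, and exact distances are recomputed only for that shortlist.

```python
# Conceptual sketch of PQ-style coarse filtering + selective recomputation.
# Not LEANN's implementation; the rounded copy stands in for real PQ codes.
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, 128)).astype(np.float32)  # exact vectors
compressed = vectors.round(1)                                     # stand-in for PQ codes

def search(query, recompute_k=64, top_k=10):
    # Coarse pass: cheap approximate distances over the compressed copy.
    approx = np.linalg.norm(compressed - query, axis=1)
    shortlist = np.argsort(approx)[:recompute_k]
    # Fine pass: exact distances recomputed only for the shortlist.
    exact = np.linalg.norm(vectors[shortlist] - query, axis=1)
    return shortlist[np.argsort(exact)[:top_k]]

print(search(rng.standard_normal(128).astype(np.float32)))
```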

LLM Selection: Engine and Model Comparison

LLM Engines

OpenAI (--llm openai)

  • Pros: Best quality, consistent performance, no local resources needed
  • Cons: Costs money ($0.15-2.5 per million tokens), requires internet, data privacy concerns
  • Models: gpt-4o-mini (fast, cheap), gpt-4o (best quality), o3-mini (reasoning, moderately priced)

Ollama (--llm ollama)

  • Pros: Fully local, free, privacy-preserving, good model variety
  • Cons: Requires local GPU/CPU resources, slower than cloud
  • Models: qwen3:1.7b (best general quality), deepseek-r1:1.5b (reasoning)

HuggingFace (--llm hf)

  • Pros: Free tier available, huge model selection, direct model loading (vs Ollama's server-based approach)
  • Cons: API rate limits, local mode needs significant resources, more complex setup
  • Models: Qwen/Qwen3-1.7B-FP8

Model Size Trade-offs

| Model Size | Speed | Quality | Memory | Use Case |
|---|---|---|---|---|
| 1B params | 50-100 tok/s | Basic | 2-4GB | Quick answers, simple queries |
| 3B params | 20-50 tok/s | Good | 4-8GB | General purpose RAG |
| 7B params | 10-20 tok/s | Excellent | 8-16GB | Complex reasoning |
| 13B+ params | 5-10 tok/s | Best | 16-32GB+ | Research, detailed analysis |
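
The memory column follows from simple arithmetic: weight memory is roughly parameter count times bytes per parameter, before the KV cache and activations are added. A rough sketch:

```python
# Rough lower bound on LLM weight memory: params x bytes per parameter.
# Real usage is higher once the KV cache and activations are included.
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """2 bytes/param ~ fp16/bf16; 1 byte/param ~ 8-bit quantization."""
    return params_billions * bytes_per_param

for size in (1, 3, 7, 13):
    print(f"{size}B params: ~{weight_memory_gb(size):.0f} GB fp16, "
          f"~{weight_memory_gb(size, 1.0):.0f} GB 8-bit")
```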

Parameter Tuning Guide

Search Complexity Parameters

--build-complexity (index building)

  • Controls thoroughness during index construction
  • Higher = better recall but slower build
  • Recommendations:
    • 32: Quick prototyping
    • 64: Balanced (default)
    • 128: Production systems
    • 256: Maximum quality

--search-complexity (query time)

  • Controls search thoroughness
  • Higher = better results but slower
  • Recommendations:
    • 16: Fast/Interactive search (500-1000ms on consumer hardware)
    • 32: High quality with diversity (1000-2000ms)
    • 64+: Maximum accuracy (2000ms+)

Top-K Selection

--top-k (number of retrieved chunks)

  • More chunks = better context but slower LLM processing
  • Should always be smaller than --search-complexity
  • Guidelines:
    • 3-5: Simple factual queries
    • 5-10: General questions (default)
    • 10+: Complex multi-hop reasoning

Trade-off formula:

  • Retrieval time ∝ log(n) × search_complexity
  • LLM processing time ∝ top_k × chunk_size
  • Total context = top_k × chunk_size tokens
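
A quick budget check makes the last relationship concrete. With a hypothetical 512-token chunk size and the illustrative settings below, the retrieved context alone is 5,120 tokens, which is why large --top-k values mostly show up as slower LLM processing rather than slower retrieval:

```python
# Worked example of the context-size relationship above (values are illustrative).
top_k = 10             # retrieved chunks
chunk_size = 512       # tokens per chunk
context_window = 8192  # assumed LLM context window

context_tokens = top_k * chunk_size
print(f"retrieved context: {context_tokens} tokens")  # 5120

# Leave headroom for the prompt template and the generated answer.
headroom = 1024
if context_tokens > context_window - headroom:
    print("reduce top_k or chunk_size")
```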

Graph Degree (HNSW/DiskANN)

--graph-degree

  • Number of connections per node in the graph
  • Higher = better recall but more memory
  • HNSW: 16-32 (default: 32)
  • DiskANN: 32-128 (default: 64)
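
To see why higher degrees cost memory, note that each node stores one neighbor id per edge on top of everything else in the index. The estimate below covers only the neighbor lists (the part --graph-degree controls) and assumes 4-byte ids; vectors, extra layers, and metadata are not included, so treat it as a ballpark.

```python
# Ballpark for the graph's neighbor lists alone (what --graph-degree controls),
# assuming 4-byte node ids; vectors, layers, and metadata are extra.
def graph_memory_gb(num_vectors: int, graph_degree: int) -> float:
    return num_vectors * graph_degree * 4 / 1e9

print(f"{graph_memory_gb(1_000_000, 32):.2f} GB   # 1M nodes, degree 32")
print(f"{graph_memory_gb(10_000_000, 64):.2f} GB  # 10M nodes, degree 64")
```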

Common Configurations by Use Case

1. Quick Experimentation

python -m apps.document_rag \
  --max-items 1000 \
  --embedding-model sentence-transformers/all-MiniLM-L6-v2 \
  --backend-name hnsw \
  --llm ollama --llm-model llama3.2:1b

2. Personal Knowledge Base

python -m apps.document_rag \
  --embedding-model facebook/contriever \
  --chunk-size 512 --chunk-overlap 128 \
  --backend-name hnsw \
  --llm ollama --llm-model llama3.2:3b

3. Production RAG System

python -m apps.document_rag \
  --embedding-model BAAI/bge-base-en-v1.5 \
  --chunk-size 256 --chunk-overlap 64 \
  --backend-name diskann \
  --llm openai --llm-model gpt-4o-mini \
  --top-k 20 --search-complexity 64

4. Multi-lingual Support (e.g., WeChat)

python -m apps.wechat_rag \
  --embedding-model intfloat/multilingual-e5-base \
  --chunk-size 192 --chunk-overlap 48 \
  --backend-name hnsw \
  --llm ollama --llm-model qwen3:8b

Performance Optimization Checklist

If Embedding is Too Slow

  1. Switch to smaller model:

    # From large model
    --embedding-model Qwen/Qwen3-Embedding
    # To small model
    --embedding-model sentence-transformers/all-MiniLM-L6-v2
    
  2. Use MLX on Apple Silicon:

    --embedding-mode mlx --embedding-model mlx-community/multilingual-e5-base-mlx
    
  3. Process in batches:

    --max-items 10000  # Process incrementally
    

If Search Quality is Poor

  1. Increase retrieval count:

    --top-k 30  # Retrieve more candidates
    
  2. Tune chunk size for your content:

    • Technical docs: --chunk-size 512
    • Chat messages: --chunk-size 128
    • Mixed content: --chunk-size 256
  3. Upgrade embedding model:

    # For English
    --embedding-model BAAI/bge-base-en-v1.5
    # For multilingual
    --embedding-model intfloat/multilingual-e5-large
    

Understanding the Trade-offs

Every configuration choice involves trade-offs:

| Factor | Small/Fast | Large/Quality |
|---|---|---|
| Embedding Model | all-MiniLM-L6-v2 | BAAI/bge-large |
| Chunk Size | 128 tokens | 512 tokens |
| Index Type | HNSW | DiskANN |
| LLM | llama3.2:1b | gpt-4o |

The key is finding the right balance for your specific use case. Start small and simple, measure performance, then scale up only where needed.

Deep Dive: Critical Configuration Decisions

When to Disable Recomputation

LEANN's recomputation feature provides exact distance calculations but can be disabled for extreme QPS requirements:

--no-recompute  # Disable selective recomputation

Trade-offs:

  • With recomputation (default): Exact distances, best quality, higher latency
  • Without recomputation: Approximate distances via PQ, 2-5x faster, significantly lower memory and storage usage

Disable when:

  • QPS requirements > 1000/sec
  • Slight accuracy loss is acceptable
  • Running on resource-constrained hardware
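
If you want to sanity-check whether the accuracy loss matters for your data, a small offline comparison is enough. The sketch below is illustrative (it uses a rounded copy of the vectors as a stand-in for PQ codes) and measures how much of the exact top-10 survives when ranking by approximate distances only:

```python
# Estimate recall loss from skipping recomputation (stand-in quantizer: rounding).
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.standard_normal((5_000, 128)).astype(np.float32)
compressed = vectors.round(1)
queries = rng.standard_normal((50, 128)).astype(np.float32)

def topk(base, q, k=10):
    return set(np.argsort(np.linalg.norm(base - q, axis=1))[:k])

recalls = [len(topk(vectors, q) & topk(compressed, q)) / 10 for q in queries]
print(f"approx-only recall@10: {np.mean(recalls):.2f}")
```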

Performance Monitoring

Key metrics to watch:

  • Index build time
  • Query latency (p50, p95, p99)
  • Memory usage during build and search
  • Disk I/O patterns (for DiskANN)
  • Recomputation ratio (% of candidates recomputed)
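
Query latency percentiles are straightforward to collect with the standard library plus numpy; the sketch below wraps any search callable you already have (names are illustrative, not a LEANN API):

```python
# Illustrative latency monitoring helper; `search_fn` is any callable you time.
from time import perf_counter

import numpy as np

def measure_latency(search_fn, queries):
    samples = []
    for q in queries:
        start = perf_counter()
        search_fn(q)
        samples.append((perf_counter() - start) * 1000)  # milliseconds
    p50, p95, p99 = np.percentile(samples, [50, 95, 99])
    print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")

# Example: measure_latency(lambda q: my_index.search(q), my_queries)
```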

Further Reading