LEANN Configuration Guide

This guide helps you optimize LEANN for different use cases and understand the trade-offs between various configuration options.

Getting Started: Simple is Better

When first trying LEANN, start with a small dataset to quickly validate your approach:

For document RAG: The default data/ directory works out of the box. It includes two AI research papers, the novel Pride and Prejudice, and a technical report

python -m apps.document_rag --query "What techniques does LEANN use?"

For other data sources: Limit the dataset size for quick testing

# WeChat: Test with recent messages only
python -m apps.wechat_rag --max-items 100 --query "What did we discuss about the project timeline?"

# Browser history: Last few days
python -m apps.browser_rag --max-items 500 --query "Find documentation about vector databases"

# Email: Recent inbox
python -m apps.email_rag --max-items 200 --query "Who sent updates about the deployment status?"

Once validated, scale up gradually:

  • 100 documents → 1,000 → 10,000 → full dataset (--max-items -1)
  • This helps identify issues early before committing to long processing times
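To make the progression concrete, here is an illustrative sequence using the browser history reader from above (the query is just an example; substitute the app module for your data source):

# Smoke test on a small slice
python -m apps.browser_rag --max-items 100 --query "Find documentation about vector databases"

# Medium run to check quality and timing
python -m apps.browser_rag --max-items 1000 --query "Find documentation about vector databases"

# Full dataset once the pipeline looks right
python -m apps.browser_rag --max-items -1 --query "Find documentation about vector databases"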

Embedding Model Selection: Understanding the Trade-offs

Based on our experience developing LEANN, embedding models fall into three categories:

Small Models (< 100M parameters)

Example: sentence-transformers/all-MiniLM-L6-v2 (22M params)

  • Pros: Lightweight, fast for both indexing and inference
  • Cons: Lower semantic understanding, may miss nuanced relationships
  • Use when: Speed is critical, queries are simple, you are running in interactive mode, or you are just experimenting with LEANN. If time is not a constraint, prefer a larger, higher-quality embedding model

Medium Models (100M-500M parameters)

Examples: facebook/contriever (110M params), BAAI/bge-base-en-v1.5 (110M params)

  • Pros: Balanced performance, good multilingual support, reasonable speed
  • Cons: Requires more compute than small models
  • Use when: Need quality results without extreme compute requirements, general-purpose RAG applications

Large Models (500M+ parameters)

Examples: Qwen/Qwen3-Embedding-0.6B (600M params), intfloat/multilingual-e5-large (560M params)

  • Pros: Best semantic understanding, captures complex relationships, excellent multilingual support. Qwen3-Embedding-0.6B comes close to the OpenAI embedding API in quality
  • Cons: Slower inference, longer index build times
  • Use when: Quality is paramount and you have sufficient compute resources. Highly recommended for production use
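These tiers map directly onto the --embedding-model flag. An illustrative sketch using the example models cited above (pick one per run):

# Small: fastest indexing and inference
--embedding-model sentence-transformers/all-MiniLM-L6-v2

# Medium: balanced quality and speed
--embedding-model BAAI/bge-base-en-v1.5

# Large: best quality, recommended for production
--embedding-model Qwen/Qwen3-Embedding-0.6B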

Quick Start: Cloud and Local Embedding Options

OpenAI Embeddings (Fastest Setup)

For immediate testing without local model downloads:

# Set OpenAI embeddings (requires OPENAI_API_KEY)
--embedding-mode openai --embedding-model text-embedding-3-small

Ollama Embeddings (Privacy-Focused)

For local embeddings with complete privacy:

# First, pull an embedding model
ollama pull nomic-embed-text

# Use Ollama embeddings
--embedding-mode ollama --embedding-model nomic-embed-text

Cloud vs Local Trade-offs

OpenAI Embeddings (text-embedding-3-small/large)

  • Pros: No local compute needed, consistently fast, high quality
  • Cons: Requires API key, costs money, data leaves your system, known limitations with certain languages
  • When to use: Prototyping, non-sensitive data, need immediate results

Local Embeddings

  • Pros: Complete privacy, no ongoing costs, full control, can sometimes outperform OpenAI embeddings
  • Cons: Slower than cloud APIs, requires local compute resources
  • When to use: Production systems, sensitive data, cost-sensitive applications

Index Selection: Matching Your Scale

HNSW (Hierarchical Navigable Small World)

Best for: Small to medium datasets (< 10M vectors). The default backend, recommended when minimizing storage is the top priority

  • Full recomputation required
  • High memory usage during build phase
  • Excellent recall (95%+)

# Optimal for most use cases
--backend-name hnsw --graph-degree 32 --build-complexity 64

DiskANN

Best for: Performance-critical applications and large datasets. Production-ready, with automatic graph partitioning

How it works:

  • Product Quantization (PQ) + Real-time Reranking: Uses compressed PQ codes for fast graph traversal, then recomputes exact embeddings for final candidates
  • Automatic Graph Partitioning: When is_recompute=True, automatically partitions large indices and safely removes redundant files to save storage
  • Superior Speed-Accuracy Trade-off: Faster search than HNSW while maintaining high accuracy

Trade-offs compared to HNSW:

  • Faster search latency (typically 2-8x speedup)
  • Better scaling for large datasets
  • Smart storage management with automatic partitioning
  • Better graph locality with --ldg-times parameter for SSD optimization
  • ⚠️ Slightly larger index size due to PQ tables and graph metadata

# Recommended starting point for DiskANN
--backend-name diskann --graph-degree 32 --build-complexity 64

# For large-scale deployments
--backend-name diskann --graph-degree 64 --build-complexity 128

Performance Benchmark: Run python benchmarks/diskann_vs_hnsw_speed_comparison.py to compare DiskANN and HNSW on your system.

LLM Selection: Engine and Model Comparison

LLM Engines

OpenAI (--llm openai)

  • Pros: Best quality, consistent performance, no local resources needed
  • Cons: Costs money ($0.15-2.5 per million tokens), requires internet, data privacy concerns
  • Models: gpt-4o-mini (fast, cheap), gpt-4o (best quality), o3 (reasoning), o3-mini (reasoning, cheaper)
  • Thinking Budget: Use --thinking-budget low/medium/high for o-series reasoning models (o3, o3-mini, o4-mini)
  • Note: Our current default, but we recommend switching to Ollama for most use cases

Ollama (--llm ollama)

  • Pros: Fully local, free, privacy-preserving, good model variety
  • Cons: Requires local GPU/CPU resources, slower than cloud APIs, and needs the separate Ollama app installed plus models pre-downloaded with ollama pull
  • Models: qwen3:0.6b (ultra-fast), qwen3:1.7b (balanced), qwen3:4b (good quality), qwen3:7b (high quality), deepseek-r1:1.5b (reasoning)
  • Thinking Budget: Use --thinking-budget low/medium/high for reasoning models like gpt-oss:20b

HuggingFace (--llm hf)

  • Pros: Free tier available, huge model selection, direct model loading (vs Ollama's server-based approach)
  • Cons: More complex initial setup
  • Models: Qwen/Qwen3-1.7B-FP8
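To make the engine choice concrete, here is one illustrative invocation per engine (the query is a placeholder and the model choices are drawn from the lists above):

# OpenAI (requires OPENAI_API_KEY)
python apps/document_rag.py --query "What techniques does LEANN use?" --llm openai --llm-model gpt-4o-mini

# Ollama (pull the model first: ollama pull qwen3:1.7b)
python apps/document_rag.py --query "What techniques does LEANN use?" --llm ollama --llm-model qwen3:1.7b

# HuggingFace (direct model loading)
python apps/document_rag.py --query "What techniques does LEANN use?" --llm hf --llm-model Qwen/Qwen3-1.7B-FP8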

Parameter Tuning Guide

Search Complexity Parameters

--build-complexity (index building)

  • Controls thoroughness during index construction
  • Higher = better recall but slower build
  • Recommendations:
    • 32: Quick prototyping
    • 64: Balanced (default)
    • 128: Production systems
    • 256: Maximum quality
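A sketch of the two most common settings (other flags omitted):

# Quick prototyping build
--backend-name hnsw --build-complexity 32

# Production-quality build
--backend-name hnsw --build-complexity 128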

--search-complexity (query time)

  • Controls search thoroughness
  • Higher = better results but slower
  • Recommendations:
    • 16: Fast/Interactive search
    • 32: High quality with diversity
    • 64+: Maximum accuracy

Top-K Selection

--top-k (number of retrieved chunks)

  • More chunks = better context but slower LLM processing
  • Must always be smaller than --search-complexity (see the combined example below)
  • Guidelines:
    • 10-20: General questions (default: 20)
    • 30+: Complex multi-hop reasoning requiring comprehensive context
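A combined sketch of the two knobs (all other flags omitted; note that top-k stays below search complexity):

# Interactive: fast search, modest context
--search-complexity 16 --top-k 10

# Deep answers: thorough search, larger context
--search-complexity 64 --top-k 30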

Trade-off formula:

  • Retrieval time ∝ log(n) × search_complexity
  • LLM processing time ∝ top_k × chunk_size
  • Total context = top_k × chunk_size tokens
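For example, assuming a hypothetical 256-token chunk size, --top-k 20 yields roughly 20 × 256 = 5,120 tokens of retrieved context for the LLM to process, which is why raising top-k increases response latency.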

Thinking Budget for Reasoning Models

--thinking-budget (reasoning effort level)

  • Controls the computational effort for reasoning models
  • Options: low, medium, high
  • Guidelines:
    • low: Fast responses, basic reasoning (default for simple queries)
    • medium: Balanced speed and reasoning depth
    • high: Maximum reasoning effort, best for complex analytical questions
  • Supported Models:
    • Ollama: gpt-oss:20b, gpt-oss:120b
    • OpenAI: o3, o3-mini, o4-mini, o1 (o-series reasoning models)
  • Note: Models without reasoning support will show a warning and proceed without reasoning parameters
  • Example: --thinking-budget high for complex analytical questions

📖 For detailed usage examples and implementation details, check out Thinking Budget Documentation

💡 Quick Examples:

# OpenAI o-series reasoning model
python apps/document_rag.py --query "What are the main techniques LEANN explores?" \
  --index-dir hnswbuild --backend hnsw \
  --llm openai --llm-model o3 --thinking-budget medium

# Ollama reasoning model
python apps/document_rag.py --query "What are the main techniques LEANN explores?" \
  --index-dir hnswbuild --backend hnsw \
  --llm ollama --llm-model gpt-oss:20b --thinking-budget high

Graph Degree (HNSW/DiskANN)

--graph-degree

  • Number of connections per node in the graph
  • Higher = better recall but more memory
  • HNSW: 16-32 (default: 32)
  • DiskANN: 32-128 (default: 64)
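For example, using the defaults above (illustrative; other flags omitted):

# HNSW default connectivity
--backend-name hnsw --graph-degree 32

# DiskANN default connectivity
--backend-name diskann --graph-degree 64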

Performance Optimization Checklist

If Embedding is Too Slow

  1. Switch to smaller model:

    # From large model
    --embedding-model Qwen/Qwen3-Embedding-0.6B
    # To small model
    --embedding-model sentence-transformers/all-MiniLM-L6-v2
    
  2. Limit dataset size for testing:

    --max-items 1000  # Process first 1k items only
    
  3. Use MLX on Apple Silicon (optional optimization):

    --embedding-mode mlx --embedding-model mlx-community/multilingual-e5-base-mlx
    

If Search Quality is Poor

  1. Increase retrieval count:

    --top-k 30  # Retrieve more candidates
    
  2. Upgrade embedding model:

    # For English
    --embedding-model BAAI/bge-base-en-v1.5
    # For multilingual
    --embedding-model intfloat/multilingual-e5-large
    

Understanding the Trade-offs

Every configuration choice involves trade-offs:

Factor | Small/Fast | Large/Quality
------ | ---------- | -------------
Embedding Model | all-MiniLM-L6-v2 | Qwen/Qwen3-Embedding-0.6B
Chunk Size | 512 tokens | 128 tokens
Index Type | HNSW | DiskANN
LLM | qwen3:1.7b | gpt-4o

The key is finding the right balance for your specific use case. Start small and simple, measure performance, then scale up only where needed.

Deep Dive: Critical Configuration Decisions

When to Disable Recomputation

LEANN's recomputation feature provides exact distance calculations but can be disabled for extreme QPS requirements:

--no-recompute  # Disable selective recomputation

Trade-offs:

  • With recomputation (default): Exact distances, best quality, higher latency, minimal storage (only stores metadata, recomputes embeddings on-demand)
  • Without recomputation: Must store full embeddings, significantly higher memory and storage usage (10-100x more), but faster search

Disable when:

  • You have abundant storage and memory
  • Need extremely low latency (< 100ms)
  • Running a read-heavy workload where storage cost is acceptable
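A minimal sketch, assuming your app forwards the flag (the query is a placeholder):

# Store full embeddings at build time and skip recomputation at search time
python -m apps.document_rag --no-recompute --query "What techniques does LEANN use?"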

Further Reading