LEANN/docs/configuration-guide.md
Andy Lee · 0d448c4a41 · docs: config guidance (#17) · 2025-08-04 22:50:32 -07:00

# LEANN Configuration Guide

This guide helps you optimize LEANN for different use cases and understand the trade-offs between various configuration options.

## Getting Started: Simple is Better

When first trying LEANN, start with a small dataset to quickly validate your approach:

**For document RAG:** The default `data/` directory works perfectly; it includes 2 AI research papers, the novel Pride and Prejudice, and a technical report.

```bash
python -m apps.document_rag --query "What techniques does LEANN use?"
```

**For other data sources:** Limit the dataset size for quick testing.

```bash
# WeChat: test with recent messages only
python -m apps.wechat_rag --max-items 100 --query "What did we discuss about the project timeline?"

# Browser history: last few days
python -m apps.browser_rag --max-items 500 --query "Find documentation about vector databases"

# Email: recent inbox
python -m apps.email_rag --max-items 200 --query "Who sent updates about the deployment status?"
```

Once validated, scale up gradually:

- 100 documents → 1,000 → 10,000 → full dataset (`--max-items -1`)
- This helps identify issues early, before committing to long processing times

## Embedding Model Selection: Understanding the Trade-offs

Based on our experience developing LEANN, embedding models fall into three categories:

### Small Models (< 100M parameters)

**Example:** `sentence-transformers/all-MiniLM-L6-v2` (22M params)

- **Pros:** Lightweight, fast for both indexing and inference
- **Cons:** Lower semantic understanding, may miss nuanced relationships
- **Use when:** Speed is critical, queries are simple, you are in interactive mode, or you are just experimenting with LEANN. If time is not a constraint, consider a larger, better embedding model.

### Medium Models (100M-500M parameters)

**Examples:** `facebook/contriever` (110M params), `BAAI/bge-base-en-v1.5` (110M params)

- **Pros:** Balanced performance, good multilingual support, reasonable speed
- **Cons:** Requires more compute than small models
- **Use when:** You need quality results without extreme compute requirements; general-purpose RAG applications

### Large Models (500M+ parameters)

**Examples:** `Qwen/Qwen3-Embedding-0.6B` (600M params), `intfloat/multilingual-e5-large` (560M params)

- **Pros:** Best semantic understanding, captures complex relationships, excellent multilingual support. Qwen3-Embedding-0.6B achieves nearly OpenAI-API-level performance!
- **Cons:** Slower inference, longer index build times
- **Use when:** Quality is paramount and you have sufficient compute resources. Highly recommended for production use
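Whatever their size, these models all map text to dense vectors that are compared by similarity at search time. A minimal pure-Python illustration of cosine similarity, the standard comparison (toy 3-d vectors for clarity; this is not LEANN's retrieval code, and real models emit hundreds of dimensions):

```python
# Illustrative only: how embedding similarity works, not LEANN's implementation.
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings"; real models emit 384-1024 dimensions.
query = [0.1, 0.9, 0.2]
chunk_close = [0.15, 0.85, 0.25]   # semantically similar chunk
chunk_far = [0.9, 0.1, 0.0]        # unrelated chunk

assert cosine_similarity(query, chunk_close) > cosine_similarity(query, chunk_far)
```

Larger models earn their cost by placing genuinely related texts closer together in this vector space.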

### Quick Start: OpenAI Embeddings (Fastest Setup)

For immediate testing without local model downloads:

```bash
# Set OpenAI embeddings (requires OPENAI_API_KEY)
--embedding-mode openai --embedding-model text-embedding-3-small
```

### Cloud vs Local Trade-offs

#### OpenAI Embeddings (text-embedding-3-small/large)

- **Pros:** No local compute needed, consistently fast, high quality
- **Cons:** Requires an API key, costs money, data leaves your system, known limitations with certain languages
- **When to use:** Prototyping, non-sensitive data, need immediate results

#### Local Embeddings

- **Pros:** Complete privacy, no ongoing costs, full control, can sometimes outperform OpenAI embeddings
- **Cons:** Slower than cloud APIs, requires local compute resources
- **When to use:** Production systems, sensitive data, cost-sensitive applications

## Index Selection: Matching Your Scale

### HNSW (Hierarchical Navigable Small World)

**Best for:** Small to medium datasets (< 10M vectors). The default backend, and recommended when storage must be kept extremely low.

- Full recomputation required
- High memory usage during build phase
- Excellent recall (95%+)

```bash
# Optimal for most use cases
--backend-name hnsw --graph-degree 32 --build-complexity 64
```

### DiskANN

**Best for:** Large datasets (> 10M vectors, 10GB+ index size). ⚠️ Beta version, still in active development.

- Uses Product Quantization (PQ) for coarse filtering during graph traversal
- Novel approach: stores only PQ codes, then reranks with exact distance computation in the final step
- Implements a corner case of the double-queue design: prunes all neighbors and recomputes at the end

```bash
# For billion-scale deployments
--backend-name diskann --graph-degree 64 --build-complexity 128
```

## LLM Selection: Engine and Model Comparison

### LLM Engines

#### OpenAI (`--llm openai`)

- **Pros:** Best quality, consistent performance, no local resources needed
- **Cons:** Costs money ($0.15-2.5 per million tokens), requires internet access, data privacy concerns
- **Models:** gpt-4o-mini (fast, cheap), gpt-4o (best quality), o3-mini (reasoning, moderately priced)
- **Note:** Our current default, but we recommend switching to Ollama for most use cases

#### Ollama (`--llm ollama`)

- **Pros:** Fully local, free, privacy-preserving, good model variety
- **Cons:** Requires local GPU/CPU resources and is slower than cloud APIs; you must install the separate Ollama app and pre-download models with `ollama pull`
- **Models:** qwen3:0.6b (ultra-fast), qwen3:1.7b (balanced), qwen3:4b (good quality), qwen3:7b (high quality), deepseek-r1:1.5b (reasoning)
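If you want to script against a locally running Ollama server directly (outside LEANN), its REST API can be reached with the standard library alone. A hedged sketch, assuming `ollama serve` is running on the default port and the model has already been pulled with `ollama pull qwen3:0.6b`:

```python
# Hedged sketch: call a local Ollama server via its REST API.
# Assumes `ollama serve` is running and the model has been pulled.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "qwen3:0.6b") -> urllib.request.Request:
    """Build the POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

def generate(prompt: str, model: str = "qwen3:0.6b") -> str:
    """Send the prompt and return the model's full (non-streamed) response."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

This is the same server-based approach LEANN's `--llm ollama` engine relies on, which is why the app must be installed and running first.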

#### HuggingFace (`--llm hf`)

- **Pros:** Free tier available, huge model selection, direct model loading (vs. Ollama's server-based approach)
- **Cons:** More complex initial setup
- **Models:** Qwen/Qwen3-1.7B-FP8

## Parameter Tuning Guide

### Search Complexity Parameters

#### `--build-complexity` (index building)

- Controls thoroughness during index construction
- Higher = better recall but slower build
- Recommendations:
  - 32: quick prototyping
  - 64: balanced (default)
  - 128: production systems
  - 256: maximum quality

#### `--search-complexity` (query time)

- Controls search thoroughness
- Higher = better results but slower
- Recommendations:
  - 16: fast/interactive search
  - 32: high quality with diversity
  - 64+: maximum accuracy

### Top-K Selection

#### `--top-k` (number of retrieved chunks)

- More chunks = better context but slower LLM processing
- Should always be smaller than `--search-complexity`
- Guidelines:
  - 10-20: general questions (default: 20)
  - 30+: complex multi-hop reasoning requiring comprehensive context

**Trade-off formula:**

- Retrieval time ∝ log(n) × search_complexity
- LLM processing time ∝ top_k × chunk_size
- Total context = top_k × chunk_size tokens
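To make these formulas concrete, a quick back-of-the-envelope script; `n`, `chunk_size`, and the constants here are illustrative assumptions, not measured LEANN numbers:

```python
# Back-of-the-envelope check of the trade-off formulas (illustrative constants).
import math

n = 1_000_000          # indexed chunks (assumed)
search_complexity = 32
top_k = 20             # the default
chunk_size = 256       # tokens per chunk (assumed)

retrieval_cost = math.log(n) * search_complexity   # ∝ distance computations per query
total_context = top_k * chunk_size                  # tokens handed to the LLM

print(f"retrieval cost proxy: {retrieval_cost:.0f}")
print(f"total context: {total_context} tokens")    # prints 5120 tokens

# Doubling top_k doubles LLM-side work but leaves retrieval cost unchanged.
assert 2 * top_k * chunk_size == 2 * total_context
```

The logarithmic retrieval term is why index size grows much more cheaply than context size: raising `top_k` hits the LLM directly, while raising `n` barely moves the search cost.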

### Graph Degree (HNSW/DiskANN)

#### `--graph-degree`

- Number of connections per node in the graph
- Higher = better recall but more memory
- HNSW: 16-32 (default: 32)
- DiskANN: 32-128 (default: 64)
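To get a rough sense of what "more memory" means here: each node stores `graph-degree` neighbor IDs, so adjacency storage grows linearly with degree. A hedged estimate (4-byte IDs and a flat adjacency layout are assumptions, not LEANN's actual on-disk format):

```python
# Rough memory estimate for a graph index's adjacency lists (assumptions inline).
def graph_memory_bytes(num_vectors: int, graph_degree: int, bytes_per_id: int = 4) -> int:
    """Each node stores `graph_degree` neighbor IDs (4-byte ints assumed)."""
    return num_vectors * graph_degree * bytes_per_id

hnsw = graph_memory_bytes(1_000_000, 32)     # HNSW default degree
diskann = graph_memory_bytes(1_000_000, 64)  # DiskANN default degree
print(f"HNSW:    {hnsw / 1e6:.0f} MB")   # prints 128 MB
print(f"DiskANN: {diskann / 1e6:.0f} MB")  # prints 256 MB
```

Doubling the degree doubles this cost, which is why the higher DiskANN defaults only pay off at larger scales.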

## Performance Optimization Checklist

### If Embedding is Too Slow

1. **Switch to a smaller model:**

   ```bash
   # From a large model
   --embedding-model Qwen/Qwen3-Embedding-0.6B
   # To a small model
   --embedding-model sentence-transformers/all-MiniLM-L6-v2
   ```

2. **Limit dataset size for testing:**

   ```bash
   --max-items 1000  # Process first 1k items only
   ```

3. **Use MLX on Apple Silicon (optional optimization):**

   ```bash
   --embedding-mode mlx --embedding-model mlx-community/multilingual-e5-base-mlx
   ```

### If Search Quality is Poor

1. **Increase retrieval count:**

   ```bash
   --top-k 30  # Retrieve more candidates
   ```

2. **Upgrade the embedding model:**

   ```bash
   # For English
   --embedding-model BAAI/bge-base-en-v1.5
   # For multilingual
   --embedding-model intfloat/multilingual-e5-large
   ```

## Understanding the Trade-offs

Every configuration choice involves trade-offs:

| Factor | Small/Fast | Large/Quality |
|---|---|---|
| Embedding Model | all-MiniLM-L6-v2 | Qwen/Qwen3-Embedding-0.6B |
| Chunk Size | 512 tokens | 128 tokens |
| Index Type | HNSW | DiskANN |
| LLM | qwen3:1.7b | gpt-4o |

The key is finding the right balance for your specific use case. Start small and simple, measure performance, then scale up only where needed.

## Deep Dive: Critical Configuration Decisions

### When to Disable Recomputation

LEANN's recomputation feature provides exact distance calculations but can be disabled for extreme QPS requirements:

```bash
--no-recompute  # Disable selective recomputation
```

**Trade-offs:**

- **With recomputation (default):** Exact distances and best quality, at higher latency; minimal storage (only metadata is stored, and embeddings are recomputed on demand)
- **Without recomputation:** Full embeddings must be stored, so memory and storage usage are significantly higher (10-100x), but search is faster
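To size the "without recomputation" cost, a quick estimate of what storing full embeddings takes; fp32 storage and the example corpus/model sizes are assumptions for illustration:

```python
# Why storing full embeddings is expensive: size = vectors x dims x 4 bytes (fp32 assumed).
def embedding_storage_gb(num_vectors: int, dims: int, bytes_per_float: int = 4) -> float:
    """Storage needed to keep every embedding, in gigabytes."""
    return num_vectors * dims * bytes_per_float / 1e9

# e.g. 1M chunks embedded with a 1024-dim model:
print(f"{embedding_storage_gb(1_000_000, 1024):.1f} GB")  # prints 4.1 GB
```

With recomputation enabled, that multi-gigabyte store shrinks to graph metadata, which is the core of LEANN's storage savings.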

**Disable when:**

- You have abundant storage and memory
- You need extremely low latency (< 100 ms)
- You run a read-heavy workload where the storage cost is acceptable

## Further Reading