From 0d448c4a4187a35f084638641eb50861cbdfbd74 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Mon, 4 Aug 2025 22:50:32 -0700 Subject: [PATCH] docs: config guidance (#17) * docs: config guidance * feat: add comprehensive configuration guide and update README - Create docs/configuration-guide.md with detailed guidance on: - Embedding model selection (small/medium/large) - Index selection (HNSW vs DiskANN) - LLM engine and model comparison - Parameter tuning (build/search complexity, top-k) - Performance optimization tips - Deep dive into LEANN's recomputation feature - Update README.md to link to the configuration guide - Include latest 2025 model recommendations (Qwen3, DeepSeek-R1, O3-mini) * chore: move evaluation data .gitattributes to correct location * docs: Weaken DiskANN emphasis in README - Change backend description to emphasize HNSW as default - DiskANN positioned as optional for billion-scale datasets - Simplify evaluation commands to be more generic * docs: Adjust DiskANN positioning in features and roadmap - features.md: Put HNSW/FAISS first as default, DiskANN as optional - roadmap.md: Reorder to show HNSW integration before DiskANN - Consistent with positioning DiskANN as advanced option for large-scale use * docs: Improve configuration guide based on feedback - List specific files in default data/ directory (2 AI papers, literature, tech report) - Update examples to use English and better RAG-suitable queries - Change full dataset reference to use --max-items -1 - Adjust small model guidance about upgrading to larger models when time allows - Update top-k defaults to reflect actual default of 20 - Ensure consistent use of full model name Qwen/Qwen3-Embedding-0.6B - Reorder optimization steps, move MLX to third position - Remove incorrect chunk size tuning guidance - Change README from 'Having trouble' to 'Need best practices' * docs: Address all configuration guide feedback - Fix grammar: 'If time is not a constraint' instead of 'time expense is not large' - Highlight Qwen3-Embedding-0.6B performance (nearly OpenAI API level) - Add OpenAI quick start section with configuration example - Fold Cloud vs Local trade-offs into collapsible section - Update HNSW as 'default and recommended for extreme low storage' - Add DiskANN beta warning and explain PQ+rerank architecture - Expand Ollama models: add qwen3:0.6b, 4b, 7b variants - Note OpenAI as current default but recommend Ollama switch - Add 'need to install extra software' warning for Ollama - Remove incorrect latency numbers from search-complexity recommendations * docs: add a link --- README.md | 7 +- {data => benchmarks/data}/.gitattributes | 0 docs/configuration-guide.md | 236 ++++++++++++++++++ docs/features.md | 2 +- docs/roadmap.md | 2 +- .../leann-backend-diskann/third_party/DiskANN | 2 +- 6 files changed, 243 insertions(+), 6 deletions(-) rename {data => benchmarks/data}/.gitattributes (100%) create mode 100644 docs/configuration-guide.md diff --git a/README.md b/README.md index 332c11c..5fa5248 100755 --- a/README.md +++ b/README.md @@ -170,6 +170,8 @@ ollama pull llama3.2:1b LEANN provides flexible parameters for embedding models, search strategies, and data processing to fit your specific needs. +📚 **Need configuration best practices?** Check our [Configuration Guide](docs/configuration-guide.md) for detailed optimization tips, model selection advice, and solutions to common issues like slow embeddings or poor search quality. +
📋 Click to expand: Common Parameters (Available in All Examples) @@ -514,7 +516,7 @@ Options: - **Dynamic batching:** Efficiently batch embedding computations for GPU utilization - **Two-level search:** Smart graph traversal that prioritizes promising nodes -**Backends:** DiskANN or HNSW - pick what works for your data size. +**Backends:** HNSW (default) for most use cases, with optional DiskANN support for billion-scale datasets. ## Benchmarks @@ -534,8 +536,7 @@ Options: ```bash uv pip install -e ".[dev]" # Install dev dependencies -python benchmarks/run_evaluation.py data/indices/dpr/dpr_diskann # DPR dataset -python benchmarks/run_evaluation.py data/indices/rpj_wiki/rpj_wiki.index # Wikipedia +python benchmarks/run_evaluation.py # Will auto-download evaluation data and run benchmarks ``` The evaluation script downloads data automatically on first run. The last three results were tested with partial personal data, and you can reproduce them with your own data! diff --git a/data/.gitattributes b/benchmarks/data/.gitattributes similarity index 100% rename from data/.gitattributes rename to benchmarks/data/.gitattributes diff --git a/docs/configuration-guide.md b/docs/configuration-guide.md new file mode 100644 index 0000000..1546440 --- /dev/null +++ b/docs/configuration-guide.md @@ -0,0 +1,236 @@ +# LEANN Configuration Guide + +This guide helps you optimize LEANN for different use cases and understand the trade-offs between various configuration options. + +## Getting Started: Simple is Better + +When first trying LEANN, start with a small dataset to quickly validate your approach: + +**For document RAG**: The default `data/` directory works perfectly - includes 2 AI research papers, Pride and Prejudice literature, and a technical report +```bash +python -m apps.document_rag --query "What techniques does LEANN use?" +``` + +**For other data sources**: Limit the dataset size for quick testing +```bash +# WeChat: Test with recent messages only +python -m apps.wechat_rag --max-items 100 --query "What did we discuss about the project timeline?" + +# Browser history: Last few days +python -m apps.browser_rag --max-items 500 --query "Find documentation about vector databases" + +# Email: Recent inbox +python -m apps.email_rag --max-items 200 --query "Who sent updates about the deployment status?" +``` + +Once validated, scale up gradually: +- 100 documents → 1,000 → 10,000 → full dataset (`--max-items -1`) +- This helps identify issues early before committing to long processing times + +## Embedding Model Selection: Understanding the Trade-offs + +Based on our experience developing LEANN, embedding models fall into three categories: + +### Small Models (< 100M parameters) +**Example**: `sentence-transformers/all-MiniLM-L6-v2` (22M params) +- **Pros**: Lightweight, fast for both indexing and inference +- **Cons**: Lower semantic understanding, may miss nuanced relationships +- **Use when**: Speed is critical, handling simple queries, interactive mode, or just experimenting with LEANN. 
If time is not a constraint, consider upgrading to a larger, higher-quality embedding model.
+
+### Medium Models (100M-500M parameters)
+**Examples**: `facebook/contriever` (110M params), `BAAI/bge-base-en-v1.5` (110M params)
+- **Pros**: Balanced performance, good multilingual support, reasonable speed
+- **Cons**: Requires more compute than small models
+- **Use when**: You need quality results without extreme compute requirements; a good fit for general-purpose RAG applications
+
+### Large Models (500M+ parameters)
+**Examples**: `Qwen/Qwen3-Embedding-0.6B` (600M params), `intfloat/multilingual-e5-large` (560M params)
+- **Pros**: Best semantic understanding, captures complex relationships, excellent multilingual support. **Qwen3-Embedding-0.6B achieves nearly OpenAI-API-level performance!**
+- **Cons**: Slower inference, longer index build times
+- **Use when**: Quality is paramount and you have sufficient compute resources. **Highly recommended** for production use
+
+### Quick Start: OpenAI Embeddings (Fastest Setup)
+
+For immediate testing without local model downloads:
+```bash
+# Set OpenAI embeddings (requires OPENAI_API_KEY)
+--embedding-mode openai --embedding-model text-embedding-3-small
+```
+
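+For example, a quick end-to-end test with OpenAI embeddings could look like the sketch below. The entry point and query reuse the document-RAG example from the Getting Started section; treat the exact combination of flags as illustrative rather than the only supported invocation.
+
+```bash
+# Assumes OPENAI_API_KEY is already set in your environment
+python -m apps.document_rag \
+  --embedding-mode openai \
+  --embedding-model text-embedding-3-small \
+  --query "What techniques does LEANN use?"
+```
+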
+<details>
+<summary>Cloud vs Local Trade-offs</summary>
+
+**OpenAI Embeddings** (`text-embedding-3-small/large`)
+- **Pros**: No local compute needed, consistently fast, high quality
+- **Cons**: Requires an API key, costs money, data leaves your system, [known limitations with certain languages](https://yichuan-w.github.io/blog/lessons_learned_in_dev_leann/)
+- **When to use**: Prototyping, non-sensitive data, or when you need immediate results
+
+**Local Embeddings**
+- **Pros**: Complete privacy, no ongoing costs, full control, can sometimes outperform OpenAI embeddings
+- **Cons**: Slower than cloud APIs, requires local compute resources
+- **When to use**: Production systems, sensitive data, cost-sensitive applications
+
+</details>
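+
+As a fully local counterpart to the OpenAI quick start above, the sketch below combines a local embedding model with an Ollama LLM. The flags are the ones documented in this guide; the specific models are illustrative, and how the Ollama chat model is selected may depend on the app's own options.
+
+```bash
+# Everything stays on your machine: local embeddings + local LLM
+ollama pull qwen3:1.7b   # pre-download the chat model once
+python -m apps.document_rag \
+  --embedding-model facebook/contriever \
+  --llm ollama \
+  --query "What techniques does LEANN use?"
+```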
+
+## Index Selection: Matching Your Scale
+
+### HNSW (Hierarchical Navigable Small World)
+**Best for**: Small to medium datasets (< 10M vectors) - **Default and recommended for extremely low storage**
+- Full recomputation required
+- High memory usage during build phase
+- Excellent recall (95%+)
+
+```bash
+# Optimal for most use cases
+--backend-name hnsw --graph-degree 32 --build-complexity 64
+```
+
+### DiskANN
+**Best for**: Large datasets (> 10M vectors, 10GB+ index size) - **⚠️ Beta version, still in active development**
+- Uses Product Quantization (PQ) for coarse filtering during graph traversal
+- Novel approach: stores only PQ codes and reranks with exact computation in the final step
+- Implements a corner case of the double-queue design: prunes all neighbors and recomputes distances at the end
+
+```bash
+# For billion-scale deployments
+--backend-name diskann --graph-degree 64 --build-complexity 128
+```
+
+## LLM Selection: Engine and Model Comparison
+
+### LLM Engines
+
+**OpenAI** (`--llm openai`)
+- **Pros**: Best quality, consistent performance, no local resources needed
+- **Cons**: Costs money ($0.15-2.5 per million tokens), requires internet, data privacy concerns
+- **Models**: `gpt-4o-mini` (fast, cheap), `gpt-4o` (best quality), `o3-mini` (reasoning, relatively inexpensive)
+- **Note**: Our current default, but we recommend switching to Ollama for most use cases
+
+**Ollama** (`--llm ollama`)
+- **Pros**: Fully local, free, privacy-preserving, good model variety
+- **Cons**: Requires local GPU/CPU resources, slower than cloud APIs, and needs the separate [Ollama app](https://github.com/ollama/ollama?tab=readme-ov-file#ollama) installed plus models pre-downloaded with `ollama pull`
+- **Models**: `qwen3:0.6b` (ultra-fast), `qwen3:1.7b` (balanced), `qwen3:4b` (good quality), `qwen3:8b` (high quality), `deepseek-r1:1.5b` (reasoning)
+
+**HuggingFace** (`--llm hf`)
+- **Pros**: Free tier available, huge model selection, direct model loading (vs Ollama's server-based approach)
+- **Cons**: More complex initial setup
+- **Models**: `Qwen/Qwen3-1.7B-FP8`
+
+## Parameter Tuning Guide
+
+### Search Complexity Parameters
+
+**`--build-complexity`** (index building)
+- Controls thoroughness during index construction
+- Higher = better recall but slower build
+- Recommendations:
+  - 32: Quick prototyping
+  - 64: Balanced (default)
+  - 128: Production systems
+  - 256: Maximum quality
+
+**`--search-complexity`** (query time)
+- Controls search thoroughness
+- Higher = better results but slower
+- Recommendations:
+  - 16: Fast/interactive search
+  - 32: High quality with diversity
+  - 64+: Maximum accuracy
+
+### Top-K Selection
+
+**`--top-k`** (number of retrieved chunks)
+- More chunks = better context but slower LLM processing
+- Should always be smaller than `--search-complexity`
+- Guidelines:
+  - 10-20: General questions (default: 20)
+  - 30+: Complex multi-hop reasoning requiring comprehensive context
+
+**Trade-off formula**:
+- Retrieval time ∝ log(n) × search_complexity
+- LLM processing time ∝ top_k × chunk_size
+- Total context = top_k × chunk_size tokens
+
+### Graph Degree (HNSW/DiskANN)
+
+**`--graph-degree`**
+- Number of connections per node in the graph
+- Higher = better recall but more memory
+- HNSW: 16-32 (default: 32)
+- DiskANN: 32-128 (default: 64)
+
+
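+Putting the tuning parameters together, a balanced starting point might look like the sketch below. The flag values follow the recommendations above; the document-RAG entry point and query are just placeholders for your own workload.
+
+```bash
+# Balanced starting configuration: HNSW backend, moderate effort, default top-k
+python -m apps.document_rag \
+  --backend-name hnsw \
+  --graph-degree 32 \
+  --build-complexity 64 \
+  --search-complexity 32 \
+  --top-k 20 \
+  --query "What techniques does LEANN use?"
+```
+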
+## Performance Optimization Checklist
+
+### If Embedding is Too Slow
+
+1. **Switch to a smaller model**:
+   ```bash
+   # From large model
+   --embedding-model Qwen/Qwen3-Embedding-0.6B
+   # To small model
+   --embedding-model sentence-transformers/all-MiniLM-L6-v2
+   ```
+
+2. **Limit dataset size for testing**:
+   ```bash
+   --max-items 1000 # Process first 1k items only
+   ```
+
+3. **Use MLX on Apple Silicon** (optional optimization):
+   ```bash
+   --embedding-mode mlx --embedding-model mlx-community/multilingual-e5-base-mlx
+   ```
+
+### If Search Quality is Poor
+
+1. **Increase retrieval count**:
+   ```bash
+   --top-k 30 # Retrieve more candidates
+   ```
+
+2. **Upgrade the embedding model**:
+   ```bash
+   # For English
+   --embedding-model BAAI/bge-base-en-v1.5
+   # For multilingual
+   --embedding-model intfloat/multilingual-e5-large
+   ```
+
+## Understanding the Trade-offs
+
+Every configuration choice involves trade-offs:
+
+| Factor | Small/Fast | Large/Quality |
+|--------|------------|---------------|
+| Embedding Model | `all-MiniLM-L6-v2` | `Qwen/Qwen3-Embedding-0.6B` |
+| Chunk Size | 512 tokens | 128 tokens |
+| Index Type | HNSW | DiskANN |
+| LLM | `qwen3:1.7b` | `gpt-4o` |
+
+The key is finding the right balance for your specific use case. Start small and simple, measure performance, then scale up only where needed.
+
+## Deep Dive: Critical Configuration Decisions
+
+### When to Disable Recomputation
+
+LEANN's recomputation feature provides exact distance calculations but can be disabled for extreme QPS requirements:
+
+```bash
+--no-recompute # Disable selective recomputation
+```
+
+**Trade-offs**:
+- **With recomputation** (default): Exact distances, best quality, higher latency, minimal storage (only stores metadata, recomputes embeddings on demand)
+- **Without recomputation**: Must store full embeddings, significantly higher memory and storage usage (10-100x more), but faster search
+
+**Disable when**:
+- You have abundant storage and memory
+- You need extremely low latency (< 100ms)
+- You are running a read-heavy workload where storage cost is acceptable
+
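+For a latency-critical, read-heavy deployment with storage to spare, the flag is simply added to the usual invocation (a sketch; the entry point and backend flag mirror the earlier examples):
+
+```bash
+# Trade storage for speed: keep full embeddings and skip selective recomputation
+python -m apps.document_rag \
+  --backend-name hnsw \
+  --no-recompute \
+  --query "What techniques does LEANN use?"
+```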
+
+## Further Reading
+
+- [Lessons Learned Developing LEANN](https://yichuan-w.github.io/blog/lessons_learned_in_dev_leann/)
+- [LEANN Technical Paper](https://arxiv.org/abs/2506.08276)
+- [DiskANN Original Paper](https://papers.nips.cc/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf) diff --git a/docs/features.md b/docs/features.md index 51c0c4f..da4e495 100644 --- a/docs/features.md +++ b/docs/features.md @@ -5,7 +5,7 @@ - **🔄 Real-time Embeddings** - Eliminate heavy embedding storage with dynamic computation using optimized ZMQ servers and highly optimized search paradigm (overlapping and batching) with highly optimized embedding engine - **📈 Scalable Architecture** - Handles millions of documents on consumer hardware; the larger your dataset, the more LEANN can save - **🎯 Graph Pruning** - Advanced techniques to minimize the storage overhead of vector search to a limited footprint -- **🏗️ Pluggable Backends** - DiskANN, HNSW/FAISS with unified API +- **🏗️ Pluggable Backends** - HNSW/FAISS (default), with optional DiskANN for large-scale deployments ## 🛠️ Technical Highlights - **🔄 Recompute Mode** - Highest accuracy scenarios while eliminating vector storage overhead diff --git a/docs/roadmap.md b/docs/roadmap.md index c9446df..fa04b5c 100644 --- a/docs/roadmap.md +++ b/docs/roadmap.md @@ -2,8 +2,8 @@ ## 🎯 Q2 2025 -- [X] DiskANN backend with MIPS/L2/Cosine support - [X] HNSW backend integration +- [X] DiskANN backend with MIPS/L2/Cosine support - [X] Real-time embedding pipeline - [X] Memory-efficient graph pruning diff --git a/packages/leann-backend-diskann/third_party/DiskANN b/packages/leann-backend-diskann/third_party/DiskANN index af2a264..67a2611 160000 --- a/packages/leann-backend-diskann/third_party/DiskANN +++ b/packages/leann-backend-diskann/third_party/DiskANN @@ -1 +1 @@ -Subproject commit af2a26481e65232b57b82d96e68833cdee9f7635 +Subproject commit 67a2611ad14bc11d84dfdb554c5567cfb78a2656