docs: Improve configuration guide based on feedback

- List specific files in default data/ directory (2 AI papers, literature, tech report)
- Update examples to use English and better RAG-suitable queries
- Change full dataset reference to use --max-items -1
- Adjust small model guidance about upgrading to larger models when time allows
- Update top-k defaults to reflect actual default of 20
- Ensure consistent use of full model name Qwen/Qwen3-Embedding-0.6B
- Reorder optimization steps, move MLX to third position
- Remove incorrect chunk size tuning guidance
- Change README from 'Having trouble' to 'Need best practices'
Author: Andy Lee
Date: 2025-08-04 19:29:17 -07:00
parent 00f506c0bd
commit d9b6f195c5
2 changed files with 17 additions and 23 deletions

README.md

@@ -170,7 +170,7 @@ ollama pull llama3.2:1b
 LEANN provides flexible parameters for embedding models, search strategies, and data processing to fit your specific needs.
-📚 **Having trouble with configuration?** Check our [Configuration Guide](docs/configuration-guide.md) for detailed optimization tips, model selection advice, and solutions to common issues like slow embeddings or poor search quality.
+📚 **Need configuration best practices?** Check our [Configuration Guide](docs/configuration-guide.md) for detailed optimization tips, model selection advice, and solutions to common issues like slow embeddings or poor search quality.
 <details>
 <summary><strong>📋 Click to expand: Common Parameters (Available in All Examples)</strong></summary>
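For context on those common parameters, a combined invocation might look like the sketch below; the flag pairing is illustrative and not part of this commit, though each flag appears individually elsewhere in the guide:
```bash
# Illustrative: common flags shared across the example apps
python -m apps.document_rag \
  --embedding-model sentence-transformers/all-MiniLM-L6-v2 \
  --top-k 20 \
  --query "What techniques does LEANN use?"
```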

docs/configuration-guide.md

@@ -6,7 +6,7 @@ This guide helps you optimize LEANN for different use cases and understand the t
 When first trying LEANN, start with a small dataset to quickly validate your approach:
-**For document RAG**: The default `data/` directory works perfectly - just a few PDFs let you test in minutes
+**For document RAG**: The default `data/` directory works perfectly - includes 2 AI research papers, Pride and Prejudice literature, and a technical report
 ```bash
 python -m apps.document_rag --query "What techniques does LEANN use?"
 ```
@@ -14,17 +14,17 @@ python -m apps.document_rag --query "What techniques does LEANN use?"
 **For other data sources**: Limit the dataset size for quick testing
 ```bash
 # WeChat: Test with recent messages only
-python -m apps.wechat_rag --max-items 100 --query "昨天聊了什么"
+python -m apps.wechat_rag --max-items 100 --query "What did we discuss about the project timeline?"
 # Browser history: Last few days
-python -m apps.browser_rag --max-items 500 --query "AI papers I read"
+python -m apps.browser_rag --max-items 500 --query "Find documentation about vector databases"
 # Email: Recent inbox
-python -m apps.email_rag --max-items 200 --query "meeting schedules"
+python -m apps.email_rag --max-items 200 --query "Who sent updates about the deployment status?"
 ```
 Once validated, scale up gradually:
-- 100 documents → 1,000 → 10,000 → full dataset
+- 100 documents → 1,000 → 10,000 → full dataset (`--max-items -1`)
 - This helps identify issues early before committing to long processing times
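To make that scale-up path concrete, a validation sequence might look like the sketch below, reusing the `browser_rag` flags from the snippet above (`--max-items -1` for the full dataset follows this commit's change):
```bash
# Illustrative: scale up only after each smaller run looks correct
python -m apps.browser_rag --max-items 100 --query "Find documentation about vector databases"
python -m apps.browser_rag --max-items 1000 --query "Find documentation about vector databases"
python -m apps.browser_rag --max-items -1 --query "Find documentation about vector databases"  # full dataset
```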
 ## Embedding Model Selection: Understanding the Trade-offs
@@ -35,7 +35,7 @@ Based on our experience developing LEANN, embedding models fall into three categ
 **Example**: `sentence-transformers/all-MiniLM-L6-v2` (22M params)
 - **Pros**: Lightweight, fast for both indexing and inference
 - **Cons**: Lower semantic understanding, may miss nuanced relationships
-- **Use when**: Speed is critical, handling simple queries, on interactive mode or just experimenting with LEANN
+- **Use when**: Speed is critical, handling simple queries, interactive mode, or just experimenting with LEANN. If indexing time allows, consider upgrading to a larger/better embedding model
 ### Medium Models (100M-500M parameters)
 **Example**: `facebook/contriever` (110M params), `BAAI/bge-base-en-v1.5` (110M params)
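To try one of these medium models, the `--embedding-model` flag from the troubleshooting snippets below applies directly; this particular command is a sketch, not part of the commit:
```bash
# Illustrative: swap in a medium model for better semantics at moderate cost
python -m apps.document_rag --embedding-model facebook/contriever --query "What techniques does LEANN use?"
```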
@@ -130,9 +130,8 @@ Based on our experience developing LEANN, embedding models fall into three categ
 - More chunks = better context but slower LLM processing
 - Should always be smaller than `--search-complexity`
 - Guidelines:
-  - 3-5: Simple factual queries
-  - 5-10: General questions (default)
-  - 10+: Complex multi-hop reasoning
+  - 10-20: General questions (default: 20)
+  - 30+: Complex multi-hop reasoning requiring comprehensive context
 **Trade-off formula**:
 - Retrieval time ∝ log(n) × search_complexity
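As a back-of-envelope illustration of that formula (the numbers are arithmetic only, not benchmarks): a 10× larger corpus grows the log term by roughly 25%, while doubling search complexity doubles retrieval time.
```bash
# retrieval_time ∝ log(n) × search_complexity — illustrative arithmetic only
python3 -c "import math; print(round(math.log(10_000)*64), round(math.log(100_000)*64), round(math.log(100_000)*128))"
# → 589 737 1474  (10× more data ≈ +25% time; 2× complexity ≈ 2× time)
```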
@@ -155,21 +154,21 @@ Based on our experience developing LEANN, embedding models fall into three categ
 1. **Switch to smaller model**:
 ```bash
 # From large model
---embedding-model Qwen/Qwen3-Embedding
+--embedding-model Qwen/Qwen3-Embedding-0.6B
 # To small model
 --embedding-model sentence-transformers/all-MiniLM-L6-v2
 ```
-2. **Use MLX on Apple Silicon**:
-```bash
---embedding-mode mlx --embedding-model mlx-community/multilingual-e5-base-mlx
-```
-3. **Limit dataset size for testing**:
+2. **Limit dataset size for testing**:
 ```bash
 --max-items 1000 # Process first 1k items only
 ```
+3. **Use MLX on Apple Silicon** (optional optimization):
+```bash
+--embedding-mode mlx --embedding-model mlx-community/multilingual-e5-base-mlx
+```
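Taken together, steps 1-2 above suggest a fast iteration loop like the following sketch; the flag values are copied from the snippets above, and combining them in a single `browser_rag` run is an assumption, not part of this commit:
```bash
# Illustrative: small model + capped dataset while iterating
python -m apps.browser_rag \
  --embedding-model sentence-transformers/all-MiniLM-L6-v2 \
  --max-items 1000 \
  --query "Find documentation about vector databases"
```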
 ### If Search Quality is Poor
 1. **Increase retrieval count**:
@@ -177,12 +176,7 @@ Based on our experience developing LEANN, embedding models fall into three categ
 --top-k 30 # Retrieve more candidates
 ```
-2. **Tune chunk size for your content**:
-   - Technical docs: `--chunk-size 512`
-   - Chat messages: `--chunk-size 128`
-   - Mixed content: `--chunk-size 256`
-3. **Upgrade embedding model**:
+2. **Upgrade embedding model**:
 ```bash
 # For English
 --embedding-model BAAI/bge-base-en-v1.5
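Combining the two fixes, a single run might look like this sketch (each flag is shown above; pairing them in one command is illustrative):
```bash
# Illustrative: retrieve more candidates with a stronger English model
python -m apps.document_rag \
  --embedding-model BAAI/bge-base-en-v1.5 \
  --top-k 30 \
  --query "What techniques does LEANN use?"
```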