refactor: Unify examples interface with BaseRAGExample

- Create BaseRAGExample base class for all RAG examples
- Refactor 4 examples to use unified interface:
  - document_rag.py (replaces main_cli_example.py)
  - email_rag.py (replaces mail_reader_leann.py)
  - browser_rag.py (replaces google_history_reader_leann.py)
  - wechat_rag.py (replaces wechat_history_reader_leann.py)
- Maintain 100% parameter compatibility with original files
- Add interactive mode support for all examples
- Unify parameter names (--max-items replaces --max-emails/--max-entries)
- Update README.md with new examples usage
- Add PARAMETER_CONSISTENCY.md documenting all parameter mappings
- Keep main_cli_example.py for backward compatibility with migration notice

All default values, LeannBuilder parameters, and chunking settings
remain identical to ensure full compatibility with existing indexes.
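In sketch form, a unified interface like the one this commit describes might look as follows. Class and method names here (`build_parser`, `add_specific_args`) are illustrative, not taken from the actual code; defaults follow the README where stated (`--max-items 1000`, `--top-k 20`), otherwise they are placeholders:

```python
import argparse


class BaseRAGExample:
    """Hypothetical sketch of a shared base class for RAG examples.

    Each concrete example (documents, email, browser, WeChat) subclasses
    this and only declares its source-specific flags.
    """

    name = "base"

    def build_parser(self) -> argparse.ArgumentParser:
        parser = argparse.ArgumentParser(description=f"{self.name} RAG example")
        # Core parameters shared by every example
        parser.add_argument("--index-dir", default=f"./{self.name}_index")
        parser.add_argument("--query", default=None,
                            help="Single query; interactive mode if omitted")
        parser.add_argument("--max-items", type=int, default=1000,
                            help="Max items to process (-1 for all)")
        parser.add_argument("--force-rebuild", action="store_true")
        # Embedding / LLM parameters (defaults are placeholders)
        parser.add_argument("--embedding-model", default="facebook/contriever")
        parser.add_argument("--embedding-mode", default="sentence-transformers")
        parser.add_argument("--llm", default="openai")
        parser.add_argument("--llm-model", default="gpt-4o")
        parser.add_argument("--top-k", type=int, default=20)
        self.add_specific_args(parser)
        return parser

    def add_specific_args(self, parser: argparse.ArgumentParser) -> None:
        """Hook for subclasses to register source-specific flags."""

    def parse_args(self, argv=None) -> argparse.Namespace:
        return self.build_parser().parse_args(argv)


class EmailRAG(BaseRAGExample):
    """What email_rag.py reduces to: only the email-specific flags."""

    name = "email"

    def add_specific_args(self, parser: argparse.ArgumentParser) -> None:
        parser.add_argument("--mail-path", default=None)
        parser.add_argument("--include-html", action="store_true")
```

This keeps the shared flags (and their defaults) in one place, which is what makes the `--max-items` rename and the 100% parameter compatibility above enforceable.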
Author: Andy Lee
Date: 2025-07-28 23:11:16 -07:00
Parent: 19bcc07814
Commit: 46f6f76fc3
8 changed files with 988 additions and 180 deletions

README.md

@@ -178,21 +178,39 @@ The example below asks a question about summarizing two papers (uses default dat
 ```bash
 source .venv/bin/activate
-python ./examples/main_cli_example.py
+python ./examples/document_rag.py --query "What are the main techniques LEANN explores?"
 ```
 <details>
 <summary><strong>📋 Click to expand: User Configurable Arguments</strong></summary>
+#### Core Parameters (All Examples Share These)
 ```bash
-# Use custom index directory
-python examples/main_cli_example.py --index-dir "./my_custom_index"
+--index-dir DIR # Directory to store the index
+--query "YOUR QUESTION" # Single query to run (interactive mode if omitted)
+--max-items N # Max items to process (default: 1000, -1 for all)
+--force-rebuild # Force rebuild index even if it exists
-# Use custom data directory
-python examples/main_cli_example.py --data-dir "./my_documents"
+# Embedding Parameters
+--embedding-model MODEL # e.g., facebook/contriever, text-embedding-3-small
+--embedding-mode MODE # sentence-transformers, openai, or mlx
-# Ask a specific question
-python examples/main_cli_example.py --query "What are the main findings in these papers?"
+# LLM Parameters
+--llm TYPE # openai, ollama, or hf
+--llm-model MODEL # e.g., gpt-4o, llama3.2:1b
+--top-k N # Number of results to retrieve (default: 20)
 ```
+#### Document-Specific Parameters
+```bash
+# Process custom documents
+python examples/document_rag.py --data-dir "./my_documents" --file-types .pdf .txt .md
+# Process with custom chunking
+python examples/document_rag.py --chunk-size 512 --chunk-overlap 256
+# Use different LLM
+python examples/document_rag.py --llm ollama --llm-model llama3.2:1b
+```
 </details>
@@ -208,28 +226,29 @@ python examples/main_cli_example.py --query "What are the main findings in these
 **Note:** You need to grant full disk access to your terminal/VS Code in System Preferences → Privacy & Security → Full Disk Access.
 ```bash
-python examples/mail_reader_leann.py --query "What's the food I ordered by DoorDash or Uber Eats mostly?"
+python examples/email_rag.py --query "What's the food I ordered by DoorDash or Uber Eats mostly?"
 ```
 **780K email chunks → 78MB storage.** Finally, search your email like you search Google.
 <details>
 <summary><strong>📋 Click to expand: User Configurable Arguments</strong></summary>
+#### Email-Specific Parameters
 ```bash
-# Use default mail path (works for most macOS setups)
-python examples/mail_reader_leann.py
+# Auto-detect and process all Apple Mail accounts
+python examples/email_rag.py
-# Run with custom index directory
-python examples/mail_reader_leann.py --index-dir "./my_mail_index"
+# Process specific mail directory
+python examples/email_rag.py --mail-path "~/Library/Mail/V10/..."
-# Process all emails (may take time but indexes everything)
-python examples/mail_reader_leann.py --max-emails -1
+# Process all emails (may take time)
+python examples/email_rag.py --max-items -1
-# Limit number of emails processed (useful for testing)
-python examples/mail_reader_leann.py --max-emails 1000
+# Include HTML content
+python examples/email_rag.py --include-html
-# Run a single query
-python examples/mail_reader_leann.py --query "What did my boss say about deadlines?"
+# Use different embedding model
+python examples/email_rag.py --embedding-model text-embedding-3-small --embedding-mode openai
 ```
 </details>
@@ -250,25 +269,29 @@ Once the index is built, you can ask questions like:
 </p>
 ```bash
-python examples/google_history_reader_leann.py --query "Tell me my browser history about machine learning?"
+python examples/browser_rag.py --query "Tell me my browser history about machine learning?"
 ```
 **38K browser entries → 6MB storage.** Your browser history becomes your personal search engine.
 <details>
 <summary><strong>📋 Click to expand: User Configurable Arguments</strong></summary>
+#### Browser-Specific Parameters
 ```bash
-# Use default Chrome profile (auto-finds all profiles)
-python examples/google_history_reader_leann.py
+# Auto-detect and process all Chrome profiles
+python examples/browser_rag.py
-# Run with custom index directory
-python examples/google_history_reader_leann.py --index-dir "./my_chrome_index"
+# Process specific Chrome profile
+python examples/browser_rag.py --chrome-profile "~/Library/Application Support/Google/Chrome/Default"
-# Limit number of history entries processed (useful for testing)
-python examples/google_history_reader_leann.py --max-entries 500
+# Limit history entries for testing
+python examples/browser_rag.py --max-items 500
-# Run a single query
-python examples/google_history_reader_leann.py --query "What websites did I visit about machine learning?"
+# Interactive search mode
+python examples/browser_rag.py # Without --query for interactive mode
+# Use local LLM for privacy
+python examples/browser_rag.py --llm ollama --llm-model llama3.2:1b
 ```
 </details>
@@ -308,7 +331,7 @@ Once the index is built, you can ask questions like:
 </p>
 ```bash
-python examples/wechat_history_reader_leann.py --query "Show me all group chats about weekend plans"
+python examples/wechat_rag.py --query "Show me all group chats about weekend plans"
 ```
 **400K messages → 64MB storage.** Search years of chat history in any language.
@@ -334,21 +357,22 @@ Failed to find or export WeChat data. Exiting.
 <details>
 <summary><strong>📋 Click to expand: User Configurable Arguments</strong></summary>
+#### WeChat-Specific Parameters
 ```bash
-# Use default settings (recommended for first run)
-python examples/wechat_history_reader_leann.py
+# Auto-export and index WeChat data
+python examples/wechat_rag.py
-# Run with custom export directory and wehn we run the first time, LEANN will export all chat history automatically for you
-python examples/wechat_history_reader_leann.py --export-dir "./my_wechat_exports"
+# Use custom export directory
+python examples/wechat_rag.py --export-dir "./my_wechat_exports"
-# Run with custom index directory
-python examples/wechat_history_reader_leann.py --index-dir "./my_wechat_index"
+# Force re-export even if data exists
+python examples/wechat_rag.py --force-export
-# Limit number of chat entries processed (useful for testing)
-python examples/wechat_history_reader_leann.py --max-entries 1000
+# Limit chat entries for testing
+python examples/wechat_rag.py --max-items 1000
-# Run a single query
-python examples/wechat_history_reader_leann.py --query "Show me conversations about travel plans"
+# Use HuggingFace model for Chinese support
+python examples/wechat_rag.py --llm hf --llm-model Qwen/Qwen2.5-1.5B-Instruct
 ```
 </details>