refactor: Unify examples interface with BaseRAGExample (#12)

* refactor: Unify examples interface with BaseRAGExample

- Create BaseRAGExample base class for all RAG examples
- Refactor 4 examples to use unified interface:
  - document_rag.py (replaces main_cli_example.py)
  - email_rag.py (replaces mail_reader_leann.py)
  - browser_rag.py (replaces google_history_reader_leann.py)
  - wechat_rag.py (replaces wechat_history_reader_leann.py)
- Maintain 100% parameter compatibility with original files
- Add interactive mode support for all examples
- Unify parameter names (--max-items replaces --max-emails/--max-entries)
- Update README.md with new examples usage
- Add PARAMETER_CONSISTENCY.md documenting all parameter mappings
- Keep main_cli_example.py for backward compatibility with migration notice

All default values, LeannBuilder parameters, and chunking settings
remain identical to ensure full compatibility with existing indexes.
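
For illustration, a minimal sketch of the shared interface (class, method, and parameter names here are assumptions for explanation, not the actual implementation):

```python
import argparse


class BaseRAGExample:
    """Hypothetical sketch: shared CLI parsing and chat loop for all RAG examples."""

    def __init__(self, name: str, default_chunk_size: int = 256):
        self.parser = argparse.ArgumentParser(description=f"{name} RAG example")
        # Common parameters shared by every example
        self.parser.add_argument("--index-dir", default=".")
        self.parser.add_argument("--query", default=None)
        self.parser.add_argument("--max-items", type=int, default=-1)
        self.parser.add_argument("--chunk-size", type=int, default=default_chunk_size)

    def load_data(self, args: argparse.Namespace) -> list[str]:
        raise NotImplementedError  # each example supplies its own data loader

    def ask(self, query: str) -> str:
        raise NotImplementedError  # build/search the index and generate an answer

    def run(self) -> None:
        args = self.parser.parse_args()
        self.load_data(args)
        if args.query:  # single-query mode
            print(self.ask(args.query))
        else:  # interactive mode: type 'quit' to exit
            while True:
                question = input("You: ").strip()
                if question.lower() == "quit":
                    break
                print(self.ask(question))
```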

* fix: Update CI tests for new unified examples interface

- Rename test_main_cli.py to test_document_rag.py
- Update all references from main_cli_example.py to document_rag.py
- Update tests/README.md documentation

The tests now properly test the new unified interface while maintaining
the same test coverage and functionality.

* fix: Fix pre-commit issues and update tests

- Fix import sorting and unused imports
- Update type annotations to use built-in types (list, dict) instead of typing.List/Dict
- Fix trailing whitespace and end-of-file issues
- Fix Chinese fullwidth comma to regular comma
- Update test_main_cli.py to test_document_rag.py
- Add backward compatibility test for main_cli_example.py
- Pass all pre-commit hooks (ruff, ruff-format, etc.)
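
As an aside, the type-annotation modernization (PEP 585 built-in generics) amounts to the following (function name illustrative):

```python
# Before: typing module generics
from typing import Dict, List

def load_chunks(paths: List[str]) -> Dict[str, str]: ...

# After: built-in generics (Python 3.9+)
def load_chunks(paths: list[str]) -> dict[str, str]: ...
```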

* refactor: Remove old example scripts and migration references

- Delete old example scripts (mail_reader_leann.py, google_history_reader_leann.py, etc.)
- Remove migration hints and backward compatibility
- Update tests to use new unified examples directly
- Clean up all references to old script names
- Users now only see the new unified interface

* fix: Restore embedding-mode parameter to all examples

- All examples now have --embedding-mode parameter (unified interface benefit)
- Default is 'sentence-transformers' (consistent with original behavior)
- Users can now use OpenAI or MLX embeddings with any data source
- Maintains functional equivalence with original scripts

* docs: Improve parameter categorization in README

- Clearly separate core (shared) vs specific parameters
- Move LLM and embedding examples to 'Example Commands' section
- Add descriptive comments for all specific parameters
- Keep only truly data-source-specific parameters in specific sections

* docs: Make example commands more representative

- Add default values to parameter descriptions
- Replace generic examples with real-world use cases
- Focus on data-source-specific features in examples
- Remove redundant demonstrations of common parameters

* docs: Reorganize parameter documentation structure

- Move common parameters to a dedicated section before all examples
- Rename sections to 'X-Specific Arguments' for clarity
- Remove duplicate common parameters from individual examples
- Better information architecture for users

* docs: polish applications

* docs: Add CLI installation instructions

- Add two installation options: venv and global uv tool
- Clearly explain when to use each option
- Make CLI more accessible for daily use

* docs: Clarify CLI global installation process

- Explain the transition from venv to global installation
- Add upgrade command for global installation
- Make it clear that global install allows usage without venv activation

* docs: Add collapsible section for CLI installation

- Wrap CLI installation instructions in details/summary tags
- Keep consistent with other collapsible sections in README
- Improve document readability and navigation

* style: format

* docs: Fix collapsible sections

- Make Common Parameters collapsible (as it's lengthy reference material)
- Keep CLI Installation visible (important for users to see immediately)
- Better information hierarchy

* docs: Add introduction for Common Parameters section

- Add 'Flexible Configuration' heading with descriptive sentence
- Create parallel structure with 'Generation Model Setup' section
- Improve document flow and readability

* docs: nit

* fix: Fix issues in unified examples

- Add smart path detection for data directory
- Fix add_texts -> add_text method call
- Handle both running from project root and examples directory

* fix: Fix async/await and add_text issues in unified examples

- Remove incorrect await from chat.ask() calls (not async)
- Fix add_texts -> add_text method calls
- Verify search-complexity correctly maps to efSearch parameter
- All examples now run successfully
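
The corrected call pattern, in a hedged sketch (the index path and constructor argument are assumptions; the synchronous `chat.ask()` usage matches the README quick-start):

```python
from leann import LeannChat

chat = LeannChat("./my_index")  # index path is illustrative

# Incorrect (the removed bug): chat.ask() is not a coroutine, so
# `await chat.ask(...)` fails outside an async context.
# response = await chat.ask("What are the main findings?", top_k=20)

# Correct: a plain synchronous call.
response = chat.ask("What are the main findings?", top_k=20)
print(response)
```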

* feat: Address review comments

- Add complexity parameter to LeannChat initialization (default: search_complexity)
- Fix chunk-size default in README documentation (256, not 2048)
- Add more index building parameters as CLI arguments:
  - --backend-name (hnsw/diskann)
  - --graph-degree (default: 32)
  - --build-complexity (default: 64)
  - --no-compact (disable compact storage)
  - --no-recompute (disable embedding recomputation)
- Update README to document all new parameters
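
A short sketch of the new knob (the keyword name follows this commit note; the positional index-path argument is an assumption):

```python
from leann import LeannChat

# complexity defaults to the --search-complexity value (32)
chat = LeannChat("./my_index", complexity=64)
response = chat.ask("Summarize the papers", top_k=20)
```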

* feat: Add chunk-size parameters and improve file type filtering

- Add --chunk-size and --chunk-overlap parameters to all RAG examples
- Preserve original default values for each data source:
  - Document: 256/128 (optimized for general documents)
  - Email: 256/25 (smaller overlap for email threads)
  - Browser: 256/128 (standard for web content)
  - WeChat: 192/64 (smaller chunks for chat messages)
- Make --file-types optional filter instead of restriction in document_rag
- Update README to clarify interactive mode and parameter usage
- Fix LLM default model documentation (gpt-4o, not gpt-4o-mini)

* feat: Update documentation based on review feedback

- Add MLX embedding example to README
- Clarify examples/data content description (two papers, Pride and Prejudice, Chinese README)
- Move chunk parameters to common parameters section
- Remove duplicate chunk parameters from document-specific section

* docs: Emphasize diverse data sources in examples/data description

* fix: update default embedding models for better performance

- Change WeChat, Browser, and Email RAG examples to use all-MiniLM-L6-v2
- Previous Qwen/Qwen3-Embedding-0.6B was too slow for these use cases
- all-MiniLM-L6-v2 is a fast 384-dim model, ideal for large-scale personal data

* feat: add response highlighting

* refactor: change index rebuild logic

* fix some examples

* feat: check whether top-k is larger than the number of indexed documents
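
A guard along these lines (function and variable names are illustrative, not the actual code):

```python
def effective_top_k(requested_k: int, num_docs: int) -> int:
    """Clamp top-k so a search never requests more results than the index holds."""
    if requested_k > num_docs:
        print(f"Warning: top_k={requested_k} > {num_docs} indexed docs; using {num_docs}.")
        return num_docs
    return requested_k
```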

* fix: WeChat history reader bugs and refactor wechat_rag to use unified architecture

* fix: treat --max-items -1 as "process all files" in the email reader
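
The intended semantics, sketched with illustrative names:

```python
# --max-items -1 means "process everything"; a positive N keeps only the first N
def select_files(all_files: list[str], max_items: int) -> list[str]:
    return all_files if max_items == -1 else all_files[:max_items]
```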

* refactor: reorganize all examples/ and tests/

* refactor: reorganize examples and add link checker

* fix: add __init__.py

* fix: handle certificate errors in link checker

* fix wechat

* merge

* docs: update README to use proper module imports for apps

- Change from 'python apps/xxx.py' to 'python -m apps.xxx'
- More professional and Pythonic module invocation
- Ensures proper module resolution and imports
- Better separation between apps/ (production tools) and examples/ (demos)

---------

Co-authored-by: yichuan520030910320 <yichuan_wang@berkeley.edu>
Author: Andy Lee
Date: 2025-08-03 23:06:24 -07:00
Committed by: GitHub
Parent: 54df6310c5
Commit: 8899734952
50 changed files with 1293 additions and 3193 deletions

README.md (238 lines changed)

@@ -41,40 +41,40 @@ LEANN achieves this through *graph-based selective recomputation* with *high-deg
## Installation
<details>
<summary><strong>📦 Prerequisites: Install uv (if you don't have it)</strong></summary>
### 📦 Prerequisites: Install uv
Install uv first if you don't have it:
[Install uv](https://docs.astral.sh/uv/getting-started/installation/#installation-methods) first if you don't have it. Typically, you can install it with:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
📖 [Detailed uv installation methods →](https://docs.astral.sh/uv/getting-started/installation/#installation-methods)
### 🚀 Quick Install
</details>
LEANN provides two installation methods: **pip install** (quick and easy) and **build from source** (recommended for development).
### 🚀 Quick Install (Recommended for most users)
Clone the repository to access all examples and install LEANN from [PyPI](https://pypi.org/project/leann/) to run them immediately:
Clone the repository to access all examples and try amazing applications,
```bash
git clone git@github.com:yichuan-w/LEANN.git leann
git clone https://github.com/yichuan-w/LEANN.git leann
cd leann
```
and install LEANN from [PyPI](https://pypi.org/project/leann/) to run them immediately:
```bash
uv venv
source .venv/bin/activate
uv pip install leann
```
### 🔧 Build from Source (Recommended for development)
<details>
<summary>
<strong>🔧 Build from Source (Recommended for development)</strong>
</summary>
```bash
git clone git@github.com:yichuan-w/LEANN.git leann
git clone https://github.com/yichuan-w/LEANN.git leann
cd leann
git submodule update --init --recursive
```
@@ -91,14 +91,14 @@ sudo apt-get install libomp-dev libboost-all-dev protobuf-compiler libabsl-dev l
uv sync
```
</details>
## Quick Start
Our declarative API makes RAG as easy as writing a config file.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yichuan-w/LEANN/blob/main/demo.ipynb) [Try in this ipynb file →](demo.ipynb)
Check out [demo.ipynb](demo.ipynb) or [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yichuan-w/LEANN/blob/main/demo.ipynb)
```python
from leann import LeannBuilder, LeannSearcher, LeannChat
@@ -122,11 +122,11 @@ response = chat.ask("How much storage does LEANN save?", top_k=1)
## RAG on Everything!
LEANN supports RAG on various data sources including documents (.pdf, .txt, .md), Apple Mail, Google Search History, WeChat, and more.
LEANN supports RAG on various data sources including documents (`.pdf`, `.txt`, `.md`), Apple Mail, Google Search History, WeChat, and more.
### Generation Model Setup
> **Generation Model Setup**
> LEANN supports multiple LLM providers for text generation (OpenAI API, HuggingFace, Ollama).
LEANN supports multiple LLM providers for text generation (OpenAI API, HuggingFace, Ollama).
<details>
<summary><strong>🔑 OpenAI API Setup (Default)</strong></summary>
@@ -166,7 +166,49 @@ ollama pull llama3.2:1b
</details>
### 📄 Personal Data Manager: Process Any Documents (.pdf, .txt, .md)!
### Flexible Configuration
LEANN provides flexible parameters for embedding models, search strategies, and data processing to fit your specific needs.
<details>
<summary><strong>📋 Click to expand: Common Parameters (Available in All Examples)</strong></summary>
All RAG examples share these common parameters. **Interactive mode** is available in all examples - simply run without `--query` to start a continuous Q&A session where you can ask multiple questions. Type 'quit' to exit.
```bash
# Core Parameters (General preprocessing for all examples)
--index-dir DIR # Directory to store the index (default: current directory)
--query "YOUR QUESTION" # Single query mode; omit to chat interactively with your index (type 'quit' to exit)
--max-items N # Limit data preprocessing (default: -1, process all data)
--force-rebuild # Force rebuild index even if it exists
# Embedding Parameters
--embedding-model MODEL # e.g., facebook/contriever, text-embedding-3-small, or mlx-community/multilingual-e5-base-mlx
--embedding-mode MODE # sentence-transformers, openai, or mlx
# LLM Parameters (Text generation models)
--llm TYPE # LLM backend: openai, ollama, or hf (default: openai)
--llm-model MODEL # Model name (default: gpt-4o) e.g., gpt-4o-mini, llama3.2:1b, Qwen/Qwen2.5-1.5B-Instruct
# Search Parameters
--top-k N # Number of results to retrieve (default: 20)
--search-complexity N # Search complexity for graph traversal (default: 32)
# Chunking Parameters
--chunk-size N # Size of text chunks (default varies by source: 256 for most, 192 for WeChat)
--chunk-overlap N # Overlap between chunks (default varies: 25-128 depending on source)
# Index Building Parameters
--backend-name NAME # Backend to use: hnsw or diskann (default: hnsw)
--graph-degree N # Graph degree for index construction (default: 32)
--build-complexity N # Build complexity for index construction (default: 64)
--no-compact # Disable compact index storage (compact storage is enabled by default to save space)
--no-recompute # Disable embedding recomputation (recomputation is enabled by default to save storage)
```
</details>
### 📄 Personal Data Manager: Process Any Documents (`.pdf`, `.txt`, `.md`)!
Ask questions directly about your personal PDFs, documents, and any directory containing your files!
@@ -174,25 +216,29 @@ Ask questions directly about your personal PDFs, documents, and any directory co
<img src="videos/paper_clear.gif" alt="LEANN Document Search Demo" width="600">
</p>
The example below asks a question about summarizing two papers (uses default data in `examples/data`) and this is the easiest example to run here:
The example below asks a question summarizing our paper (it uses the default data in `data/`, a directory with diverse data sources: two papers, Pride and Prejudice, and a README in Chinese), and it is the **easiest example** to run here:
```bash
source .venv/bin/activate
python ./examples/main_cli_example.py
source .venv/bin/activate # Don't forget to activate the virtual environment
python -m apps.document_rag --query "What are the main techniques LEANN explores?"
```
<details>
<summary><strong>📋 Click to expand: User Configurable Arguments</strong></summary>
<summary><strong>📋 Click to expand: Document-Specific Arguments</strong></summary>
#### Parameters
```bash
# Use custom index directory
python examples/main_cli_example.py --index-dir "./my_custom_index"
--data-dir DIR # Directory containing documents to process (default: data)
--file-types .ext .ext # Filter by specific file types (optional - all LlamaIndex supported types if omitted)
```
# Use custom data directory
python examples/main_cli_example.py --data-dir "./my_documents"
#### Example Commands
```bash
# Process all documents with larger chunks for academic papers
python -m apps.document_rag --data-dir "~/Documents/Papers" --chunk-size 1024
# Ask a specific question
python examples/main_cli_example.py --query "What are the main findings in these papers?"
# Filter only markdown and Python files with smaller chunks
python -m apps.document_rag --data-dir "./docs" --chunk-size 256 --file-types .md .py
```
</details>
@@ -206,30 +252,29 @@ python examples/main_cli_example.py --query "What are the main findings in these
<img src="videos/mail_clear.gif" alt="LEANN Email Search Demo" width="600">
</p>
**Note:** You need to grant full disk access to your terminal/VS Code in System Preferences → Privacy & Security → Full Disk Access.
Before running the example below, you need to grant full disk access to your terminal/VS Code in System Preferences → Privacy & Security → Full Disk Access.
```bash
python examples/mail_reader_leann.py --query "What's the food I ordered by DoorDash or Uber Eats mostly?"
python -m apps.email_rag --query "What's the food I ordered by DoorDash or Uber Eats mostly?"
```
**780K email chunks → 78MB storage.** Finally, search your email like you search Google.
<details>
<summary><strong>📋 Click to expand: User Configurable Arguments</strong></summary>
<summary><strong>📋 Click to expand: Email-Specific Arguments</strong></summary>
#### Parameters
```bash
# Use default mail path (works for most macOS setups)
python examples/mail_reader_leann.py
--mail-path PATH # Path to specific mail directory (auto-detects if omitted)
--include-html # Include HTML content in processing (useful for newsletters)
```
# Run with custom index directory
python examples/mail_reader_leann.py --index-dir "./my_mail_index"
#### Example Commands
```bash
# Search work emails from a specific account
python -m apps.email_rag --mail-path "~/Library/Mail/V10/WORK_ACCOUNT"
# Process all emails (may take time but indexes everything)
python examples/mail_reader_leann.py --max-emails -1
# Limit number of emails processed (useful for testing)
python examples/mail_reader_leann.py --max-emails 1000
# Run a single query
python examples/mail_reader_leann.py --query "What did my boss say about deadlines?"
# Find all receipts and order confirmations (includes HTML)
python -m apps.email_rag --query "receipt order confirmation invoice" --include-html
```
</details>
@@ -250,25 +295,25 @@ Once the index is built, you can ask questions like:
</p>
```bash
python examples/google_history_reader_leann.py --query "Tell me my browser history about machine learning?"
python -m apps.browser_rag --query "Tell me my browser history about machine learning?"
```
**38K browser entries → 6MB storage.** Your browser history becomes your personal search engine.
<details>
<summary><strong>📋 Click to expand: User Configurable Arguments</strong></summary>
<summary><strong>📋 Click to expand: Browser-Specific Arguments</strong></summary>
#### Parameters
```bash
# Use default Chrome profile (auto-finds all profiles)
python examples/google_history_reader_leann.py
--chrome-profile PATH # Path to Chrome profile directory (auto-detects if omitted)
```
# Run with custom index directory
python examples/google_history_reader_leann.py --index-dir "./my_chrome_index"
#### Example Commands
```bash
# Search academic research from your browsing history
python -m apps.browser_rag --query "arxiv papers machine learning transformer architecture"
# Limit number of history entries processed (useful for testing)
python examples/google_history_reader_leann.py --max-entries 500
# Run a single query
python examples/google_history_reader_leann.py --query "What websites did I visit about machine learning?"
# Track competitor analysis across work profile
python -m apps.browser_rag --chrome-profile "~/Library/Application Support/Google/Chrome/Work Profile" --max-items 5000
```
</details>
@@ -308,7 +353,7 @@ Once the index is built, you can ask questions like:
</p>
```bash
python examples/wechat_history_reader_leann.py --query "Show me all group chats about weekend plans"
python -m apps.wechat_rag --query "Show me all group chats about weekend plans"
```
**400K messages → 64MB storage.** Search years of chat history in any language.
@@ -316,7 +361,13 @@ python examples/wechat_history_reader_leann.py --query "Show me all group chats
<details>
<summary><strong>🔧 Click to expand: Installation Requirements</strong></summary>
First, you need to install the WeChat exporter:
First, you need to install the [WeChat exporter](https://github.com/sunnyyoung/WeChatTweak-CLI),
```bash
brew install sunnyyoung/repo/wechattweak-cli
```
or install it manually (if you have issues with Homebrew):
```bash
sudo packages/wechat-exporter/wechattweak-cli install
@@ -325,30 +376,28 @@ sudo packages/wechat-exporter/wechattweak-cli install
**Troubleshooting:**
- **Installation issues**: Check the [WeChatTweak-CLI issues page](https://github.com/sunnyyoung/WeChatTweak-CLI/issues/41)
- **Export errors**: If you encounter the error below, try restarting WeChat
```
Failed to export WeChat data. Please ensure WeChat is running and WeChatTweak is installed.
Failed to find or export WeChat data. Exiting.
```
```bash
Failed to export WeChat data. Please ensure WeChat is running and WeChatTweak is installed.
Failed to find or export WeChat data. Exiting.
```
</details>
<details>
<summary><strong>📋 Click to expand: User Configurable Arguments</strong></summary>
<summary><strong>📋 Click to expand: WeChat-Specific Arguments</strong></summary>
#### Parameters
```bash
# Use default settings (recommended for first run)
python examples/wechat_history_reader_leann.py
--export-dir DIR # Directory to store exported WeChat data (default: wechat_export_direct)
--force-export # Force re-export even if data exists
```
# Run with a custom export directory; on the first run, LEANN automatically exports all chat history for you
python examples/wechat_history_reader_leann.py --export-dir "./my_wechat_exports"
#### Example Commands
```bash
# Search for travel plans discussed in group chats
python -m apps.wechat_rag --query "travel plans" --max-items 10000
# Run with custom index directory
python examples/wechat_history_reader_leann.py --index-dir "./my_wechat_index"
# Limit number of chat entries processed (useful for testing)
python examples/wechat_history_reader_leann.py --max-entries 1000
# Run a single query
python examples/wechat_history_reader_leann.py --query "Show me conversations about travel plans"
# Re-export and search recent chats (useful after new messages)
python -m apps.wechat_rag --force-export --query "work schedule"
```
</details>
@@ -368,6 +417,27 @@ Once the index is built, you can ask questions like:
LEANN includes a powerful CLI for document processing and search. Perfect for quick document indexing and interactive chat.
### Installation
If you followed the Quick Start, `leann` is already installed in your virtual environment:
```bash
source .venv/bin/activate
leann --help
```
**To make it globally available (recommended for daily use):**
```bash
# Install the LEANN CLI globally using uv tool
uv tool install leann
# Now you can use leann from anywhere without activating venv
leann --help
```
### Usage Examples
```bash
# Build an index from documents
leann build my-docs --docs ./documents
@@ -449,8 +519,8 @@ Options:
## Benchmarks
📊 **[Simple Example: Compare LEANN vs FAISS →](examples/compare_faiss_vs_leann.py)**
### Storage Comparison
**[Simple Example: Compare LEANN vs FAISS →](benchmarks/compare_faiss_vs_leann.py)**
### 📊 Storage Comparison
| System | DPR (2.1M) | Wiki (60M) | Chat (400K) | Email (780K) | Browser (38K) |
|--------|-------------|------------|-------------|--------------|---------------|
@@ -464,8 +534,8 @@ Options:
```bash
uv pip install -e ".[dev]" # Install dev dependencies
python examples/run_evaluation.py data/indices/dpr/dpr_diskann # DPR dataset
python examples/run_evaluation.py data/indices/rpj_wiki/rpj_wiki.index # Wikipedia
python benchmarks/run_evaluation.py data/indices/dpr/dpr_diskann # DPR dataset
python benchmarks/run_evaluation.py data/indices/rpj_wiki/rpj_wiki.index # Wikipedia
```
The evaluation script downloads data automatically on first run. The last three results were tested with partial personal data, and you can reproduce them with your own data!
@@ -503,7 +573,9 @@ MIT License - see [LICENSE](LICENSE) for details.
## 🙏 Acknowledgments
This work is done at [**Berkeley Sky Computing Lab**](https://sky.cs.berkeley.edu/)
This work is done at [**Berkeley Sky Computing Lab**](https://sky.cs.berkeley.edu/).
---
<p align="center">