refactor: Unify examples interface with BaseRAGExample (#12)

* refactor: Unify examples interface with BaseRAGExample

- Create BaseRAGExample base class for all RAG examples
- Refactor 4 examples to use unified interface:
  - document_rag.py (replaces main_cli_example.py)
  - email_rag.py (replaces mail_reader_leann.py)
  - browser_rag.py (replaces google_history_reader_leann.py)
  - wechat_rag.py (replaces wechat_history_reader_leann.py)
- Maintain 100% parameter compatibility with original files
- Add interactive mode support for all examples
- Unify parameter names (--max-items replaces --max-emails/--max-entries)
- Update README.md with new examples usage
- Add PARAMETER_CONSISTENCY.md documenting all parameter mappings
- Keep main_cli_example.py for backward compatibility with migration notice

All default values, LeannBuilder parameters, and chunking settings
remain identical to ensure full compatibility with existing indexes.
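
For illustration, a minimal sketch of the shared interface (class, method, and parameter names here are assumptions for explanation, not the actual implementation):

```python
import argparse


class BaseRAGExample:
    """Hypothetical sketch: shared CLI parsing and chat loop for all RAG examples."""

    def __init__(self, name: str, default_chunk_size: int = 256):
        self.parser = argparse.ArgumentParser(description=f"{name} RAG example")
        # Common parameters shared by every example
        self.parser.add_argument("--index-dir", default=".")
        self.parser.add_argument("--query", default=None)
        self.parser.add_argument("--max-items", type=int, default=-1)
        self.parser.add_argument("--chunk-size", type=int, default=default_chunk_size)

    def load_data(self, args: argparse.Namespace) -> list[str]:
        raise NotImplementedError  # each example supplies its own data loader

    def ask(self, query: str) -> str:
        raise NotImplementedError  # build/search the index and generate an answer

    def run(self) -> None:
        args = self.parser.parse_args()
        self.load_data(args)
        if args.query:  # single-query mode
            print(self.ask(args.query))
        else:  # interactive mode: type 'quit' to exit
            while True:
                question = input("You: ").strip()
                if question.lower() == "quit":
                    break
                print(self.ask(question))
```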

* fix: Update CI tests for new unified examples interface

- Rename test_main_cli.py to test_document_rag.py
- Update all references from main_cli_example.py to document_rag.py
- Update tests/README.md documentation

The tests now properly test the new unified interface while maintaining
the same test coverage and functionality.

* fix: Fix pre-commit issues and update tests

- Fix import sorting and unused imports
- Update type annotations to use built-in types (list, dict) instead of typing.List/Dict
- Fix trailing whitespace and end-of-file issues
- Fix Chinese fullwidth comma to regular comma
- Update test_main_cli.py to test_document_rag.py
- Add backward compatibility test for main_cli_example.py
- Pass all pre-commit hooks (ruff, ruff-format, etc.)
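
As an aside, the type-annotation modernization (PEP 585 built-in generics) amounts to the following (function name illustrative):

```python
# Before: typing module generics
from typing import Dict, List

def load_chunks(paths: List[str]) -> Dict[str, str]: ...

# After: built-in generics (Python 3.9+)
def load_chunks(paths: list[str]) -> dict[str, str]: ...
```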

* refactor: Remove old example scripts and migration references

- Delete old example scripts (mail_reader_leann.py, google_history_reader_leann.py, etc.)
- Remove migration hints and backward compatibility
- Update tests to use new unified examples directly
- Clean up all references to old script names
- Users now only see the new unified interface

* fix: Restore embedding-mode parameter to all examples

- All examples now have --embedding-mode parameter (unified interface benefit)
- Default is 'sentence-transformers' (consistent with original behavior)
- Users can now use OpenAI or MLX embeddings with any data source
- Maintains functional equivalence with original scripts

* docs: Improve parameter categorization in README

- Clearly separate core (shared) vs specific parameters
- Move LLM and embedding examples to 'Example Commands' section
- Add descriptive comments for all specific parameters
- Keep only truly data-source-specific parameters in specific sections

* docs: Make example commands more representative

- Add default values to parameter descriptions
- Replace generic examples with real-world use cases
- Focus on data-source-specific features in examples
- Remove redundant demonstrations of common parameters

* docs: Reorganize parameter documentation structure

- Move common parameters to a dedicated section before all examples
- Rename sections to 'X-Specific Arguments' for clarity
- Remove duplicate common parameters from individual examples
- Better information architecture for users

* docs: polish applications

* docs: Add CLI installation instructions

- Add two installation options: venv and global uv tool
- Clearly explain when to use each option
- Make CLI more accessible for daily use

* docs: Clarify CLI global installation process

- Explain the transition from venv to global installation
- Add upgrade command for global installation
- Make it clear that global install allows usage without venv activation

* docs: Add collapsible section for CLI installation

- Wrap CLI installation instructions in details/summary tags
- Keep consistent with other collapsible sections in README
- Improve document readability and navigation

* style: format

* docs: Fix collapsible sections

- Make Common Parameters collapsible (as it's lengthy reference material)
- Keep CLI Installation visible (important for users to see immediately)
- Better information hierarchy

* docs: Add introduction for Common Parameters section

- Add 'Flexible Configuration' heading with descriptive sentence
- Create parallel structure with 'Generation Model Setup' section
- Improve document flow and readability

* docs: nit

* fix: Fix issues in unified examples

- Add smart path detection for data directory
- Fix add_texts -> add_text method call
- Handle both running from project root and examples directory

* fix: Fix async/await and add_text issues in unified examples

- Remove incorrect await from chat.ask() calls (not async)
- Fix add_texts -> add_text method calls
- Verify search-complexity correctly maps to efSearch parameter
- All examples now run successfully
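
The corrected call pattern, in a hedged sketch (the index path and constructor argument are assumptions; the synchronous `chat.ask()` usage matches the README quick-start):

```python
from leann import LeannChat

chat = LeannChat("./my_index")  # index path is illustrative

# Incorrect (the removed bug): chat.ask() is not a coroutine, so
# `await chat.ask(...)` fails outside an async context.
# response = await chat.ask("What are the main findings?", top_k=20)

# Correct: a plain synchronous call.
response = chat.ask("What are the main findings?", top_k=20)
print(response)
```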

* feat: Address review comments

- Add complexity parameter to LeannChat initialization (default: search_complexity)
- Fix chunk-size default in README documentation (256, not 2048)
- Add more index building parameters as CLI arguments:
  - --backend-name (hnsw/diskann)
  - --graph-degree (default: 32)
  - --build-complexity (default: 64)
  - --no-compact (disable compact storage)
  - --no-recompute (disable embedding recomputation)
- Update README to document all new parameters
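
A short sketch of the new knob (the keyword name follows this commit note; the positional index-path argument is an assumption):

```python
from leann import LeannChat

# complexity defaults to the --search-complexity value (32)
chat = LeannChat("./my_index", complexity=64)
response = chat.ask("Summarize the papers", top_k=20)
```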

* feat: Add chunk-size parameters and improve file type filtering

- Add --chunk-size and --chunk-overlap parameters to all RAG examples
- Preserve original default values for each data source:
  - Document: 256/128 (optimized for general documents)
  - Email: 256/25 (smaller overlap for email threads)
  - Browser: 256/128 (standard for web content)
  - WeChat: 192/64 (smaller chunks for chat messages)
- Make --file-types optional filter instead of restriction in document_rag
- Update README to clarify interactive mode and parameter usage
- Fix LLM default model documentation (gpt-4o, not gpt-4o-mini)

* feat: Update documentation based on review feedback

- Add MLX embedding example to README
- Clarify examples/data content description (two papers, Pride and Prejudice, Chinese README)
- Move chunk parameters to common parameters section
- Remove duplicate chunk parameters from document-specific section

* docs: Emphasize diverse data sources in examples/data description

* fix: update default embedding models for better performance

- Change WeChat, Browser, and Email RAG examples to use all-MiniLM-L6-v2
- Previous Qwen/Qwen3-Embedding-0.6B was too slow for these use cases
- all-MiniLM-L6-v2 is a fast 384-dim model, ideal for large-scale personal data

* feat: add response highlighting

* refactor: change index rebuild logic

* fix some examples

* feat: check whether top-k is larger than the number of indexed documents
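
A guard along these lines (function and variable names are illustrative, not the actual code):

```python
def effective_top_k(requested_k: int, num_docs: int) -> int:
    """Clamp top-k so a search never requests more results than the index holds."""
    if requested_k > num_docs:
        print(f"Warning: top_k={requested_k} > {num_docs} indexed docs; using {num_docs}.")
        return num_docs
    return requested_k
```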

* fix: WeChat history reader bugs and refactor wechat_rag to use unified architecture

* fix: treat --max-items -1 as "process all files" in the email reader
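
The intended semantics, sketched with illustrative names:

```python
# --max-items -1 means "process everything"; a positive N keeps only the first N
def select_files(all_files: list[str], max_items: int) -> list[str]:
    return all_files if max_items == -1 else all_files[:max_items]
```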

* refactor: reorganize all examples/ and tests/

* refactor: reorganize examples and add link checker

* fix: add __init__.py

* fix: handle certificate errors in link checker

* fix wechat

* merge

* docs: update README to use proper module imports for apps

- Change from 'python apps/xxx.py' to 'python -m apps.xxx'
- More professional and Pythonic module invocation
- Ensures proper module resolution and imports
- Better separation between apps/ (production tools) and examples/ (demos)

---------

Co-authored-by: yichuan520030910320 <yichuan_wang@berkeley.edu>
Author: Andy Lee
Date: 2025-08-03 23:06:24 -07:00
Committed by: GitHub
Parent: 54df6310c5
Commit: 8899734952
50 changed files with 1293 additions and 3193 deletions

README.md (238 lines changed)

@@ -41,40 +41,40 @@ LEANN achieves this through *graph-based selective recomputation* with *high-deg
## Installation
<details>
<summary><strong>📦 Prerequisites: Install uv (if you don't have it)</strong></summary>
### 📦 Prerequisites: Install uv
Install uv first if you don't have it:
[Install uv](https://docs.astral.sh/uv/getting-started/installation/#installation-methods) first if you don't have it. Typically, you can install it with:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
📖 [Detailed uv installation methods →](https://docs.astral.sh/uv/getting-started/installation/#installation-methods)
### 🚀 Quick Install
</details>
LEANN provides two installation methods: **pip install** (quick and easy) and **build from source** (recommended for development).
### 🚀 Quick Install (Recommended for most users)
Clone the repository to access all examples and install LEANN from [PyPI](https://pypi.org/project/leann/) to run them immediately:
Clone the repository to access all examples and try amazing applications,
```bash
git clone git@github.com:yichuan-w/LEANN.git leann
git clone https://github.com/yichuan-w/LEANN.git leann
cd leann
```
and install LEANN from [PyPI](https://pypi.org/project/leann/) to run them immediately:
```bash
uv venv
source .venv/bin/activate
uv pip install leann
```
### 🔧 Build from Source (Recommended for development)
<details>
<summary>
<strong>🔧 Build from Source (Recommended for development)</strong>
</summary>
```bash
git clone git@github.com:yichuan-w/LEANN.git leann
git clone https://github.com/yichuan-w/LEANN.git leann
cd leann
git submodule update --init --recursive
```
@@ -91,14 +91,14 @@ sudo apt-get install libomp-dev libboost-all-dev protobuf-compiler libabsl-dev l
uv sync
```
</details>
## Quick Start
Our declarative API makes RAG as easy as writing a config file.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yichuan-w/LEANN/blob/main/demo.ipynb) [Try in this ipynb file →](demo.ipynb)
Check out [demo.ipynb](demo.ipynb) or [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yichuan-w/LEANN/blob/main/demo.ipynb)
```python
from leann import LeannBuilder, LeannSearcher, LeannChat
@@ -122,11 +122,11 @@ response = chat.ask("How much storage does LEANN save?", top_k=1)
## RAG on Everything!
LEANN supports RAG on various data sources including documents (.pdf, .txt, .md), Apple Mail, Google Search History, WeChat, and more.
LEANN supports RAG on various data sources including documents (`.pdf`, `.txt`, `.md`), Apple Mail, Google Search History, WeChat, and more.
### Generation Model Setup
> **Generation Model Setup**
> LEANN supports multiple LLM providers for text generation (OpenAI API, HuggingFace, Ollama).
LEANN supports multiple LLM providers for text generation (OpenAI API, HuggingFace, Ollama).
<details>
<summary><strong>🔑 OpenAI API Setup (Default)</strong></summary>
@@ -166,7 +166,49 @@ ollama pull llama3.2:1b
</details>
### 📄 Personal Data Manager: Process Any Documents (.pdf, .txt, .md)!
### Flexible Configuration
LEANN provides flexible parameters for embedding models, search strategies, and data processing to fit your specific needs.
<details>
<summary><strong>📋 Click to expand: Common Parameters (Available in All Examples)</strong></summary>
All RAG examples share these common parameters. **Interactive mode** is available in all examples - simply run without `--query` to start a continuous Q&A session where you can ask multiple questions. Type 'quit' to exit.
```bash
# Core Parameters (General preprocessing for all examples)
--index-dir DIR # Directory to store the index (default: current directory)
--query "YOUR QUESTION" # Single query mode; omit to chat interactively with your index (type 'quit' to exit)
--max-items N # Limit data preprocessing (default: -1, process all data)
--force-rebuild # Force rebuild index even if it exists
# Embedding Parameters
--embedding-model MODEL # e.g., facebook/contriever, text-embedding-3-small, or mlx-community/multilingual-e5-base-mlx
--embedding-mode MODE # sentence-transformers, openai, or mlx
# LLM Parameters (Text generation models)
--llm TYPE # LLM backend: openai, ollama, or hf (default: openai)
--llm-model MODEL # Model name (default: gpt-4o) e.g., gpt-4o-mini, llama3.2:1b, Qwen/Qwen2.5-1.5B-Instruct
# Search Parameters
--top-k N # Number of results to retrieve (default: 20)
--search-complexity N # Search complexity for graph traversal (default: 32)
# Chunking Parameters
--chunk-size N # Size of text chunks (default varies by source: 256 for most, 192 for WeChat)
--chunk-overlap N # Overlap between chunks (default varies: 25-128 depending on source)
# Index Building Parameters
--backend-name NAME # Backend to use: hnsw or diskann (default: hnsw)
--graph-degree N # Graph degree for index construction (default: 32)
--build-complexity N # Build complexity for index construction (default: 64)
--no-compact # Disable compact index storage (compact storage is enabled by default to save space)
--no-recompute # Disable embedding recomputation (recomputation is enabled by default to save storage)
```
</details>
### 📄 Personal Data Manager: Process Any Documents (`.pdf`, `.txt`, `.md`)!
Ask questions directly about your personal PDFs, documents, and any directory containing your files!
@@ -174,25 +216,29 @@ Ask questions directly about your personal PDFs, documents, and any directory co
<img src="videos/paper_clear.gif" alt="LEANN Document Search Demo" width="600">
</p>
The example below asks a question about summarizing two papers (uses default data in `examples/data`) and this is the easiest example to run here:
The example below asks a question summarizing our paper (it uses the default data in `data/`, a directory with diverse data sources: two papers, Pride and Prejudice, and a README in Chinese), and it is the **easiest example** to run here:
```bash
source .venv/bin/activate
python ./examples/main_cli_example.py
source .venv/bin/activate # Don't forget to activate the virtual environment
python -m apps.document_rag --query "What are the main techniques LEANN explores?"
```
<details>
<summary><strong>📋 Click to expand: User Configurable Arguments</strong></summary>
<summary><strong>📋 Click to expand: Document-Specific Arguments</strong></summary>
#### Parameters
```bash
# Use custom index directory
python examples/main_cli_example.py --index-dir "./my_custom_index"
--data-dir DIR # Directory containing documents to process (default: data)
--file-types .ext .ext # Filter by specific file types (optional - all LlamaIndex supported types if omitted)
```
# Use custom data directory
python examples/main_cli_example.py --data-dir "./my_documents"
#### Example Commands
```bash
# Process all documents with larger chunks for academic papers
python -m apps.document_rag --data-dir "~/Documents/Papers" --chunk-size 1024
# Ask a specific question
python examples/main_cli_example.py --query "What are the main findings in these papers?"
# Filter only markdown and Python files with smaller chunks
python -m apps.document_rag --data-dir "./docs" --chunk-size 256 --file-types .md .py
```
</details>
@@ -206,30 +252,29 @@ python examples/main_cli_example.py --query "What are the main findings in these
<img src="videos/mail_clear.gif" alt="LEANN Email Search Demo" width="600">
</p>
**Note:** You need to grant full disk access to your terminal/VS Code in System Preferences → Privacy & Security → Full Disk Access.
Before running the example below, you need to grant full disk access to your terminal/VS Code in System Preferences → Privacy & Security → Full Disk Access.
```bash
python examples/mail_reader_leann.py --query "What's the food I ordered by DoorDash or Uber Eats mostly?"
python -m apps.email_rag --query "What's the food I ordered by DoorDash or Uber Eats mostly?"
```
**780K email chunks → 78MB storage.** Finally, search your email like you search Google.
<details>
<summary><strong>📋 Click to expand: User Configurable Arguments</strong></summary>
<summary><strong>📋 Click to expand: Email-Specific Arguments</strong></summary>
#### Parameters
```bash
# Use default mail path (works for most macOS setups)
python examples/mail_reader_leann.py
--mail-path PATH # Path to specific mail directory (auto-detects if omitted)
--include-html # Include HTML content in processing (useful for newsletters)
```
# Run with custom index directory
python examples/mail_reader_leann.py --index-dir "./my_mail_index"
#### Example Commands
```bash
# Search work emails from a specific account
python -m apps.email_rag --mail-path "~/Library/Mail/V10/WORK_ACCOUNT"
# Process all emails (may take time but indexes everything)
python examples/mail_reader_leann.py --max-emails -1
# Limit number of emails processed (useful for testing)
python examples/mail_reader_leann.py --max-emails 1000
# Run a single query
python examples/mail_reader_leann.py --query "What did my boss say about deadlines?"
# Find all receipts and order confirmations (includes HTML)
python -m apps.email_rag --query "receipt order confirmation invoice" --include-html
```
</details>
@@ -250,25 +295,25 @@ Once the index is built, you can ask questions like:
</p>
```bash
python examples/google_history_reader_leann.py --query "Tell me my browser history about machine learning?"
python -m apps.browser_rag --query "Tell me my browser history about machine learning?"
```
**38K browser entries → 6MB storage.** Your browser history becomes your personal search engine.
<details>
<summary><strong>📋 Click to expand: User Configurable Arguments</strong></summary>
<summary><strong>📋 Click to expand: Browser-Specific Arguments</strong></summary>
#### Parameters
```bash
# Use default Chrome profile (auto-finds all profiles)
python examples/google_history_reader_leann.py
--chrome-profile PATH # Path to Chrome profile directory (auto-detects if omitted)
```
# Run with custom index directory
python examples/google_history_reader_leann.py --index-dir "./my_chrome_index"
#### Example Commands
```bash
# Search academic research from your browsing history
python -m apps.browser_rag --query "arxiv papers machine learning transformer architecture"
# Limit number of history entries processed (useful for testing)
python examples/google_history_reader_leann.py --max-entries 500
# Run a single query
python examples/google_history_reader_leann.py --query "What websites did I visit about machine learning?"
# Track competitor analysis across work profile
python -m apps.browser_rag --chrome-profile "~/Library/Application Support/Google/Chrome/Work Profile" --max-items 5000
```
</details>
@@ -308,7 +353,7 @@ Once the index is built, you can ask questions like:
</p>
```bash
python examples/wechat_history_reader_leann.py --query "Show me all group chats about weekend plans"
python -m apps.wechat_rag --query "Show me all group chats about weekend plans"
```
**400K messages → 64MB storage.** Search years of chat history in any language.
@@ -316,7 +361,13 @@ python examples/wechat_history_reader_leann.py --query "Show me all group chats
<details>
<summary><strong>🔧 Click to expand: Installation Requirements</strong></summary>
First, you need to install the WeChat exporter:
First, you need to install the [WeChat exporter](https://github.com/sunnyyoung/WeChatTweak-CLI),
```bash
brew install sunnyyoung/repo/wechattweak-cli
```
or install it manually (if you have issues with Homebrew):
```bash
sudo packages/wechat-exporter/wechattweak-cli install
@@ -325,30 +376,28 @@ sudo packages/wechat-exporter/wechattweak-cli install
**Troubleshooting:**
- **Installation issues**: Check the [WeChatTweak-CLI issues page](https://github.com/sunnyyoung/WeChatTweak-CLI/issues/41)
- **Export errors**: If you encounter the error below, try restarting WeChat
```
Failed to export WeChat data. Please ensure WeChat is running and WeChatTweak is installed.
Failed to find or export WeChat data. Exiting.
```
```bash
Failed to export WeChat data. Please ensure WeChat is running and WeChatTweak is installed.
Failed to find or export WeChat data. Exiting.
```
</details>
<details>
<summary><strong>📋 Click to expand: User Configurable Arguments</strong></summary>
<summary><strong>📋 Click to expand: WeChat-Specific Arguments</strong></summary>
#### Parameters
```bash
# Use default settings (recommended for first run)
python examples/wechat_history_reader_leann.py
--export-dir DIR # Directory to store exported WeChat data (default: wechat_export_direct)
--force-export # Force re-export even if data exists
```
# Run with a custom export directory; on the first run, LEANN automatically exports all chat history for you
python examples/wechat_history_reader_leann.py --export-dir "./my_wechat_exports"
#### Example Commands
```bash
# Search for travel plans discussed in group chats
python -m apps.wechat_rag --query "travel plans" --max-items 10000
# Run with custom index directory
python examples/wechat_history_reader_leann.py --index-dir "./my_wechat_index"
# Limit number of chat entries processed (useful for testing)
python examples/wechat_history_reader_leann.py --max-entries 1000
# Run a single query
python examples/wechat_history_reader_leann.py --query "Show me conversations about travel plans"
# Re-export and search recent chats (useful after new messages)
python -m apps.wechat_rag --force-export --query "work schedule"
```
</details>
@@ -368,6 +417,27 @@ Once the index is built, you can ask questions like:
LEANN includes a powerful CLI for document processing and search. Perfect for quick document indexing and interactive chat.
### Installation
If you followed the Quick Start, `leann` is already installed in your virtual environment:
```bash
source .venv/bin/activate
leann --help
```
**To make it globally available (recommended for daily use):**
```bash
# Install the LEANN CLI globally using uv tool
uv tool install leann
# Now you can use leann from anywhere without activating venv
leann --help
```
### Usage Examples
```bash
# Build an index from documents
leann build my-docs --docs ./documents
@@ -449,8 +519,8 @@ Options:
## Benchmarks
📊 **[Simple Example: Compare LEANN vs FAISS →](examples/compare_faiss_vs_leann.py)**
### Storage Comparison
**[Simple Example: Compare LEANN vs FAISS →](benchmarks/compare_faiss_vs_leann.py)**
### 📊 Storage Comparison
| System | DPR (2.1M) | Wiki (60M) | Chat (400K) | Email (780K) | Browser (38K) |
|--------|-------------|------------|-------------|--------------|---------------|
@@ -464,8 +534,8 @@ Options:
```bash
uv pip install -e ".[dev]" # Install dev dependencies
python examples/run_evaluation.py data/indices/dpr/dpr_diskann # DPR dataset
python examples/run_evaluation.py data/indices/rpj_wiki/rpj_wiki.index # Wikipedia
python benchmarks/run_evaluation.py data/indices/dpr/dpr_diskann # DPR dataset
python benchmarks/run_evaluation.py data/indices/rpj_wiki/rpj_wiki.index # Wikipedia
```
The evaluation script downloads data automatically on first run. The last three results were tested with partial personal data, and you can reproduce them with your own data!
@@ -503,7 +573,9 @@ MIT License - see [LICENSE](LICENSE) for details.
## 🙏 Acknowledgments
This work is done at [**Berkeley Sky Computing Lab**](https://sky.cs.berkeley.edu/)
This work is done at [**Berkeley Sky Computing Lab**](https://sky.cs.berkeley.edu/).
---
<p align="center">