LEANN/tests/test_document_rag.py
Andy Lee 8899734952 refactor: Unify examples interface with BaseRAGExample (#12)
* refactor: Unify examples interface with BaseRAGExample

- Create BaseRAGExample base class for all RAG examples
- Refactor 4 examples to use unified interface:
  - document_rag.py (replaces main_cli_example.py)
  - email_rag.py (replaces mail_reader_leann.py)
  - browser_rag.py (replaces google_history_reader_leann.py)
  - wechat_rag.py (replaces wechat_history_reader_leann.py)
- Maintain 100% parameter compatibility with original files
- Add interactive mode support for all examples
- Unify parameter names (--max-items replaces --max-emails/--max-entries)
- Update README.md with new examples usage
- Add PARAMETER_CONSISTENCY.md documenting all parameter mappings
- Keep main_cli_example.py for backward compatibility with migration notice

All default values, LeannBuilder parameters, and chunking settings
remain identical to ensure full compatibility with existing indexes.
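
For orientation, below is a minimal sketch of the kind of shared skeleton this commit describes. BaseRAGExample is the class name from this PR, but the method names, argparse wiring, and defaults shown here are illustrative assumptions, not the actual LEANN code.

import argparse
from abc import ABC, abstractmethod


class BaseRAGExample(ABC):
    """Hypothetical shared CLI skeleton for the four RAG examples."""

    name = "document"  # subclasses would use "email", "browser", "wechat"

    def build_parser(self) -> argparse.ArgumentParser:
        # Shared flags taken from this commit message; defaults are illustrative.
        parser = argparse.ArgumentParser(description=f"{self.name} RAG example")
        parser.add_argument("--index-dir", default=f"./{self.name}_index")
        parser.add_argument("--data-dir", default="./data")
        parser.add_argument(
            "--embedding-mode",
            default="sentence-transformers",
            choices=["sentence-transformers", "openai", "mlx"],
        )
        parser.add_argument(
            "--max-items",
            type=int,
            default=-1,
            help="Unified name replacing --max-emails/--max-entries; -1 means all",
        )
        parser.add_argument("--query", help="Single query; omit for interactive mode")
        return parser

    @abstractmethod
    def load_data(self, args: argparse.Namespace) -> list[str]:
        """Each data source (documents, email, browser history, WeChat) implements this."""

    def run(self) -> None:
        args = self.build_parser().parse_args()
        texts = self.load_data(args)
        print(f"Loaded {len(texts)} text chunks")
        # ...build or load the LEANN index here, then answer args.query
        # or start an interactive chat loop (omitted in this sketch).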

* fix: Update CI tests for new unified examples interface

- Rename test_main_cli.py to test_document_rag.py
- Update all references from main_cli_example.py to document_rag.py
- Update tests/README.md documentation

The tests now properly test the new unified interface while maintaining
the same test coverage and functionality.

* fix: Fix pre-commit issues and update tests

- Fix import sorting and unused imports
- Update type annotations to use built-in types (list, dict) instead of typing.List/Dict
- Fix trailing whitespace and end-of-file issues
- Replace Chinese fullwidth commas with regular commas
- Update test_main_cli.py to test_document_rag.py
- Add backward compatibility test for main_cli_example.py
- Pass all pre-commit hooks (ruff, ruff-format, etc.)

* refactor: Remove old example scripts and migration references

- Delete old example scripts (mail_reader_leann.py, google_history_reader_leann.py, etc.)
- Remove migration hints and backward compatibility
- Update tests to use new unified examples directly
- Clean up all references to old script names
- Users now only see the new unified interface

* fix: Restore embedding-mode parameter to all examples

- All examples now have --embedding-mode parameter (unified interface benefit)
- Default is 'sentence-transformers' (consistent with original behavior)
- Users can now use OpenAI or MLX embeddings with any data source
- Maintains functional equivalence with original scripts

* docs: Improve parameter categorization in README

- Clearly separate core (shared) vs specific parameters
- Move LLM and embedding examples to 'Example Commands' section
- Add descriptive comments for all specific parameters
- Keep only truly data-source-specific parameters in specific sections

* docs: Make example commands more representative

- Add default values to parameter descriptions
- Replace generic examples with real-world use cases
- Focus on data-source-specific features in examples
- Remove redundant demonstrations of common parameters

* docs: Reorganize parameter documentation structure

- Move common parameters to a dedicated section before all examples
- Rename sections to 'X-Specific Arguments' for clarity
- Remove duplicate common parameters from individual examples
- Better information architecture for users

* docs: polish applications

* docs: Add CLI installation instructions

- Add two installation options: venv and global uv tool
- Clearly explain when to use each option
- Make CLI more accessible for daily use

* docs: Clarify CLI global installation process

- Explain the transition from venv to global installation
- Add upgrade command for global installation
- Make it clear that global install allows usage without venv activation

* docs: Add collapsible section for CLI installation

- Wrap CLI installation instructions in details/summary tags
- Keep consistent with other collapsible sections in README
- Improve document readability and navigation

* style: format

* docs: Fix collapsible sections

- Make Common Parameters collapsible (as it's lengthy reference material)
- Keep CLI Installation visible (important for users to see immediately)
- Better information hierarchy

* docs: Add introduction for Common Parameters section

- Add 'Flexible Configuration' heading with descriptive sentence
- Create parallel structure with 'Generation Model Setup' section
- Improve document flow and readability

* docs: nit

* fix: Fix issues in unified examples

- Add smart path detection for data directory
- Fix add_texts -> add_text method call
- Handle both running from project root and examples directory

* fix: Fix async/await and add_text issues in unified examples

- Remove incorrect await from chat.ask() calls (not async)
- Fix add_texts -> add_text method calls
- Verify search-complexity correctly maps to efSearch parameter
- All examples now run successfully
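
A minimal sketch of the corrected call pattern follows. The class names and the add_text / ask methods come from this commit history; the constructor arguments and the build_index call are assumptions about the API, not verbatim project code.

def build_and_ask(chunks: list[str], index_path: str, question: str) -> str:
    # Assumes the leann package is installed; kwargs below are guesses.
    from leann import LeannBuilder, LeannChat

    builder = LeannBuilder(backend_name="hnsw")
    for chunk in chunks:
        builder.add_text(chunk)  # add_text (singular), not add_texts
    builder.build_index(index_path)  # method name assumed

    chat = LeannChat(index_path)
    return chat.ask(question)  # ask() is synchronous, so no await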

* feat: Address review comments

- Add complexity parameter to LeannChat initialization (default: search_complexity)
- Fix chunk-size default in README documentation (256, not 2048)
- Add more index building parameters as CLI arguments:
  - --backend-name (hnsw/diskann)
  - --graph-degree (default: 32)
  - --build-complexity (default: 64)
  - --no-compact (disable compact storage)
  - --no-recompute (disable embedding recomputation)
- Update README to document all new parameters
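
A rough argparse sketch of the flags listed above, using the quoted defaults. This is an illustration rather than the actual apps/ source, and the --backend-name default is an assumption.

import argparse

parser = argparse.ArgumentParser(description="Index-building options (sketch)")
parser.add_argument("--backend-name", choices=["hnsw", "diskann"], default="hnsw")  # default assumed
parser.add_argument("--graph-degree", type=int, default=32)
parser.add_argument("--build-complexity", type=int, default=64)
parser.add_argument("--no-compact", action="store_true", help="Disable compact storage")
parser.add_argument("--no-recompute", action="store_true", help="Disable embedding recomputation")
args = parser.parse_args()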

* feat: Add chunk-size parameters and improve file type filtering

- Add --chunk-size and --chunk-overlap parameters to all RAG examples
- Preserve original default values for each data source:
  - Document: 256/128 (optimized for general documents)
  - Email: 256/25 (smaller overlap for email threads)
  - Browser: 256/128 (standard for web content)
  - WeChat: 192/64 (smaller chunks for chat messages)
- Make --file-types optional filter instead of restriction in document_rag
- Update README to clarify interactive mode and parameter usage
- Fix LLM default model documentation (gpt-4o, not gpt-4o-mini)
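
The per-source defaults above can be wired into argparse roughly as follows; the dict and helper names are hypothetical illustrations, not the project's code.

import argparse

# (chunk_size, chunk_overlap) defaults per data source, as listed above.
CHUNK_DEFAULTS = {
    "document": (256, 128),
    "email": (256, 25),
    "browser": (256, 128),
    "wechat": (192, 64),
}


def add_chunk_args(parser: argparse.ArgumentParser, source: str) -> None:
    size, overlap = CHUNK_DEFAULTS[source]
    parser.add_argument("--chunk-size", type=int, default=size)
    parser.add_argument("--chunk-overlap", type=int, default=overlap)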

* feat: Update documentation based on review feedback

- Add MLX embedding example to README
- Clarify examples/data content description (two papers, Pride and Prejudice, Chinese README)
- Move chunk parameters to common parameters section
- Remove duplicate chunk parameters from document-specific section

* docs: Emphasize diverse data sources in examples/data description

* fix: update default embedding models for better performance

- Change WeChat, Browser, and Email RAG examples to use all-MiniLM-L6-v2
- Previous Qwen/Qwen3-Embedding-0.6B was too slow for these use cases
- all-MiniLM-L6-v2 is a fast 384-dim model, ideal for large-scale personal data

* add response highlighting

* change rebuild logic

* fix some examples

* feat: check if k is larger than #docs
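
One way such a guard could look is sketched below; the names are illustrative, and the actual fix may warn or error instead of clamping.

def clamp_top_k(requested_k: int, num_docs: int) -> int:
    # Never ask the index for more results than there are documents.
    return min(requested_k, num_docs)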

* fix: WeChat history reader bugs and refactor wechat_rag to use unified architecture

* fix: wrong -1 handling in email example so all files are processed

* refactor: reorganize all examples/ and test/

* refactor: reorganize examples and add link checker

* fix: add init.py

* fix: handle certificate errors in link checker

* fix wechat

* merge

* docs: update README to use proper module imports for apps

- Change from 'python apps/xxx.py' to 'python -m apps.xxx'
- More professional, Pythonic module invocation
- Ensures proper module resolution and imports
- Better separation between apps/ (production tools) and examples/ (demos)
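
For instance, the subprocess commands in the test below could be built in the same module form (a sketch; the test as committed still invokes the script path apps/document_rag.py directly):

import sys

cmd = [
    sys.executable,
    "-m",
    "apps.document_rag",
    "--llm",
    "simulated",
    "--query",
    "What is Pride and Prejudice about?",
]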

---------

Co-authored-by: yichuan520030910320 <yichuan_wang@berkeley.edu>
2025-08-03 23:06:24 -07:00

121 lines
3.7 KiB
Python

"""
Test document_rag functionality using pytest.
"""
import os
import subprocess
import sys
import tempfile
from pathlib import Path
import pytest
@pytest.fixture
def test_data_dir():
"""Return the path to test data directory."""
return Path("data")
@pytest.mark.skipif(
os.environ.get("CI") == "true", reason="Skip model tests in CI to avoid MPS memory issues"
)
def test_document_rag_simulated(test_data_dir):
"""Test document_rag with simulated LLM."""
with tempfile.TemporaryDirectory() as temp_dir:
# Use a subdirectory that doesn't exist yet to force index creation
index_dir = Path(temp_dir) / "test_index"
cmd = [
sys.executable,
"apps/document_rag.py",
"--llm",
"simulated",
"--embedding-model",
"facebook/contriever",
"--embedding-mode",
"sentence-transformers",
"--index-dir",
str(index_dir),
"--data-dir",
str(test_data_dir),
"--query",
"What is Pride and Prejudice about?",
]
env = os.environ.copy()
env["HF_HUB_DISABLE_SYMLINKS"] = "1"
env["TOKENIZERS_PARALLELISM"] = "false"
result = subprocess.run(cmd, capture_output=True, text=True, timeout=600, env=env)
# Check return code
assert result.returncode == 0, f"Command failed: {result.stderr}"
# Verify output
output = result.stdout + result.stderr
assert "Index saved to" in output or "Using existing index" in output
assert "This is a simulated answer" in output
@pytest.mark.skipif(not os.environ.get("OPENAI_API_KEY"), reason="OpenAI API key not available")
def test_document_rag_openai(test_data_dir):
"""Test document_rag with OpenAI embeddings."""
with tempfile.TemporaryDirectory() as temp_dir:
# Use a subdirectory that doesn't exist yet to force index creation
index_dir = Path(temp_dir) / "test_index_openai"
cmd = [
sys.executable,
"apps/document_rag.py",
"--llm",
"simulated", # Use simulated LLM to avoid GPT-4 costs
"--embedding-model",
"text-embedding-3-small",
"--embedding-mode",
"openai",
"--index-dir",
str(index_dir),
"--data-dir",
str(test_data_dir),
"--query",
"What is Pride and Prejudice about?",
]
env = os.environ.copy()
env["TOKENIZERS_PARALLELISM"] = "false"
result = subprocess.run(cmd, capture_output=True, text=True, timeout=600, env=env)
assert result.returncode == 0, f"Command failed: {result.stderr}"
# Verify cosine distance was used
output = result.stdout + result.stderr
assert any(
msg in output
for msg in [
"distance_metric='cosine'",
"Automatically setting distance_metric='cosine'",
"Using cosine distance",
]
)
def test_document_rag_error_handling(test_data_dir):
"""Test document_rag with invalid parameters."""
with tempfile.TemporaryDirectory() as temp_dir:
cmd = [
sys.executable,
"apps/document_rag.py",
"--llm",
"invalid_llm_type",
"--index-dir",
temp_dir,
"--data-dir",
str(test_data_dir),
]
result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
# Should fail with invalid LLM type
assert result.returncode != 0
assert "invalid choice" in result.stderr or "invalid_llm_type" in result.stderr