* refactor: Unify examples interface with BaseRAGExample
  - Create BaseRAGExample base class for all RAG examples
  - Refactor 4 examples to use unified interface:
    - document_rag.py (replaces main_cli_example.py)
    - email_rag.py (replaces mail_reader_leann.py)
    - browser_rag.py (replaces google_history_reader_leann.py)
    - wechat_rag.py (replaces wechat_history_reader_leann.py)
  - Maintain 100% parameter compatibility with original files
  - Add interactive mode support for all examples
  - Unify parameter names (--max-items replaces --max-emails/--max-entries)
  - Update README.md with new examples usage
  - Add PARAMETER_CONSISTENCY.md documenting all parameter mappings
  - Keep main_cli_example.py for backward compatibility with migration notice

  All default values, LeannBuilder parameters, and chunking settings remain identical to ensure full compatibility with existing indexes.

* fix: Update CI tests for new unified examples interface
  - Rename test_main_cli.py to test_document_rag.py
  - Update all references from main_cli_example.py to document_rag.py
  - Update tests/README.md documentation

  The tests now properly test the new unified interface while maintaining the same test coverage and functionality.

* fix: Fix pre-commit issues and update tests
  - Fix import sorting and unused imports
  - Update type annotations to use built-in types (list, dict) instead of typing.List/Dict
  - Fix trailing whitespace and end-of-file issues
  - Fix Chinese fullwidth comma to regular comma
  - Update test_main_cli.py to test_document_rag.py
  - Add backward compatibility test for main_cli_example.py
  - Pass all pre-commit hooks (ruff, ruff-format, etc.)

* refactor: Remove old example scripts and migration references
  - Delete old example scripts (mail_reader_leann.py, google_history_reader_leann.py, etc.)
  - Remove migration hints and backward compatibility
  - Update tests to use new unified examples directly
  - Clean up all references to old script names
  - Users now only see the new unified interface

* fix: Restore embedding-mode parameter to all examples
  - All examples now have --embedding-mode parameter (unified interface benefit)
  - Default is 'sentence-transformers' (consistent with original behavior)
  - Users can now use OpenAI or MLX embeddings with any data source
  - Maintains functional equivalence with original scripts

* docs: Improve parameter categorization in README
  - Clearly separate core (shared) vs specific parameters
  - Move LLM and embedding examples to 'Example Commands' section
  - Add descriptive comments for all specific parameters
  - Keep only truly data-source-specific parameters in specific sections

* docs: Make example commands more representative
  - Add default values to parameter descriptions
  - Replace generic examples with real-world use cases
  - Focus on data-source-specific features in examples
  - Remove redundant demonstrations of common parameters

* docs: Reorganize parameter documentation structure
  - Move common parameters to a dedicated section before all examples
  - Rename sections to 'X-Specific Arguments' for clarity
  - Remove duplicate common parameters from individual examples
  - Better information architecture for users

* docs: polish applications

* docs: Add CLI installation instructions
  - Add two installation options: venv and global uv tool
  - Clearly explain when to use each option
  - Make CLI more accessible for daily use

* docs: Clarify CLI global installation process
  - Explain the transition from venv to global installation
  - Add upgrade command for global installation
  - Make it clear that global install allows usage without venv activation

* docs: Add collapsible section for CLI installation
  - Wrap CLI installation instructions in details/summary tags
  - Keep consistent with other collapsible sections in README
  - Improve document readability and navigation

* style: format

* docs: Fix collapsible sections
  - Make Common Parameters collapsible (as it's lengthy reference material)
  - Keep CLI Installation visible (important for users to see immediately)
  - Better information hierarchy

* docs: Add introduction for Common Parameters section
  - Add 'Flexible Configuration' heading with descriptive sentence
  - Create parallel structure with 'Generation Model Setup' section
  - Improve document flow and readability

* docs: nit

* fix: Fix issues in unified examples
  - Add smart path detection for data directory
  - Fix add_texts -> add_text method call
  - Handle both running from project root and examples directory

* fix: Fix async/await and add_text issues in unified examples
  - Remove incorrect await from chat.ask() calls (not async)
  - Fix add_texts -> add_text method calls
  - Verify search-complexity correctly maps to efSearch parameter
  - All examples now run successfully

* feat: Address review comments
  - Add complexity parameter to LeannChat initialization (default: search_complexity)
  - Fix chunk-size default in README documentation (256, not 2048)
  - Add more index building parameters as CLI arguments:
    - --backend-name (hnsw/diskann)
    - --graph-degree (default: 32)
    - --build-complexity (default: 64)
    - --no-compact (disable compact storage)
    - --no-recompute (disable embedding recomputation)
  - Update README to document all new parameters

* feat: Add chunk-size parameters and improve file type filtering
  - Add --chunk-size and --chunk-overlap parameters to all RAG examples
  - Preserve original default values for each data source:
    - Document: 256/128 (optimized for general documents)
    - Email: 256/25 (smaller overlap for email threads)
    - Browser: 256/128 (standard for web content)
    - WeChat: 192/64 (smaller chunks for chat messages)
  - Make --file-types optional filter instead of restriction in document_rag
  - Update README to clarify interactive mode and parameter usage
  - Fix LLM default model documentation (gpt-4o, not gpt-4o-mini)

* feat: Update documentation based on review feedback
  - Add MLX embedding example to README
  - Clarify examples/data content description (two papers, Pride and Prejudice, Chinese README)
  - Move chunk parameters to common parameters section
  - Remove duplicate chunk parameters from document-specific section

* docs: Emphasize diverse data sources in examples/data description

* fix: update default embedding models for better performance
  - Change WeChat, Browser, and Email RAG examples to use all-MiniLM-L6-v2
  - Previous Qwen/Qwen3-Embedding-0.6B was too slow for these use cases
  - all-MiniLM-L6-v2 is a fast 384-dim model, ideal for large-scale personal data

* add response highlight
* change rebuild logic
* fix some example
* feat: check if k is larger than #docs
* fix: WeChat history reader bugs and refactor wechat_rag to use unified architecture
* fix email wrong -1 to process all file
* refactor: reorganize all examples/ and test/
* refactor: reorganize examples and add link checker
* fix: add init.py
* fix: handle certificate errors in link checker
* fix wechat
* merge

* docs: update README to use proper module imports for apps
  - Change from 'python apps/xxx.py' to 'python -m apps.xxx'
  - More professional and pythonic module calling
  - Ensures proper module resolution and imports
  - Better separation between apps/ (production tools) and examples/ (demos)

---------

Co-authored-by: yichuan520030910320 <yichuan_wang@berkeley.edu>
187 lines
6.4 KiB
Python
"""
Mbox parser.

Contains a simple parser for mbox files.
"""

import logging
from pathlib import Path
from typing import Any

from fsspec import AbstractFileSystem
from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document

logger = logging.getLogger(__name__)


class MboxReader(BaseReader):
    """
    Mbox parser.

    Extract messages from mailbox files.
    Returns a string including date, subject, sender, receiver and
    content for each message.

    """

    DEFAULT_MESSAGE_FORMAT: str = (
        "Date: {_date}\nFrom: {_from}\nTo: {_to}\nSubject: {_subject}\nContent: {_content}"
    )

    def __init__(
        self,
        *args: Any,
        max_count: int = 0,
        message_format: str = DEFAULT_MESSAGE_FORMAT,
        **kwargs: Any,
    ) -> None:
        """Init params."""
        # Fail early if the HTML-stripping dependency is missing
        try:
            from bs4 import BeautifulSoup  # noqa
        except ImportError as err:
            raise ImportError(
                "`beautifulsoup4` package not found: `pip install beautifulsoup4`"
            ) from err

        super().__init__(*args, **kwargs)
        self.max_count = max_count  # 0 (or less) means no limit
        self.message_format = message_format

    def load_data(
        self,
        file: Path,
        extra_info: dict | None = None,
        fs: AbstractFileSystem | None = None,
    ) -> list[Document]:
        """Parse an mbox file into one Document per message."""
        # Import required libraries
        import mailbox
        from email.parser import BytesParser
        from email.policy import default

        from bs4 import BeautifulSoup

        if fs:
            logger.warning(
                "fs was specified but MboxReader doesn't support loading "
                "from fsspec filesystems. Will load from local filesystem instead."
            )

        i = 0
        results: list[str] = []
        # Load file using mailbox
        bytes_parser = BytesParser(policy=default).parse
        mbox = mailbox.mbox(file, factory=bytes_parser)  # type: ignore

        # Iterate through all messages
        for _msg in mbox:
            try:
                msg: mailbox.mboxMessage = _msg
                # Reset per message so a multipart message without a text/plain
                # part never reuses the previous message's body
                content = None
                # Parse multipart messages
                if msg.is_multipart():
                    for part in msg.walk():
                        ctype = part.get_content_type()
                        cdispo = str(part.get("Content-Disposition"))
                        if "attachment" in cdispo:
                            logger.debug(f"Attachment found: {part.get_filename()}")
                        if ctype == "text/plain" and "attachment" not in cdispo:
                            content = part.get_payload(decode=True)  # decode
                            break
                # Get plain message payload for non-multipart messages
                else:
                    content = msg.get_payload(decode=True)

                # Strip any HTML markup and collapse runs of whitespace;
                # naming the parser avoids bs4's "no parser specified" warning
                soup = BeautifulSoup(content or b"", "html.parser")
                stripped_content = " ".join(soup.get_text().split())
                # Format message to include date, sender, receiver and subject
                msg_string = self.message_format.format(
                    _date=msg["date"],
                    _from=msg["from"],
                    _to=msg["to"],
                    _subject=msg["subject"],
                    _content=stripped_content,
                )
                # Add message string to results
                results.append(msg_string)
            except Exception as e:
                logger.warning(f"Failed to parse message:\n{_msg}\n with exception {e}")

            # Increment counter and stop once max_count messages have been visited
            i += 1
            if self.max_count > 0 and i >= self.max_count:
                break

        return [Document(text=result, metadata=extra_info or {}) for result in results]


class EmlxMboxReader(MboxReader):
    """
    EmlxMboxReader - Modified MboxReader that handles directories of .emlx files.

    Extends MboxReader to work with Apple Mail's .emlx format by:
    1. Reading .emlx files from a directory
    2. Converting them to mbox format in a temporary file
    3. Using the parent MboxReader's parsing logic
    """

    def load_data(
        self,
        directory: Path,
        extra_info: dict | None = None,
        fs: AbstractFileSystem | None = None,
    ) -> list[Document]:
        """Parse .emlx files from directory into strings using MboxReader logic."""
        import os
        import tempfile

        if fs:
            logger.warning(
                "fs was specified but EmlxMboxReader doesn't support loading "
                "from fsspec filesystems. Will load from local filesystem instead."
            )

        # Find all .emlx files in the directory
        emlx_files = list(directory.glob("*.emlx"))
        logger.info(f"Found {len(emlx_files)} .emlx files in {directory}")

        if not emlx_files:
            logger.warning(f"No .emlx files found in {directory}")
            return []

        # Create a temporary mbox file
        with tempfile.NamedTemporaryFile(
            mode="w", suffix=".mbox", delete=False, encoding="utf-8"
        ) as temp_mbox:
            temp_mbox_path = temp_mbox.name

            # Convert .emlx files to mbox format
            for emlx_file in emlx_files:
                try:
                    # Read the .emlx file
                    with open(emlx_file, encoding="utf-8", errors="ignore") as f:
                        content = f.read()

                    # .emlx format: first line is length, rest is email content
                    lines = content.split("\n", 1)
                    if len(lines) >= 2:
                        email_content = lines[1]  # Skip the length line

                        # Write to mbox format: the "From " separator goes on its own
                        # line so the first header is not swallowed by it, and each
                        # message ends with a blank line
                        temp_mbox.write(f"From {emlx_file.name}\n{email_content}\n\n")

                except Exception as e:
                    logger.warning(f"Failed to process {emlx_file}: {e}")
                    continue

        # Leaving the with-block closes the temporary file so MboxReader can read it
        try:
            # Use the parent MboxReader's logic to parse the mbox file
            return super().load_data(Path(temp_mbox_path), extra_info, fs)
        finally:
            # Clean up temporary file
            try:
                os.unlink(temp_mbox_path)
            except OSError:
                pass