refactor: Unify examples interface with BaseRAGExample (#12)

* refactor: Unify examples interface with BaseRAGExample

- Create BaseRAGExample base class for all RAG examples
- Refactor 4 examples to use unified interface:
  - document_rag.py (replaces main_cli_example.py)
  - email_rag.py (replaces mail_reader_leann.py)
  - browser_rag.py (replaces google_history_reader_leann.py)
  - wechat_rag.py (replaces wechat_history_reader_leann.py)
- Maintain 100% parameter compatibility with original files
- Add interactive mode support for all examples
- Unify parameter names (--max-items replaces --max-emails/--max-entries)
- Update README.md with new examples usage
- Add PARAMETER_CONSISTENCY.md documenting all parameter mappings
- Keep main_cli_example.py for backward compatibility with migration notice

All default values, LeannBuilder parameters, and chunking settings
remain identical to ensure full compatibility with existing indexes.
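
A minimal sketch of what a shared base class along these lines could look like. The BaseRAGExample name comes from this PR, but the method names, flags, and defaults below are illustrative assumptions rather than the actual implementation:

import argparse
from abc import ABC, abstractmethod


class BaseRAGExample(ABC):
    """Shared CLI and run loop for the RAG examples (illustrative sketch only)."""

    name = "base"

    def build_parser(self) -> argparse.ArgumentParser:
        parser = argparse.ArgumentParser(description=f"{self.name} RAG example")
        # Common flags shared by every example; defaults here are assumptions.
        parser.add_argument("--max-items", type=int, default=-1,
                            help="Maximum items to index (-1 = all)")
        parser.add_argument("--embedding-mode", default="sentence-transformers",
                            choices=["sentence-transformers", "openai", "mlx"])
        parser.add_argument("--chunk-size", type=int, default=256)
        parser.add_argument("--chunk-overlap", type=int, default=128)
        parser.add_argument("--interactive", action="store_true",
                            help="Drop into an interactive question loop")
        self.add_specific_arguments(parser)  # data-source-specific flags
        return parser

    def add_specific_arguments(self, parser: argparse.ArgumentParser) -> None:
        """Subclasses add flags such as --chrome-profile or --wechat-export-dir."""

    @abstractmethod
    def load_texts(self, args: argparse.Namespace) -> list[str]:
        """Subclasses return the raw texts to index for their data source."""

    def run(self) -> None:
        args = self.build_parser().parse_args()
        texts = self.load_texts(args)
        # Index building and the chat loop would happen here in the real examples.
        print(f"Loaded {len(texts)} texts for the {self.name} example")

Each concrete example (document_rag, email_rag, browser_rag, wechat_rag) then only supplies its loader and its source-specific flags.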

* fix: Update CI tests for new unified examples interface

- Rename test_main_cli.py to test_document_rag.py
- Update all references from main_cli_example.py to document_rag.py
- Update tests/README.md documentation

The tests now properly test the new unified interface while maintaining
the same test coverage and functionality.

* fix: Fix pre-commit issues and update tests

- Fix import sorting and unused imports
- Update type annotations to use built-in types (list, dict) instead of typing.List/Dict
- Fix trailing whitespace and end-of-file issues
- Fix Chinese fullwidth comma to regular comma
- Update test_main_cli.py to test_document_rag.py
- Add backward compatibility test for main_cli_example.py
- Pass all pre-commit hooks (ruff, ruff-format, etc.)

* refactor: Remove old example scripts and migration references

- Delete old example scripts (mail_reader_leann.py, google_history_reader_leann.py, etc.)
- Remove migration hints and backward compatibility
- Update tests to use new unified examples directly
- Clean up all references to old script names
- Users now only see the new unified interface

* fix: Restore embedding-mode parameter to all examples

- All examples now have --embedding-mode parameter (unified interface benefit)
- Default is 'sentence-transformers' (consistent with original behavior)
- Users can now use OpenAI or MLX embeddings with any data source
- Maintains functional equivalence with original scripts
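
A small environment guard (not part of this PR, purely illustrative) shows what using OpenAI or MLX embeddings implies in practice: openai needs an API key in the environment, and mlx only runs on Apple Silicon macOS:

import os
import platform


def check_embedding_mode(mode: str) -> None:
    """Fail fast if the chosen --embedding-mode cannot work in this environment."""
    if mode == "openai" and not os.environ.get("OPENAI_API_KEY"):
        raise SystemExit("--embedding-mode openai requires OPENAI_API_KEY to be set")
    if mode == "mlx" and (platform.system() != "Darwin" or platform.machine() != "arm64"):
        raise SystemExit("--embedding-mode mlx requires Apple Silicon macOS")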

* docs: Improve parameter categorization in README

- Clearly separate core (shared) vs specific parameters
- Move LLM and embedding examples to 'Example Commands' section
- Add descriptive comments for all specific parameters
- Keep only truly data-source-specific parameters in specific sections

* docs: Make example commands more representative

- Add default values to parameter descriptions
- Replace generic examples with real-world use cases
- Focus on data-source-specific features in examples
- Remove redundant demonstrations of common parameters

* docs: Reorganize parameter documentation structure

- Move common parameters to a dedicated section before all examples
- Rename sections to 'X-Specific Arguments' for clarity
- Remove duplicate common parameters from individual examples
- Better information architecture for users

* docs: polish applications

* docs: Add CLI installation instructions

- Add two installation options: venv and global uv tool
- Clearly explain when to use each option
- Make CLI more accessible for daily use

* docs: Clarify CLI global installation process

- Explain the transition from venv to global installation
- Add upgrade command for global installation
- Make it clear that global install allows usage without venv activation

* docs: Add collapsible section for CLI installation

- Wrap CLI installation instructions in details/summary tags
- Keep consistent with other collapsible sections in README
- Improve document readability and navigation

* style: format

* docs: Fix collapsible sections

- Make Common Parameters collapsible (as it's lengthy reference material)
- Keep CLI Installation visible (important for users to see immediately)
- Better information hierarchy

* docs: Add introduction for Common Parameters section

- Add 'Flexible Configuration' heading with descriptive sentence
- Create parallel structure with 'Generation Model Setup' section
- Improve document flow and readability

* docs: nit

* fix: Fix issues in unified examples

- Add smart path detection for data directory
- Fix add_texts -> add_text method call
- Handle both running from project root and examples directory
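
A sketch of what such path detection typically looks like; the helper name and candidate locations are assumptions, not the PR's exact code:

from pathlib import Path


def resolve_data_dir(name: str = "data") -> Path:
    """Find the examples data directory whether run from the repo root or examples/."""
    for candidate in (Path(name), Path("examples") / name, Path(__file__).parent / name):
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"Could not locate a '{name}' directory")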

* fix: Fix async/await and add_text issues in unified examples

- Remove incorrect await from chat.ask() calls (not async)
- Fix add_texts -> add_text method calls
- Verify search-complexity correctly maps to efSearch parameter
- All examples now run successfully

* feat: Address review comments

- Add complexity parameter to LeannChat initialization (default: search_complexity)
- Fix chunk-size default in README documentation (256, not 2048)
- Add more index building parameters as CLI arguments:
  - --backend-name (hnsw/diskann)
  - --graph-degree (default: 32)
  - --build-complexity (default: 64)
  - --no-compact (disable compact storage)
  - --no-recompute (disable embedding recomputation)
- Update README to document all new parameters
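
To make the new flags concrete, an argparse sketch with the flags and defaults listed above; the --backend-name default and the exact mapping onto LeannBuilder keyword arguments are assumptions not shown in this commit message:

import argparse


def add_index_build_args(parser: argparse.ArgumentParser) -> None:
    # Flags and defaults from this commit; --backend-name's default is an assumption.
    parser.add_argument("--backend-name", choices=["hnsw", "diskann"], default="hnsw")
    parser.add_argument("--graph-degree", type=int, default=32)
    parser.add_argument("--build-complexity", type=int, default=64)
    parser.add_argument("--no-compact", action="store_true", help="Disable compact storage")
    parser.add_argument("--no-recompute", action="store_true",
                        help="Disable embedding recomputation")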

* feat: Add chunk-size parameters and improve file type filtering

- Add --chunk-size and --chunk-overlap parameters to all RAG examples
- Preserve original default values for each data source:
  - Document: 256/128 (optimized for general documents)
  - Email: 256/25 (smaller overlap for email threads)
  - Browser: 256/128 (standard for web content)
  - WeChat: 192/64 (smaller chunks for chat messages)
- Make --file-types optional filter instead of restriction in document_rag
- Update README to clarify interactive mode and parameter usage
- Fix LLM default model documentation (gpt-4o, not gpt-4o-mini)
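
The per-source defaults listed above, collected in one place; the dictionary and helper names are illustrative, not from the PR:

# (chunk_size, chunk_overlap) defaults per data source, as listed above.
CHUNK_DEFAULTS = {
    "document": (256, 128),
    "email": (256, 25),
    "browser": (256, 128),
    "wechat": (192, 64),
}


def chunk_args_for(source: str) -> dict[str, int]:
    size, overlap = CHUNK_DEFAULTS[source]
    return {"chunk_size": size, "chunk_overlap": overlap}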

* feat: Update documentation based on review feedback

- Add MLX embedding example to README
- Clarify examples/data content description (two papers, Pride and Prejudice, Chinese README)
- Move chunk parameters to common parameters section
- Remove duplicate chunk parameters from document-specific section

* docs: Emphasize diverse data sources in examples/data description

* fix: update default embedding models for better performance

- Change WeChat, Browser, and Email RAG examples to use all-MiniLM-L6-v2
- Previous Qwen/Qwen3-Embedding-0.6B was too slow for these use cases
- all-MiniLM-L6-v2 is a fast 384-dim model, ideal for large-scale personal data
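
To verify the swap locally, sentence-transformers can report a model's dimensionality directly (assumes the sentence-transformers package is installed):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.get_sentence_embedding_dimension())  # 384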

* add response highlight

* change rebuild logic

* fix some examples

* feat: check if k is larger than #docs
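
A minimal version of that guard; variable names are illustrative, not the PR's exact code:

def clamp_top_k(k: int, num_docs: int) -> int:
    """Avoid asking the index for more results than there are documents."""
    if k > num_docs:
        print(f"Requested top-{k} but only {num_docs} documents are indexed; using {num_docs}.")
        return num_docs
    return k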

* fix: WeChat history reader bugs and refactor wechat_rag to use unified architecture

* fix: handle -1 correctly in email example so it processes all files

* refactor: reorganize all examples/ and tests/

* refactor: reorganize examples and add link checker

* fix: add __init__.py

* fix: handle certificate errors in link checker

* fix wechat

* merge

* docs: update README to use proper module imports for apps

- Change from 'python apps/xxx.py' to 'python -m apps.xxx'
- More idiomatic, Pythonic module invocation
- Ensures proper module resolution and imports
- Better separation between apps/ (production tools) and examples/ (demos)

---------

Co-authored-by: yichuan520030910320 <yichuan_wang@berkeley.edu>
Author: Andy Lee
Date: 2025-08-03 23:06:24 -07:00
Committed by: GitHub
Parent: 54df6310c5
Commit: 8899734952
50 changed files with 1293 additions and 3193 deletions


@@ -0,0 +1,3 @@
from .history import ChromeHistoryReader

__all__ = ["ChromeHistoryReader"]


@@ -0,0 +1,186 @@
import os
import sqlite3
from pathlib import Path
from typing import Any

from llama_index.core import Document
from llama_index.core.readers.base import BaseReader


class ChromeHistoryReader(BaseReader):
    """
    Chrome browser history reader that extracts browsing data from SQLite database.

    Reads Chrome history from the default Chrome profile location and creates documents
    with embedded metadata similar to the email reader structure.
    """

    def __init__(self) -> None:
        """Initialize."""
        pass

    def load_data(self, input_dir: str | None = None, **load_kwargs: Any) -> list[Document]:
        """
        Load Chrome history data from the default Chrome profile location.

        Args:
            input_dir: Not used for Chrome history (kept for compatibility)
            **load_kwargs:
                max_count (int): Maximum amount of history entries to read.
                chrome_profile_path (str): Custom path to Chrome profile directory.
        """
        docs: list[Document] = []
        max_count = load_kwargs.get("max_count", 1000)
        chrome_profile_path = load_kwargs.get("chrome_profile_path", None)

        # Default Chrome profile path on macOS
        if chrome_profile_path is None:
            chrome_profile_path = os.path.expanduser(
                "~/Library/Application Support/Google/Chrome/Default"
            )

        history_db_path = os.path.join(chrome_profile_path, "History")
        if not os.path.exists(history_db_path):
            print(f"Chrome history database not found at: {history_db_path}")
            return docs

        try:
            # Connect to the Chrome history database
            print(f"Connecting to database: {history_db_path}")
            conn = sqlite3.connect(history_db_path)
            cursor = conn.cursor()

            # Query to get browsing history with metadata (removed created_time column)
            query = """
            SELECT
                datetime(last_visit_time/1000000-11644473600,'unixepoch','localtime') as last_visit,
                url,
                title,
                visit_count,
                typed_count,
                hidden
            FROM urls
            ORDER BY last_visit_time DESC
            """

            print(f"Executing query on database: {history_db_path}")
            cursor.execute(query)
            rows = cursor.fetchall()
            print(f"Query returned {len(rows)} rows")

            count = 0
            for row in rows:
                if count >= max_count and max_count > 0:
                    break

                last_visit, url, title, visit_count, typed_count, hidden = row

                # Create document content with metadata embedded in text
                doc_content = f"""
[Title]: {title}
[URL of the page]: {url}
[Last visited time]: {last_visit}
[Visit times]: {visit_count}
[Typed times]: {typed_count}
"""

                # Create document with embedded metadata
                doc = Document(text=doc_content, metadata={"title": title[0:150]})
                # if len(title) > 150:
                #     print(f"Title is too long: {title}")
                docs.append(doc)
                count += 1

            conn.close()
            print(f"Loaded {len(docs)} Chrome history documents")

        except Exception as e:
            print(f"Error reading Chrome history: {e}")
            # add: you may need to close your browser to make the database file available
            # also highlight in red
            print(
                "\033[91mYou may need to close your browser to make the database file available\033[0m"
            )
            return docs

        return docs

    @staticmethod
    def find_chrome_profiles() -> list[Path]:
        """
        Find all Chrome profile directories.

        Returns:
            List of Path objects pointing to Chrome profile directories
        """
        chrome_base_path = Path(os.path.expanduser("~/Library/Application Support/Google/Chrome"))
        profile_dirs = []

        if not chrome_base_path.exists():
            print(f"Chrome directory not found at: {chrome_base_path}")
            return profile_dirs

        # Find all profile directories
        for profile_dir in chrome_base_path.iterdir():
            if profile_dir.is_dir() and profile_dir.name != "System Profile":
                history_path = profile_dir / "History"
                if history_path.exists():
                    profile_dirs.append(profile_dir)
                    print(f"Found Chrome profile: {profile_dir}")

        print(f"Found {len(profile_dirs)} Chrome profiles")
        return profile_dirs

    @staticmethod
    def export_history_to_file(
        output_file: str = "chrome_history_export.txt", max_count: int = 1000
    ):
        """
        Export Chrome history to a text file using the same SQL query format.

        Args:
            output_file: Path to the output file
            max_count: Maximum number of entries to export
        """
        chrome_profile_path = os.path.expanduser(
            "~/Library/Application Support/Google/Chrome/Default"
        )
        history_db_path = os.path.join(chrome_profile_path, "History")

        if not os.path.exists(history_db_path):
            print(f"Chrome history database not found at: {history_db_path}")
            return

        try:
            conn = sqlite3.connect(history_db_path)
            cursor = conn.cursor()

            query = """
            SELECT
                datetime(last_visit_time/1000000-11644473600,'unixepoch','localtime') as last_visit,
                url,
                title,
                visit_count,
                typed_count,
                hidden
            FROM urls
            ORDER BY last_visit_time DESC
            LIMIT ?
            """

            cursor.execute(query, (max_count,))
            rows = cursor.fetchall()

            with open(output_file, "w", encoding="utf-8") as f:
                for row in rows:
                    last_visit, url, title, visit_count, typed_count, hidden = row
                    f.write(
                        f"{last_visit}\t{url}\t{title}\t{visit_count}\t{typed_count}\t{hidden}\n"
                    )

            conn.close()
            print(f"Exported {len(rows)} history entries to {output_file}")

        except Exception as e:
            print(f"Error exporting Chrome history: {e}")


@@ -0,0 +1,774 @@
import json
import os
import re
import subprocess
import time
from datetime import datetime
from pathlib import Path
from typing import Any

from llama_index.core import Document
from llama_index.core.readers.base import BaseReader


class WeChatHistoryReader(BaseReader):
    """
    WeChat chat history reader that extracts chat data from exported JSON files.

    Reads WeChat chat history from exported JSON files (from wechat-exporter tool)
    and creates documents with embedded metadata similar to the Chrome history reader structure.
    Also includes utilities for automatic WeChat chat history export.
    """

    def __init__(self) -> None:
        """Initialize."""
        self.packages_dir = Path(__file__).parent.parent.parent / "packages"
        self.wechat_exporter_dir = self.packages_dir / "wechat-exporter"
        self.wechat_decipher_dir = self.packages_dir / "wechat-decipher-macos"

    def check_wechat_running(self) -> bool:
        """Check if WeChat is currently running."""
        try:
            result = subprocess.run(["pgrep", "-f", "WeChat"], capture_output=True, text=True)
            return result.returncode == 0
        except Exception:
            return False

    def install_wechattweak(self) -> bool:
        """Install WeChatTweak CLI tool."""
        try:
            # Create wechat-exporter directory if it doesn't exist
            self.wechat_exporter_dir.mkdir(parents=True, exist_ok=True)
            wechattweak_path = self.wechat_exporter_dir / "wechattweak-cli"
            if not wechattweak_path.exists():
                print("Downloading WeChatTweak CLI...")
                subprocess.run(
                    [
                        "curl",
                        "-L",
                        "-o",
                        str(wechattweak_path),
                        "https://github.com/JettChenT/WeChatTweak-CLI/releases/latest/download/wechattweak-cli",
                    ],
                    check=True,
                )
                # Make executable
                wechattweak_path.chmod(0o755)
            # Install WeChatTweak
            print("Installing WeChatTweak...")
            subprocess.run(["sudo", str(wechattweak_path), "install"], check=True)
            return True
        except Exception as e:
            print(f"Error installing WeChatTweak: {e}")
            return False

    def restart_wechat(self):
        """Restart WeChat to apply WeChatTweak."""
        try:
            print("Restarting WeChat...")
            subprocess.run(["pkill", "-f", "WeChat"], check=False)
            time.sleep(2)
            subprocess.run(["open", "-a", "WeChat"], check=True)
            time.sleep(5)  # Wait for WeChat to start
        except Exception as e:
            print(f"Error restarting WeChat: {e}")

    def check_api_available(self) -> bool:
        """Check if WeChatTweak API is available."""
        try:
            result = subprocess.run(
                ["curl", "-s", "http://localhost:48065/wechat/allcontacts"],
                capture_output=True,
                text=True,
                timeout=5,
            )
            return result.returncode == 0 and result.stdout.strip()
        except Exception:
            return False

    def _extract_readable_text(self, content: str) -> str:
        """
        Extract readable text from message content, removing XML and system messages.

        Args:
            content: The raw message content (can be string or dict)

        Returns:
            Cleaned, readable text
        """
        if not content:
            return ""

        # Handle dictionary content (like quoted messages)
        if isinstance(content, dict):
            # Extract text from dictionary structure
            text_parts = []
            if "title" in content:
                text_parts.append(str(content["title"]))
            if "quoted" in content:
                text_parts.append(str(content["quoted"]))
            if "content" in content:
                text_parts.append(str(content["content"]))
            if "text" in content:
                text_parts.append(str(content["text"]))
            if text_parts:
                return " | ".join(text_parts)
            else:
                # If we can't extract meaningful text from dict, return empty
                return ""

        # Handle string content
        if not isinstance(content, str):
            return ""

        # Remove common prefixes like "wxid_xxx:\n"
        clean_content = re.sub(r"^wxid_[^:]+:\s*", "", content)
        clean_content = re.sub(r"^[^:]+:\s*", "", clean_content)

        # If it's just XML or system message, return empty
        if clean_content.strip().startswith("<") or "recalled a message" in clean_content:
            return ""

        return clean_content.strip()

    def _is_text_message(self, content: str) -> bool:
        """
        Check if a message contains readable text content.

        Args:
            content: The message content (can be string or dict)

        Returns:
            True if the message contains readable text, False otherwise
        """
        if not content:
            return False

        # Handle dictionary content
        if isinstance(content, dict):
            # Check if dict has any readable text fields
            text_fields = ["title", "quoted", "content", "text"]
            for field in text_fields:
                if content.get(field):
                    return True
            return False

        # Handle string content
        if not isinstance(content, str):
            return False

        # Skip image messages (contain XML with img tags)
        if "<img" in content and "cdnurl" in content:
            return False
        # Skip emoji messages (contain emoji XML tags)
        if "<emoji" in content and "productid" in content:
            return False
        # Skip voice messages
        if "<voice" in content:
            return False
        # Skip video messages
        if "<video" in content:
            return False
        # Skip file messages
        if "<appmsg" in content and "appid" in content:
            return False
        # Skip system messages (like "recalled a message")
        if "recalled a message" in content:
            return False

        # Check if there's actual readable text (not just XML or system messages)
        # Remove common prefixes like "wxid_xxx:\n" and check for actual content
        clean_content = re.sub(r"^wxid_[^:]+:\s*", "", content)
        clean_content = re.sub(r"^[^:]+:\s*", "", clean_content)

        # If after cleaning we have meaningful text, consider it readable
        if len(clean_content.strip()) > 0 and not clean_content.strip().startswith("<"):
            return True

        return False

    def _concatenate_messages(
        self,
        messages: list[dict],
        max_length: int = 128,
        time_window_minutes: int = 30,
        overlap_messages: int = 0,
    ) -> list[dict]:
        """
        Concatenate messages based on length and time rules.

        Args:
            messages: List of message dictionaries
            max_length: Maximum length for concatenated message groups. Use -1 to disable length constraint.
            time_window_minutes: Time window in minutes to group messages together. Use -1 to disable time constraint.
            overlap_messages: Number of messages to overlap between consecutive groups

        Returns:
            List of concatenated message groups
        """
        if not messages:
            return []

        concatenated_groups = []
        current_group = []
        current_length = 0
        last_timestamp = None

        for message in messages:
            # Extract message info
            content = message.get("content", "")
            message_text = message.get("message", "")
            create_time = message.get("createTime", 0)
            message.get("fromUser", "")
            message.get("toUser", "")
            message.get("isSentFromSelf", False)

            # Extract readable text
            readable_text = self._extract_readable_text(content)
            if not readable_text:
                readable_text = message_text

            # Skip empty messages
            if not readable_text.strip():
                continue

            # Check time window constraint (only if time_window_minutes != -1)
            if time_window_minutes != -1 and last_timestamp is not None and create_time > 0:
                time_diff_minutes = (create_time - last_timestamp) / 60
                if time_diff_minutes > time_window_minutes:
                    # Time gap too large, start new group
                    if current_group:
                        concatenated_groups.append(
                            {
                                "messages": current_group,
                                "total_length": current_length,
                                "start_time": current_group[0].get("createTime", 0),
                                "end_time": current_group[-1].get("createTime", 0),
                            }
                        )
                        # Keep last few messages for overlap
                        if overlap_messages > 0 and len(current_group) > overlap_messages:
                            current_group = current_group[-overlap_messages:]
                            current_length = sum(
                                len(
                                    self._extract_readable_text(msg.get("content", ""))
                                    or msg.get("message", "")
                                )
                                for msg in current_group
                            )
                        else:
                            current_group = []
                            current_length = 0

            # Check length constraint (only if max_length != -1)
            message_length = len(readable_text)
            if max_length != -1 and current_length + message_length > max_length and current_group:
                # Current group would exceed max length, save it and start new
                concatenated_groups.append(
                    {
                        "messages": current_group,
                        "total_length": current_length,
                        "start_time": current_group[0].get("createTime", 0),
                        "end_time": current_group[-1].get("createTime", 0),
                    }
                )
                # Keep last few messages for overlap
                if overlap_messages > 0 and len(current_group) > overlap_messages:
                    current_group = current_group[-overlap_messages:]
                    current_length = sum(
                        len(
                            self._extract_readable_text(msg.get("content", ""))
                            or msg.get("message", "")
                        )
                        for msg in current_group
                    )
                else:
                    current_group = []
                    current_length = 0

            # Add message to current group
            current_group.append(message)
            current_length += message_length
            last_timestamp = create_time

        # Add the last group if it exists
        if current_group:
            concatenated_groups.append(
                {
                    "messages": current_group,
                    "total_length": current_length,
                    "start_time": current_group[0].get("createTime", 0),
                    "end_time": current_group[-1].get("createTime", 0),
                }
            )

        return concatenated_groups

    def _create_concatenated_content(self, message_group: dict, contact_name: str) -> str:
        """
        Create concatenated content from a group of messages.

        Args:
            message_group: Dictionary containing messages and metadata
            contact_name: Name of the contact

        Returns:
            Formatted concatenated content
        """
        messages = message_group["messages"]
        start_time = message_group["start_time"]
        end_time = message_group["end_time"]

        # Format timestamps
        if start_time:
            try:
                start_timestamp = datetime.fromtimestamp(start_time)
                start_time_str = start_timestamp.strftime("%Y-%m-%d %H:%M:%S")
            except (ValueError, OSError):
                start_time_str = str(start_time)
        else:
            start_time_str = "Unknown"

        if end_time:
            try:
                end_timestamp = datetime.fromtimestamp(end_time)
                end_time_str = end_timestamp.strftime("%Y-%m-%d %H:%M:%S")
            except (ValueError, OSError):
                end_time_str = str(end_time)
        else:
            end_time_str = "Unknown"

        # Build concatenated message content
        message_parts = []
        for message in messages:
            content = message.get("content", "")
            message_text = message.get("message", "")
            create_time = message.get("createTime", 0)
            is_sent_from_self = message.get("isSentFromSelf", False)

            # Extract readable text
            readable_text = self._extract_readable_text(content)
            if not readable_text:
                readable_text = message_text

            # Format individual message
            if create_time:
                try:
                    timestamp = datetime.fromtimestamp(create_time)
                    # change to YYYY-MM-DD HH:MM:SS
                    time_str = timestamp.strftime("%Y-%m-%d %H:%M:%S")
                except (ValueError, OSError):
                    time_str = str(create_time)
            else:
                time_str = "Unknown"

            sender = "[Me]" if is_sent_from_self else "[Contact]"
            message_parts.append(f"({time_str}) {sender}: {readable_text}")

        concatenated_text = "\n".join(message_parts)

        # Create final document content
        doc_content = f"""
Contact: {contact_name}
Time Range: {start_time_str} - {end_time_str}
Messages ({len(messages)} messages, {message_group["total_length"]} chars):
{concatenated_text}
"""
        # TODO @yichuan give better format and rich info here!
        doc_content = f"""
{concatenated_text}
"""
        return doc_content, contact_name

    def load_data(self, input_dir: str | None = None, **load_kwargs: Any) -> list[Document]:
        """
        Load WeChat chat history data from exported JSON files.

        Args:
            input_dir: Directory containing exported WeChat JSON files
            **load_kwargs:
                max_count (int): Maximum amount of chat entries to read.
                wechat_export_dir (str): Custom path to WeChat export directory.
                include_non_text (bool): Whether to include non-text messages (images, emojis, etc.)
                concatenate_messages (bool): Whether to concatenate messages based on length rules.
                max_length (int): Maximum length for concatenated message groups (default: 1000).
                time_window_minutes (int): Time window in minutes to group messages together (default: 30).
                overlap_messages (int): Number of messages to overlap between consecutive groups (default: 2).
        """
        docs: list[Document] = []
        max_count = load_kwargs.get("max_count", 1000)
        wechat_export_dir = load_kwargs.get("wechat_export_dir", None)
        include_non_text = load_kwargs.get("include_non_text", False)
        concatenate_messages = load_kwargs.get("concatenate_messages", False)
        max_length = load_kwargs.get("max_length", 1000)
        time_window_minutes = load_kwargs.get("time_window_minutes", 30)

        # Default WeChat export path
        if wechat_export_dir is None:
            wechat_export_dir = "./wechat_export_test"

        if not os.path.exists(wechat_export_dir):
            print(f"WeChat export directory not found at: {wechat_export_dir}")
            return docs

        try:
            # Find all JSON files in the export directory
            json_files = list(Path(wechat_export_dir).glob("*.json"))
            print(f"Found {len(json_files)} WeChat chat history files")

            count = 0
            for json_file in json_files:
                if count >= max_count and max_count > 0:
                    break

                try:
                    with open(json_file, encoding="utf-8") as f:
                        chat_data = json.load(f)

                    # Extract contact name from filename
                    contact_name = json_file.stem

                    if concatenate_messages:
                        # Filter messages to only include readable text messages
                        readable_messages = []
                        for message in chat_data:
                            try:
                                content = message.get("content", "")
                                if not include_non_text and not self._is_text_message(content):
                                    continue
                                readable_text = self._extract_readable_text(content)
                                if not readable_text and not include_non_text:
                                    continue
                                readable_messages.append(message)
                            except Exception as e:
                                print(f"Error processing message in {json_file}: {e}")
                                continue

                        # Concatenate messages based on rules
                        message_groups = self._concatenate_messages(
                            readable_messages,
                            max_length=max_length,
                            time_window_minutes=time_window_minutes,
                            overlap_messages=0,  # No overlap between groups
                        )

                        # Create documents from concatenated groups
                        for message_group in message_groups:
                            if count >= max_count and max_count > 0:
                                break
                            doc_content, contact_name = self._create_concatenated_content(
                                message_group, contact_name
                            )
                            doc = Document(
                                text=doc_content,
                                metadata={"contact_name": contact_name},
                            )
                            docs.append(doc)
                            count += 1

                        print(
                            f"Created {len(message_groups)} concatenated message groups for {contact_name}"
                        )
                    else:
                        # Original single-message processing
                        for message in chat_data:
                            if count >= max_count and max_count > 0:
                                break

                            # Extract message information
                            message.get("fromUser", "")
                            message.get("toUser", "")
                            content = message.get("content", "")
                            message_text = message.get("message", "")
                            create_time = message.get("createTime", 0)
                            is_sent_from_self = message.get("isSentFromSelf", False)

                            # Handle content that might be dict or string
                            try:
                                # Check if this is a readable text message
                                if not include_non_text and not self._is_text_message(content):
                                    continue
                                # Extract readable text
                                readable_text = self._extract_readable_text(content)
                                if not readable_text and not include_non_text:
                                    continue
                            except Exception as e:
                                # Skip messages that cause processing errors
                                print(f"Error processing message in {json_file}: {e}")
                                continue

                            # Convert timestamp to readable format
                            if create_time:
                                try:
                                    timestamp = datetime.fromtimestamp(create_time)
                                    time_str = timestamp.strftime("%Y-%m-%d %H:%M:%S")
                                except (ValueError, OSError):
                                    time_str = str(create_time)
                            else:
                                time_str = "Unknown"

                            # Create document content with metadata header and contact info
                            doc_content = f"""
Contact: {contact_name}
Is sent from self: {is_sent_from_self}
Time: {time_str}
Message: {readable_text if readable_text else message_text}
"""
                            # Create document with embedded metadata
                            doc = Document(
                                text=doc_content, metadata={"contact_name": contact_name}
                            )
                            docs.append(doc)
                            count += 1

                except Exception as e:
                    print(f"Error reading {json_file}: {e}")
                    continue

            print(f"Loaded {len(docs)} WeChat chat documents")

        except Exception as e:
            print(f"Error reading WeChat history: {e}")
            return docs

        return docs

    @staticmethod
    def find_wechat_export_dirs() -> list[Path]:
        """
        Find all WeChat export directories.

        Returns:
            List of Path objects pointing to WeChat export directories
        """
        export_dirs = []

        # Look for common export directory names
        possible_dirs = [
            Path("./wechat_export"),
            Path("./wechat_export_direct"),
            Path("./wechat_chat_history"),
            Path("./chat_export"),
        ]

        for export_dir in possible_dirs:
            if export_dir.exists() and export_dir.is_dir():
                json_files = list(export_dir.glob("*.json"))
                if json_files:
                    export_dirs.append(export_dir)
                    print(
                        f"Found WeChat export directory: {export_dir} with {len(json_files)} files"
                    )

        print(f"Found {len(export_dirs)} WeChat export directories")
        return export_dirs

    @staticmethod
    def export_chat_to_file(
        output_file: str = "wechat_chat_export.txt",
        max_count: int = 1000,
        export_dir: str | None = None,
        include_non_text: bool = False,
    ):
        """
        Export WeChat chat history to a text file.

        Args:
            output_file: Path to the output file
            max_count: Maximum number of entries to export
            export_dir: Directory containing WeChat JSON files
            include_non_text: Whether to include non-text messages
        """
        if export_dir is None:
            export_dir = "./wechat_export_test"

        if not os.path.exists(export_dir):
            print(f"WeChat export directory not found at: {export_dir}")
            return

        try:
            json_files = list(Path(export_dir).glob("*.json"))

            with open(output_file, "w", encoding="utf-8") as f:
                count = 0
                for json_file in json_files:
                    if count >= max_count and max_count > 0:
                        break

                    try:
                        with open(json_file, encoding="utf-8") as json_f:
                            chat_data = json.load(json_f)

                        contact_name = json_file.stem
                        f.write(f"\n=== Chat with {contact_name} ===\n")

                        for message in chat_data:
                            if count >= max_count and max_count > 0:
                                break

                            from_user = message.get("fromUser", "")
                            content = message.get("content", "")
                            message_text = message.get("message", "")
                            create_time = message.get("createTime", 0)

                            # Skip non-text messages unless requested
                            if not include_non_text:
                                reader = WeChatHistoryReader()
                                if not reader._is_text_message(content):
                                    continue
                                readable_text = reader._extract_readable_text(content)
                                if not readable_text:
                                    continue
                                message_text = readable_text

                            if create_time:
                                try:
                                    timestamp = datetime.fromtimestamp(create_time)
                                    time_str = timestamp.strftime("%Y-%m-%d %H:%M:%S")
                                except (ValueError, OSError):
                                    time_str = str(create_time)
                            else:
                                time_str = "Unknown"

                            f.write(f"[{time_str}] {from_user}: {message_text}\n")
                            count += 1

                    except Exception as e:
                        print(f"Error processing {json_file}: {e}")
                        continue

            print(f"Exported {count} chat entries to {output_file}")

        except Exception as e:
            print(f"Error exporting WeChat chat history: {e}")

    def export_wechat_chat_history(self, export_dir: str = "./wechat_export_direct") -> Path | None:
        """
        Export WeChat chat history using wechat-exporter tool.

        Args:
            export_dir: Directory to save exported chat history

        Returns:
            Path to export directory if successful, None otherwise
        """
        try:
            import subprocess
            import sys

            # Create export directory
            export_path = Path(export_dir)
            export_path.mkdir(exist_ok=True)

            print(f"Exporting WeChat chat history to {export_path}...")

            # Check if wechat-exporter directory exists
            if not self.wechat_exporter_dir.exists():
                print(f"wechat-exporter directory not found at: {self.wechat_exporter_dir}")
                return None

            # Install requirements if needed
            requirements_file = self.wechat_exporter_dir / "requirements.txt"
            if requirements_file.exists():
                print("Installing wechat-exporter requirements...")
                subprocess.run(["uv", "pip", "install", "-r", str(requirements_file)], check=True)

            # Run the export command
            print("Running wechat-exporter...")
            result = subprocess.run(
                [
                    sys.executable,
                    str(self.wechat_exporter_dir / "main.py"),
                    "export-all",
                    str(export_path),
                ],
                capture_output=True,
                text=True,
                check=True,
            )

            print("Export command output:")
            print(result.stdout)
            if result.stderr:
                print("Export errors:")
                print(result.stderr)

            # Check if export was successful
            if export_path.exists() and any(export_path.glob("*.json")):
                json_files = list(export_path.glob("*.json"))
                print(
                    f"Successfully exported {len(json_files)} chat history files to {export_path}"
                )
                return export_path
            else:
                print("Export completed but no JSON files found")
                return None

        except subprocess.CalledProcessError as e:
            print(f"Export command failed: {e}")
            print(f"Command output: {e.stdout}")
            print(f"Command errors: {e.stderr}")
            return None
        except Exception as e:
            print(f"Export failed: {e}")
            print("Please ensure WeChat is running and WeChatTweak is installed.")
            return None

    def find_or_export_wechat_data(self, export_dir: str = "./wechat_export_direct") -> list[Path]:
        """
        Find existing WeChat exports or create new ones.

        Args:
            export_dir: Directory to save exported chat history if needed

        Returns:
            List of Path objects pointing to WeChat export directories
        """
        export_dirs = []

        # Look for existing exports in common locations
        possible_export_dirs = [
            Path("./wechat_database_export"),
            Path("./wechat_export_test"),
            Path("./wechat_export"),
            Path("./wechat_export_direct"),
            Path("./wechat_chat_history"),
            Path("./chat_export"),
        ]

        for export_dir_path in possible_export_dirs:
            if export_dir_path.exists() and any(export_dir_path.glob("*.json")):
                export_dirs.append(export_dir_path)
                print(f"Found existing export: {export_dir_path}")

        # If no existing exports, try to export automatically
        if not export_dirs:
            print("No existing WeChat exports found. Starting direct export...")

            # Try to export using wechat-exporter
            exported_path = self.export_wechat_chat_history(export_dir)
            if exported_path:
                export_dirs = [exported_path]
            else:
                print(
                    "Failed to export WeChat data. Please ensure WeChat is running and WeChatTweak is installed."
                )

        return export_dirs
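
A minimal usage sketch for WeChatHistoryReader, using only the methods and keyword arguments defined above (the wechat_rag example adds index building and chat on top; assumes the class is importable):

reader = WeChatHistoryReader()
export_dirs = reader.find_or_export_wechat_data()
if export_dirs:
    docs = reader.load_data(
        wechat_export_dir=str(export_dirs[0]),
        max_count=500,
        concatenate_messages=True,  # group short chat messages into larger chunks
        max_length=1000,
        time_window_minutes=30,
    )
    print(f"Loaded {len(docs)} WeChat documents")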