From 88997349528acfe26b96d30f67214a933623fe38 Mon Sep 17 00:00:00 2001 From: Andy Lee Date: Sun, 3 Aug 2025 23:06:24 -0700 Subject: [PATCH] refactor: Unify examples interface with BaseRAGExample (#12) * refactor: Unify examples interface with BaseRAGExample - Create BaseRAGExample base class for all RAG examples - Refactor 4 examples to use unified interface: - document_rag.py (replaces main_cli_example.py) - email_rag.py (replaces mail_reader_leann.py) - browser_rag.py (replaces google_history_reader_leann.py) - wechat_rag.py (replaces wechat_history_reader_leann.py) - Maintain 100% parameter compatibility with original files - Add interactive mode support for all examples - Unify parameter names (--max-items replaces --max-emails/--max-entries) - Update README.md with new examples usage - Add PARAMETER_CONSISTENCY.md documenting all parameter mappings - Keep main_cli_example.py for backward compatibility with migration notice All default values, LeannBuilder parameters, and chunking settings remain identical to ensure full compatibility with existing indexes. * fix: Update CI tests for new unified examples interface - Rename test_main_cli.py to test_document_rag.py - Update all references from main_cli_example.py to document_rag.py - Update tests/README.md documentation The tests now properly test the new unified interface while maintaining the same test coverage and functionality. * fix: Fix pre-commit issues and update tests - Fix import sorting and unused imports - Update type annotations to use built-in types (list, dict) instead of typing.List/Dict - Fix trailing whitespace and end-of-file issues - Fix Chinese fullwidth comma to regular comma - Update test_main_cli.py to test_document_rag.py - Add backward compatibility test for main_cli_example.py - Pass all pre-commit hooks (ruff, ruff-format, etc.) * refactor: Remove old example scripts and migration references - Delete old example scripts (mail_reader_leann.py, google_history_reader_leann.py, etc.) 
- Remove migration hints and backward compatibility - Update tests to use new unified examples directly - Clean up all references to old script names - Users now only see the new unified interface * fix: Restore embedding-mode parameter to all examples - All examples now have --embedding-mode parameter (unified interface benefit) - Default is 'sentence-transformers' (consistent with original behavior) - Users can now use OpenAI or MLX embeddings with any data source - Maintains functional equivalence with original scripts * docs: Improve parameter categorization in README - Clearly separate core (shared) vs specific parameters - Move LLM and embedding examples to 'Example Commands' section - Add descriptive comments for all specific parameters - Keep only truly data-source-specific parameters in specific sections * docs: Make example commands more representative - Add default values to parameter descriptions - Replace generic examples with real-world use cases - Focus on data-source-specific features in examples - Remove redundant demonstrations of common parameters * docs: Reorganize parameter documentation structure - Move common parameters to a dedicated section before all examples - Rename sections to 'X-Specific Arguments' for clarity - Remove duplicate common parameters from individual examples - Better information architecture for users * docs: polish applications * docs: Add CLI installation instructions - Add two installation options: venv and global uv tool - Clearly explain when to use each option - Make CLI more accessible for daily use * docs: Clarify CLI global installation process - Explain the transition from venv to global installation - Add upgrade command for global installation - Make it clear that global install allows usage without venv activation * docs: Add collapsible section for CLI installation - Wrap CLI installation instructions in details/summary tags - Keep consistent with other collapsible sections in README - Improve document readability and navigation * style: format * docs: Fix collapsible sections - Make Common Parameters collapsible (as it's lengthy reference material) - Keep CLI Installation visible (important for users to see immediately) - Better information hierarchy * docs: Add introduction for Common Parameters section - Add 'Flexible Configuration' heading with descriptive sentence - Create parallel structure with 'Generation Model Setup' section - Improve document flow and readability * docs: nit * fix: Fix issues in unified examples - Add smart path detection for data directory - Fix add_texts -> add_text method call - Handle both running from project root and examples directory * fix: Fix async/await and add_text issues in unified examples - Remove incorrect await from chat.ask() calls (not async) - Fix add_texts -> add_text method calls - Verify search-complexity correctly maps to efSearch parameter - All examples now run successfully * feat: Address review comments - Add complexity parameter to LeannChat initialization (default: search_complexity) - Fix chunk-size default in README documentation (256, not 2048) - Add more index building parameters as CLI arguments: - --backend-name (hnsw/diskann) - --graph-degree (default: 32) - --build-complexity (default: 64) - --no-compact (disable compact storage) - --no-recompute (disable embedding recomputation) - Update README to document all new parameters * feat: Add chunk-size parameters and improve file type filtering - Add --chunk-size and --chunk-overlap parameters to all RAG examples - Preserve 
original default values for each data source:
  - Document: 256/128 (optimized for general documents)
  - Email: 256/25 (smaller overlap for email threads)
  - Browser: 256/128 (standard for web content)
  - WeChat: 192/64 (smaller chunks for chat messages)
- Make --file-types optional filter instead of restriction in document_rag
- Update README to clarify interactive mode and parameter usage
- Fix LLM default model documentation (gpt-4o, not gpt-4o-mini)

* feat: Update documentation based on review feedback
- Add MLX embedding example to README
- Clarify examples/data content description (two papers, Pride and Prejudice, Chinese README)
- Move chunk parameters to common parameters section
- Remove duplicate chunk parameters from document-specific section

* docs: Emphasize diverse data sources in examples/data description

* fix: update default embedding models for better performance
- Change WeChat, Browser, and Email RAG examples to use all-MiniLM-L6-v2
- Previous Qwen/Qwen3-Embedding-0.6B was too slow for these use cases
- all-MiniLM-L6-v2 is a fast 384-dim model, ideal for large-scale personal data

* add response highlight
* change rebuild logic
* fix some examples
* feat: check if k is larger than #docs
* fix: WeChat history reader bugs and refactor wechat_rag to use unified architecture
* fix: handle -1 in email reader to process all files
* refactor: reorganize all examples/ and test/
* refactor: reorganize examples and add link checker
* fix: add __init__.py
* fix: handle certificate errors in link checker
* fix wechat
* merge
* docs: update README to use proper module imports for apps
- Change from 'python apps/xxx.py' to 'python -m apps.xxx'
- More professional and pythonic module calling
- Ensures proper module resolution and imports
- Better separation between apps/ (production tools) and examples/ (demos)

---------

Co-authored-by: yichuan520030910320
---
 .github/workflows/link-check.yml              |  19 +
 .gitignore                                    |  14 +-
 README.md                                     | 238 ++++++++----
 apps/__init__.py                              |   0
 apps/base_rag_example.py                      | 296 ++++++++++++++
 apps/browser_rag.py                           | 170 ++++++++
 apps/document_rag.py                          | 106 +++++
 .../email_data/LEANN_email_reader.py          |  66 +++-
 {examples => apps}/email_data/email.py        |   0
 apps/email_rag.py                             | 156 ++++++++
 {examples => apps}/history_data/__init__.py   |   0
 {examples => apps}/history_data/history.py    |   5 +
 .../history_data/wechat_history.py            |  16 +-
 apps/wechat_rag.py                            | 189 +++++++++
 {test/sanity_checks => benchmarks}/README.md  |   0
 .../benchmark_embeddings.py                   |   0
 .../compare_faiss_vs_leann.py                 |   4 +-
 {examples => benchmarks}/faiss_only.py        |   2 +-
 {test => benchmarks}/micro_tpt.py             |   0
 {examples => benchmarks}/run_evaluation.py    |  12 +-
 {test => benchmarks}/simple_mac_tpt_test.py   |   0
 {examples/data => data}/2501.14312v1 (1).pdf  | Bin
 {examples/data => data}/2506.08276v1.pdf      |   0
 {examples/data => data}/PrideandPrejudice.txt |   0
 data/README.md                                | 110 ++++--
 docs/features.md                              |   2 +-
 docs/normalized_embeddings.md                 |   2 +-
 examples/{simple_demo.py => basic_demo.py}    |   4 +-
 examples/data/README.md                       |  82 ----
 examples/document_search.py                   | 158 --------
 examples/google_history_reader_leann.py       | 362 ------------------
 examples/mail_reader_leann.py                 | 342 -----------------
 examples/mail_reader_llamaindex.py            | 135 -------
 examples/main_cli_example.py                  | 146 -------
 .../mlx_demo.py                               |   0
 examples/multi_vector_aggregator.py           | 360 -----------------
 examples/openai_hnsw_example.py               | 113 ------
 examples/resue_index.py                       |  23 --
 examples/wechat_history_reader_leann.py       | 320 ----------------
 packages/leann-core/src/leann/api.py          |  19 +-
 pyproject.toml                                |   8 +
test/mail_reader_llamaindex.py | 161 -------- test/mail_reader_save_load.py | 219 ----------- test/mail_reader_small_chunks.py | 219 ----------- test/mail_reader_test.py | 154 -------- test/query_saved_index.py | 105 ----- test/sanity_checks/debug_zmq_issue.py | 117 ------ tests/README.md | 4 +- tests/test_ci_minimal.py | 2 +- ...{test_main_cli.py => test_document_rag.py} | 26 +- 50 files changed, 1293 insertions(+), 3193 deletions(-) create mode 100644 .github/workflows/link-check.yml create mode 100644 apps/__init__.py create mode 100644 apps/base_rag_example.py create mode 100644 apps/browser_rag.py create mode 100644 apps/document_rag.py rename {examples => apps}/email_data/LEANN_email_reader.py (58%) rename {examples => apps}/email_data/email.py (100%) create mode 100644 apps/email_rag.py rename {examples => apps}/history_data/__init__.py (100%) rename {examples => apps}/history_data/history.py (95%) rename {examples => apps}/history_data/wechat_history.py (98%) create mode 100644 apps/wechat_rag.py rename {test/sanity_checks => benchmarks}/README.md (100%) rename {test/sanity_checks => benchmarks}/benchmark_embeddings.py (100%) rename {examples => benchmarks}/compare_faiss_vs_leann.py (99%) rename {examples => benchmarks}/faiss_only.py (99%) rename {test => benchmarks}/micro_tpt.py (100%) rename {examples => benchmarks}/run_evaluation.py (97%) rename {test => benchmarks}/simple_mac_tpt_test.py (100%) rename {examples/data => data}/2501.14312v1 (1).pdf (100%) rename {examples/data => data}/2506.08276v1.pdf (100%) rename {examples/data => data}/PrideandPrejudice.txt (100%) rename examples/{simple_demo.py => basic_demo.py} (96%) delete mode 100644 examples/data/README.md delete mode 100644 examples/document_search.py delete mode 100644 examples/google_history_reader_leann.py delete mode 100644 examples/mail_reader_leann.py delete mode 100644 examples/mail_reader_llamaindex.py delete mode 100644 examples/main_cli_example.py rename test/build_mlx_index.py => examples/mlx_demo.py (100%) delete mode 100644 examples/multi_vector_aggregator.py delete mode 100644 examples/openai_hnsw_example.py delete mode 100644 examples/resue_index.py delete mode 100644 examples/wechat_history_reader_leann.py delete mode 100644 test/mail_reader_llamaindex.py delete mode 100644 test/mail_reader_save_load.py delete mode 100644 test/mail_reader_small_chunks.py delete mode 100644 test/mail_reader_test.py delete mode 100644 test/query_saved_index.py delete mode 100644 test/sanity_checks/debug_zmq_issue.py rename tests/{test_main_cli.py => test_document_rag.py} (82%) diff --git a/.github/workflows/link-check.yml b/.github/workflows/link-check.yml new file mode 100644 index 0000000..87a8077 --- /dev/null +++ b/.github/workflows/link-check.yml @@ -0,0 +1,19 @@ +name: Link Check + +on: + push: + branches: [ main, master ] + pull_request: + schedule: + - cron: "0 3 * * 1" + +jobs: + link-check: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: lycheeverse/lychee-action@v2 + with: + args: --no-progress --insecure README.md docs/ apps/ examples/ benchmarks/ + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} diff --git a/.gitignore b/.gitignore index 3d49d4d..4b51945 100755 --- a/.gitignore +++ b/.gitignore @@ -34,11 +34,15 @@ build/ nprobe_logs/ micro/results micro/contriever-INT8 -examples/data/* -!examples/data/2501.14312v1 (1).pdf -!examples/data/2506.08276v1.pdf -!examples/data/PrideandPrejudice.txt -!examples/data/README.md +data/* +!data/2501.14312v1 (1).pdf +!data/2506.08276v1.pdf 
+!data/PrideandPrejudice.txt +!data/README.md +!data/ground_truth/ +!data/indices/ +!data/queries/ +!data/.gitattributes *.qdstrm benchmark_results/ results/ diff --git a/README.md b/README.md index a5b7280..332c11c 100755 --- a/README.md +++ b/README.md @@ -41,40 +41,40 @@ LEANN achieves this through *graph-based selective recomputation* with *high-deg ## Installation -
-๐Ÿ“ฆ Prerequisites: Install uv (if you don't have it) +### ๐Ÿ“ฆ Prerequisites: Install uv -Install uv first if you don't have it: +[Install uv](https://docs.astral.sh/uv/getting-started/installation/#installation-methods) first if you don't have it. Typically, you can install it with: ```bash curl -LsSf https://astral.sh/uv/install.sh | sh ``` -๐Ÿ“– [Detailed uv installation methods โ†’](https://docs.astral.sh/uv/getting-started/installation/#installation-methods) +### ๐Ÿš€ Quick Install -
- - -LEANN provides two installation methods: **pip install** (quick and easy) and **build from source** (recommended for development). - - - -### ๐Ÿš€ Quick Install (Recommended for most users) - -Clone the repository to access all examples and install LEANN from [PyPI](https://pypi.org/project/leann/) to run them immediately: +Clone the repository to access all examples and try amazing applications, ```bash -git clone git@github.com:yichuan-w/LEANN.git leann +git clone https://github.com/yichuan-w/LEANN.git leann cd leann +``` + +and install LEANN from [PyPI](https://pypi.org/project/leann/) to run them immediately: + +```bash uv venv source .venv/bin/activate uv pip install leann ``` -### ๐Ÿ”ง Build from Source (Recommended for development) +
+ +๐Ÿ”ง Build from Source (Recommended for development) + + + ```bash -git clone git@github.com:yichuan-w/LEANN.git leann +git clone https://github.com/yichuan-w/LEANN.git leann cd leann git submodule update --init --recursive ``` @@ -91,14 +91,14 @@ sudo apt-get install libomp-dev libboost-all-dev protobuf-compiler libabsl-dev l uv sync ``` - +
## Quick Start Our declarative API makes RAG as easy as writing a config file. -[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yichuan-w/LEANN/blob/main/demo.ipynb) [Try in this ipynb file โ†’](demo.ipynb) +Check out [demo.ipynb](demo.ipynb) or [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yichuan-w/LEANN/blob/main/demo.ipynb) ```python from leann import LeannBuilder, LeannSearcher, LeannChat @@ -122,11 +122,11 @@ response = chat.ask("How much storage does LEANN save?", top_k=1) ## RAG on Everything! -LEANN supports RAG on various data sources including documents (.pdf, .txt, .md), Apple Mail, Google Search History, WeChat, and more. +LEANN supports RAG on various data sources including documents (`.pdf`, `.txt`, `.md`), Apple Mail, Google Search History, WeChat, and more. +### Generation Model Setup -> **Generation Model Setup** -> LEANN supports multiple LLM providers for text generation (OpenAI API, HuggingFace, Ollama). +LEANN supports multiple LLM providers for text generation (OpenAI API, HuggingFace, Ollama).
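As a sketch of how these providers are selected in code, `get_llm_config()` in this PR's `apps/base_rag_example.py` builds one of three `llm_config` dictionaries for `LeannChat` (the model names below are the PR's defaults):

```python
# Sketch: llm_config shapes produced by BaseRAGExample.get_llm_config().
openai_cfg = {"type": "openai", "model": "gpt-4o"}
ollama_cfg = {"type": "ollama", "model": "llama3.2:1b", "host": "http://localhost:11434"}
hf_cfg = {"type": "hf", "model": "Qwen/Qwen2.5-1.5B-Instruct"}
```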
๐Ÿ”‘ OpenAI API Setup (Default) @@ -166,7 +166,49 @@ ollama pull llama3.2:1b
-### ๐Ÿ“„ Personal Data Manager: Process Any Documents (.pdf, .txt, .md)! +### Flexible Configuration + +LEANN provides flexible parameters for embedding models, search strategies, and data processing to fit your specific needs. + +
+๐Ÿ“‹ Click to expand: Common Parameters (Available in All Examples) + +All RAG examples share these common parameters. **Interactive mode** is available in all examples - simply run without `--query` to start a continuous Q&A session where you can ask multiple questions. Type 'quit' to exit. + +```bash +# Core Parameters (General preprocessing for all examples) +--index-dir DIR # Directory to store the index (default: current directory) +--query "YOUR QUESTION" # Single query mode. Omit for interactive chat (type 'quit' to exit), and now you can play with your index interactively +--max-items N # Limit data preprocessing (default: -1, process all data) +--force-rebuild # Force rebuild index even if it exists + +# Embedding Parameters +--embedding-model MODEL # e.g., facebook/contriever, text-embedding-3-small or mlx-community/multilingual-e5-base-mlx +--embedding-mode MODE # sentence-transformers, openai, or mlx + +# LLM Parameters (Text generation models) +--llm TYPE # LLM backend: openai, ollama, or hf (default: openai) +--llm-model MODEL # Model name (default: gpt-4o) e.g., gpt-4o-mini, llama3.2:1b, Qwen/Qwen2.5-1.5B-Instruct + +# Search Parameters +--top-k N # Number of results to retrieve (default: 20) +--search-complexity N # Search complexity for graph traversal (default: 32) + +# Chunking Parameters +--chunk-size N # Size of text chunks (default varies by source: 256 for most, 192 for WeChat) +--chunk-overlap N # Overlap between chunks (default varies: 25-128 depending on source) + +# Index Building Parameters +--backend-name NAME # Backend to use: hnsw or diskann (default: hnsw) +--graph-degree N # Graph degree for index construction (default: 32) +--build-complexity N # Build complexity for index construction (default: 64) +--no-compact # Disable compact index storage (compact storage IS enabled to save storage by default) +--no-recompute # Disable embedding recomputation (recomputation IS enabled to save storage by default) +``` + +
+ +### ๐Ÿ“„ Personal Data Manager: Process Any Documents (`.pdf`, `.txt`, `.md`)! Ask questions directly about your personal PDFs, documents, and any directory containing your files! @@ -174,25 +216,29 @@ Ask questions directly about your personal PDFs, documents, and any directory co LEANN Document Search Demo

-The example below asks a question about summarizing two papers (uses default data in `examples/data`) and this is the easiest example to run here:
+The example below asks a question about our paper and is the **easiest example** to run (it uses the default data in `data/`, a directory with diverse sources: two papers, Pride and Prejudice, and a README in Chinese):

 ```bash
-source .venv/bin/activate
-python ./examples/main_cli_example.py
+source .venv/bin/activate  # Don't forget to activate the virtual environment
+python -m apps.document_rag --query "What are the main techniques LEANN explores?"
 ```
-๐Ÿ“‹ Click to expand: User Configurable Arguments +๐Ÿ“‹ Click to expand: Document-Specific Arguments +#### Parameters ```bash -# Use custom index directory -python examples/main_cli_example.py --index-dir "./my_custom_index" +--data-dir DIR # Directory containing documents to process (default: data) +--file-types .ext .ext # Filter by specific file types (optional - all LlamaIndex supported types if omitted) +``` -# Use custom data directory -python examples/main_cli_example.py --data-dir "./my_documents" +#### Example Commands +```bash +# Process all documents with larger chunks for academic papers +python -m apps.document_rag --data-dir "~/Documents/Papers" --chunk-size 1024 -# Ask a specific question -python examples/main_cli_example.py --query "What are the main findings in these papers?" +# Filter only markdown and Python files with smaller chunks +python -m apps.document_rag --data-dir "./docs" --chunk-size 256 --file-types .md .py ```
@@ -206,30 +252,29 @@ python examples/main_cli_example.py --query "What are the main findings in these LEANN Email Search Demo

-**Note:** You need to grant full disk access to your terminal/VS Code in System Preferences โ†’ Privacy & Security โ†’ Full Disk Access. +Before running the example below, you need to grant full disk access to your terminal/VS Code in System Preferences โ†’ Privacy & Security โ†’ Full Disk Access. + ```bash -python examples/mail_reader_leann.py --query "What's the food I ordered by DoorDash or Uber Eats mostly?" +python -m apps.email_rag --query "What's the food I ordered by DoorDash or Uber Eats mostly?" ``` **780K email chunks โ†’ 78MB storage.** Finally, search your email like you search Google.
-๐Ÿ“‹ Click to expand: User Configurable Arguments +๐Ÿ“‹ Click to expand: Email-Specific Arguments +#### Parameters ```bash -# Use default mail path (works for most macOS setups) -python examples/mail_reader_leann.py +--mail-path PATH # Path to specific mail directory (auto-detects if omitted) +--include-html # Include HTML content in processing (useful for newsletters) +``` -# Run with custom index directory -python examples/mail_reader_leann.py --index-dir "./my_mail_index" +#### Example Commands +```bash +# Search work emails from a specific account +python -m apps.email_rag --mail-path "~/Library/Mail/V10/WORK_ACCOUNT" -# Process all emails (may take time but indexes everything) -python examples/mail_reader_leann.py --max-emails -1 - -# Limit number of emails processed (useful for testing) -python examples/mail_reader_leann.py --max-emails 1000 - -# Run a single query -python examples/mail_reader_leann.py --query "What did my boss say about deadlines?" +# Find all receipts and order confirmations (includes HTML) +python -m apps.email_rag --query "receipt order confirmation invoice" --include-html ```
@@ -250,25 +295,25 @@ Once the index is built, you can ask questions like:

```bash -python examples/google_history_reader_leann.py --query "Tell me my browser history about machine learning?" +python -m apps.browser_rag --query "Tell me my browser history about machine learning?" ``` **38K browser entries โ†’ 6MB storage.** Your browser history becomes your personal search engine.
-๐Ÿ“‹ Click to expand: User Configurable Arguments +๐Ÿ“‹ Click to expand: Browser-Specific Arguments +#### Parameters ```bash -# Use default Chrome profile (auto-finds all profiles) -python examples/google_history_reader_leann.py +--chrome-profile PATH # Path to Chrome profile directory (auto-detects if omitted) +``` -# Run with custom index directory -python examples/google_history_reader_leann.py --index-dir "./my_chrome_index" +#### Example Commands +```bash +# Search academic research from your browsing history +python -m apps.browser_rag --query "arxiv papers machine learning transformer architecture" -# Limit number of history entries processed (useful for testing) -python examples/google_history_reader_leann.py --max-entries 500 - -# Run a single query -python examples/google_history_reader_leann.py --query "What websites did I visit about machine learning?" +# Track competitor analysis across work profile +python -m apps.browser_rag --chrome-profile "~/Library/Application Support/Google/Chrome/Work Profile" --max-items 5000 ```
@@ -308,7 +353,7 @@ Once the index is built, you can ask questions like:

```bash -python examples/wechat_history_reader_leann.py --query "Show me all group chats about weekend plans" +python -m apps.wechat_rag --query "Show me all group chats about weekend plans" ``` **400K messages โ†’ 64MB storage** Search years of chat history in any language. @@ -316,7 +361,13 @@ python examples/wechat_history_reader_leann.py --query "Show me all group chats
๐Ÿ”ง Click to expand: Installation Requirements -First, you need to install the WeChat exporter: +First, you need to install the [WeChat exporter](https://github.com/sunnyyoung/WeChatTweak-CLI), + +```bash +brew install sunnyyoung/repo/wechattweak-cli +``` + +or install it manually (if you have issues with Homebrew): ```bash sudo packages/wechat-exporter/wechattweak-cli install @@ -325,30 +376,28 @@ sudo packages/wechat-exporter/wechattweak-cli install **Troubleshooting:** - **Installation issues**: Check the [WeChatTweak-CLI issues page](https://github.com/sunnyyoung/WeChatTweak-CLI/issues/41) - **Export errors**: If you encounter the error below, try restarting WeChat -``` -Failed to export WeChat data. Please ensure WeChat is running and WeChatTweak is installed. -Failed to find or export WeChat data. Exiting. -``` + ```bash + Failed to export WeChat data. Please ensure WeChat is running and WeChatTweak is installed. + Failed to find or export WeChat data. Exiting. + ```
-๐Ÿ“‹ Click to expand: User Configurable Arguments +๐Ÿ“‹ Click to expand: WeChat-Specific Arguments +#### Parameters ```bash -# Use default settings (recommended for first run) -python examples/wechat_history_reader_leann.py +--export-dir DIR # Directory to store exported WeChat data (default: wechat_export_direct) +--force-export # Force re-export even if data exists +``` -# Run with custom export directory and wehn we run the first time, LEANN will export all chat history automatically for you -python examples/wechat_history_reader_leann.py --export-dir "./my_wechat_exports" +#### Example Commands +```bash +# Search for travel plans discussed in group chats +python -m apps.wechat_rag --query "travel plans" --max-items 10000 -# Run with custom index directory -python examples/wechat_history_reader_leann.py --index-dir "./my_wechat_index" - -# Limit number of chat entries processed (useful for testing) -python examples/wechat_history_reader_leann.py --max-entries 1000 - -# Run a single query -python examples/wechat_history_reader_leann.py --query "Show me conversations about travel plans" +# Re-export and search recent chats (useful after new messages) +python -m apps.wechat_rag --force-export --query "work schedule" ```
@@ -368,6 +417,27 @@ Once the index is built, you can ask questions like: LEANN includes a powerful CLI for document processing and search. Perfect for quick document indexing and interactive chat. +### Installation + +If you followed the Quick Start, `leann` is already installed in your virtual environment: +```bash +source .venv/bin/activate +leann --help +``` + +**To make it globally available (recommended for daily use):** +```bash +# Install the LEANN CLI globally using uv tool +uv tool install leann + +# Now you can use leann from anywhere without activating venv +leann --help +``` + + + +### Usage Examples + ```bash # Build an index from documents leann build my-docs --docs ./documents @@ -449,8 +519,8 @@ Options: ## Benchmarks -๐Ÿ“Š **[Simple Example: Compare LEANN vs FAISS โ†’](examples/compare_faiss_vs_leann.py)** -### Storage Comparison +**[Simple Example: Compare LEANN vs FAISS โ†’](benchmarks/compare_faiss_vs_leann.py)** +### ๐Ÿ“Š Storage Comparison | System | DPR (2.1M) | Wiki (60M) | Chat (400K) | Email (780K) | Browser (38K) | |--------|-------------|------------|-------------|--------------|---------------| @@ -464,8 +534,8 @@ Options: ```bash uv pip install -e ".[dev]" # Install dev dependencies -python examples/run_evaluation.py data/indices/dpr/dpr_diskann # DPR dataset -python examples/run_evaluation.py data/indices/rpj_wiki/rpj_wiki.index # Wikipedia +python benchmarks/run_evaluation.py data/indices/dpr/dpr_diskann # DPR dataset +python benchmarks/run_evaluation.py data/indices/rpj_wiki/rpj_wiki.index # Wikipedia ``` The evaluation script downloads data automatically on first run. The last three results were tested with partial personal data, and you can reproduce them with your own data! @@ -503,7 +573,9 @@ MIT License - see [LICENSE](LICENSE) for details. ## ๐Ÿ™ Acknowledgments -This work is done atย [**Berkeley Sky Computing Lab**](https://sky.cs.berkeley.edu/) +This work is done atย [**Berkeley Sky Computing Lab**](https://sky.cs.berkeley.edu/). + + ---

diff --git a/apps/__init__.py b/apps/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/apps/base_rag_example.py b/apps/base_rag_example.py
new file mode 100644
index 0000000..a135625
--- /dev/null
+++ b/apps/base_rag_example.py
@@ -0,0 +1,296 @@
+"""
+Base class for unified RAG examples interface.
+Provides common parameters and functionality for all RAG examples.
+"""
+
+import argparse
+from abc import ABC, abstractmethod
+from pathlib import Path
+from typing import Any
+
+import dotenv
+from leann.api import LeannBuilder, LeannChat
+from llama_index.core.node_parser import SentenceSplitter
+
+dotenv.load_dotenv()
+
+
+class BaseRAGExample(ABC):
+    """Base class for all RAG examples with unified interface."""
+
+    def __init__(
+        self,
+        name: str,
+        description: str,
+        default_index_name: str,
+    ):
+        self.name = name
+        self.description = description
+        self.default_index_name = default_index_name
+        self.parser = self._create_parser()
+
+    def _create_parser(self) -> argparse.ArgumentParser:
+        """Create argument parser with common parameters."""
+        parser = argparse.ArgumentParser(
+            description=self.description, formatter_class=argparse.RawDescriptionHelpFormatter
+        )
+
+        # Core parameters (all examples share these)
+        core_group = parser.add_argument_group("Core Parameters")
+        core_group.add_argument(
+            "--index-dir",
+            type=str,
+            default=f"./{self.default_index_name}",
+            help=f"Directory to store the index (default: ./{self.default_index_name})",
+        )
+        core_group.add_argument(
+            "--query",
+            type=str,
+            default=None,
+            help="Query to run (if not provided, will run in interactive mode)",
+        )
+        # Allow subclasses to override default max_items
+        max_items_default = getattr(self, "max_items_default", -1)
+        core_group.add_argument(
+            "--max-items",
+            type=int,
+            default=max_items_default,
+            help="Maximum number of items to process (default: -1, index everything; set a smaller limit on a first run over a large dataset)",
+        )
+        core_group.add_argument(
+            "--force-rebuild", action="store_true", help="Force rebuild index even if it exists"
+        )
+
+        # Embedding parameters
+        embedding_group = parser.add_argument_group("Embedding Parameters")
+        # Allow subclasses to override default embedding_model
+        embedding_model_default = getattr(self, "embedding_model_default", "facebook/contriever")
+        embedding_group.add_argument(
+            "--embedding-model",
+            type=str,
+            default=embedding_model_default,
+            help=f"Embedding model to use (default: {embedding_model_default})",
+        )
+        embedding_group.add_argument(
+            "--embedding-mode",
+            type=str,
+            default="sentence-transformers",
+            choices=["sentence-transformers", "openai", "mlx"],
+            help="Embedding backend mode (default: sentence-transformers)",
+        )
+
+        # LLM parameters
+        llm_group = parser.add_argument_group("LLM Parameters")
+        llm_group.add_argument(
+            "--llm",
+            type=str,
+            default="openai",
+            choices=["openai", "ollama", "hf"],
+            help="LLM backend to use (default: openai)",
+        )
+        llm_group.add_argument(
+            "--llm-model",
+            type=str,
+            default=None,
+            help="LLM model name (default: gpt-4o for openai, llama3.2:1b for ollama)",
+        )
+        llm_group.add_argument(
+            "--llm-host",
+            type=str,
+            default="http://localhost:11434",
+            help="Host for Ollama API (default: http://localhost:11434)",
+        )
+
+        # Search parameters
+        search_group = parser.add_argument_group("Search Parameters")
+        search_group.add_argument(
+            "--top-k", type=int, default=20, help="Number of results to retrieve (default: 20)"
+        )
+        search_group.add_argument(
+            "--search-complexity",
+            type=int,
+            default=32,
+            help="Search complexity for graph traversal (default: 32)",
+        )
+
+        # Index building parameters
+        index_group = parser.add_argument_group("Index Building Parameters")
+        index_group.add_argument(
+            "--backend-name",
+            type=str,
+            default="hnsw",
+            choices=["hnsw", "diskann"],
+            help="Backend to use for index (default: hnsw)",
+        )
+        index_group.add_argument(
+            "--graph-degree",
+            type=int,
+            default=32,
+            help="Graph degree for index construction (default: 32)",
+        )
+        index_group.add_argument(
+            "--build-complexity",
+            type=int,
+            default=64,
+            help="Build complexity for index construction (default: 64)",
+        )
+        index_group.add_argument(
+            "--no-compact",
+            action="store_true",
+            help="Disable compact index storage",
+        )
+        index_group.add_argument(
+            "--no-recompute",
+            action="store_true",
+            help="Disable embedding recomputation",
+        )
+
+        # Add source-specific parameters
+        self._add_specific_arguments(parser)
+
+        return parser
+
+    @abstractmethod
+    def _add_specific_arguments(self, parser: argparse.ArgumentParser):
+        """Add source-specific arguments. Override in subclasses."""
+        pass
+
+    @abstractmethod
+    async def load_data(self, args) -> list[str]:
+        """Load data from the source. Returns list of text chunks."""
+        pass
+
+    def get_llm_config(self, args) -> dict[str, Any]:
+        """Get LLM configuration based on arguments."""
+        config = {"type": args.llm}
+
+        if args.llm == "openai":
+            config["model"] = args.llm_model or "gpt-4o"
+        elif args.llm == "ollama":
+            config["model"] = args.llm_model or "llama3.2:1b"
+            config["host"] = args.llm_host
+        elif args.llm == "hf":
+            config["model"] = args.llm_model or "Qwen/Qwen2.5-1.5B-Instruct"
+
+        return config
+
+    async def build_index(self, args, texts: list[str]) -> str:
+        """Build LEANN index from texts."""
+        index_path = str(Path(args.index_dir) / f"{self.default_index_name}.leann")
+
+        print(f"\n[Building Index] Creating {self.name} index...")
+        print(f"Total text chunks: {len(texts)}")
+
+        builder = LeannBuilder(
+            backend_name=args.backend_name,
+            embedding_model=args.embedding_model,
+            embedding_mode=args.embedding_mode,
+            graph_degree=args.graph_degree,
+            complexity=args.build_complexity,
+            is_compact=not args.no_compact,
+            is_recompute=not args.no_recompute,
+            num_threads=1,  # Force single-threaded mode
+        )
+
+        # Add texts in batches for better progress tracking
+        batch_size = 1000
+        for i in range(0, len(texts), batch_size):
+            batch = texts[i : i + batch_size]
+            for text in batch:
+                builder.add_text(text)
+            print(f"Added {min(i + batch_size, len(texts))}/{len(texts)} texts...")
+
+        print("Building index structure...")
+        builder.build_index(index_path)
+        print(f"Index saved to: {index_path}")
+
+        return index_path
+
+    async def run_interactive_chat(self, args, index_path: str):
+        """Run interactive chat with the index."""
+        chat = LeannChat(
+            index_path,
+            llm_config=self.get_llm_config(args),
+            system_prompt=f"You are a helpful assistant that answers questions about {self.name} data.",
+            complexity=args.search_complexity,
+        )
+
+        print(f"\n[Interactive Mode] Chat with your {self.name} data!")
+        print("Type 'quit' or 'exit' to stop.\n")
+
+        while True:
+            try:
+                query = input("You: ").strip()
+                if query.lower() in ["quit", "exit", "q"]:
+                    print("Goodbye!")
+                    break
+
+                if not query:
+                    continue
+
+                response = chat.ask(query, top_k=args.top_k, complexity=args.search_complexity)
+                print(f"\nAssistant: {response}\n")
+
+            except KeyboardInterrupt:
+                print("\nGoodbye!")
break + except Exception as e: + print(f"Error: {e}") + + async def run_single_query(self, args, index_path: str, query: str): + """Run a single query against the index.""" + chat = LeannChat( + index_path, + llm_config=self.get_llm_config(args), + system_prompt=f"You are a helpful assistant that answers questions about {self.name} data.", + complexity=args.search_complexity, + ) + + print(f"\n[Query]: \033[36m{query}\033[0m") + response = chat.ask(query, top_k=args.top_k, complexity=args.search_complexity) + print(f"\n[Response]: \033[36m{response}\033[0m") + + async def run(self): + """Main entry point for the example.""" + args = self.parser.parse_args() + + # Check if index exists + index_path = str(Path(args.index_dir) / f"{self.default_index_name}.leann") + index_exists = Path(args.index_dir).exists() + + if not index_exists or args.force_rebuild: + # Load data and build index + print(f"\n{'Rebuilding' if index_exists else 'Building'} index...") + texts = await self.load_data(args) + + if not texts: + print("No data found to index!") + return + + index_path = await self.build_index(args, texts) + else: + print(f"\nUsing existing index in {args.index_dir}") + + # Run query or interactive mode + if args.query: + await self.run_single_query(args, index_path, args.query) + else: + await self.run_interactive_chat(args, index_path) + + +def create_text_chunks(documents, chunk_size=256, chunk_overlap=25) -> list[str]: + """Helper function to create text chunks from documents.""" + node_parser = SentenceSplitter( + chunk_size=chunk_size, + chunk_overlap=chunk_overlap, + separator=" ", + paragraph_separator="\n\n", + ) + + all_texts = [] + for doc in documents: + nodes = node_parser.get_nodes_from_documents([doc]) + if nodes: + all_texts.extend(node.get_content() for node in nodes) + + return all_texts diff --git a/apps/browser_rag.py b/apps/browser_rag.py new file mode 100644 index 0000000..d115510 --- /dev/null +++ b/apps/browser_rag.py @@ -0,0 +1,170 @@ +""" +Browser History RAG example using the unified interface. +Supports Chrome browser history. 
+""" + +import os +import sys +from pathlib import Path + +# Add parent directory to path for imports +sys.path.insert(0, str(Path(__file__).parent)) + +from base_rag_example import BaseRAGExample, create_text_chunks + +from .history_data.history import ChromeHistoryReader + + +class BrowserRAG(BaseRAGExample): + """RAG example for Chrome browser history.""" + + def __init__(self): + # Set default values BEFORE calling super().__init__ + self.embedding_model_default = ( + "sentence-transformers/all-MiniLM-L6-v2" # Fast 384-dim model + ) + + super().__init__( + name="Browser History", + description="Process and query Chrome browser history with LEANN", + default_index_name="google_history_index", + ) + + def _add_specific_arguments(self, parser): + """Add browser-specific arguments.""" + browser_group = parser.add_argument_group("Browser Parameters") + browser_group.add_argument( + "--chrome-profile", + type=str, + default=None, + help="Path to Chrome profile directory (auto-detected if not specified)", + ) + browser_group.add_argument( + "--auto-find-profiles", + action="store_true", + default=True, + help="Automatically find all Chrome profiles (default: True)", + ) + browser_group.add_argument( + "--chunk-size", type=int, default=256, help="Text chunk size (default: 256)" + ) + browser_group.add_argument( + "--chunk-overlap", type=int, default=128, help="Text chunk overlap (default: 128)" + ) + + def _get_chrome_base_path(self) -> Path: + """Get the base Chrome profile path based on OS.""" + if sys.platform == "darwin": + return Path.home() / "Library" / "Application Support" / "Google" / "Chrome" + elif sys.platform.startswith("linux"): + return Path.home() / ".config" / "google-chrome" + elif sys.platform == "win32": + return Path(os.environ["LOCALAPPDATA"]) / "Google" / "Chrome" / "User Data" + else: + raise ValueError(f"Unsupported platform: {sys.platform}") + + def _find_chrome_profiles(self) -> list[Path]: + """Auto-detect all Chrome profiles.""" + base_path = self._get_chrome_base_path() + if not base_path.exists(): + return [] + + profiles = [] + + # Check Default profile + default_profile = base_path / "Default" + if default_profile.exists() and (default_profile / "History").exists(): + profiles.append(default_profile) + + # Check numbered profiles + for item in base_path.iterdir(): + if item.is_dir() and item.name.startswith("Profile "): + if (item / "History").exists(): + profiles.append(item) + + return profiles + + async def load_data(self, args) -> list[str]: + """Load browser history and convert to text chunks.""" + # Determine Chrome profiles + if args.chrome_profile and not args.auto_find_profiles: + profile_dirs = [Path(args.chrome_profile)] + else: + print("Auto-detecting Chrome profiles...") + profile_dirs = self._find_chrome_profiles() + + # If specific profile given, filter to just that one + if args.chrome_profile: + profile_path = Path(args.chrome_profile) + profile_dirs = [p for p in profile_dirs if p == profile_path] + + if not profile_dirs: + print("No Chrome profiles found!") + print("Please specify --chrome-profile manually") + return [] + + print(f"Found {len(profile_dirs)} Chrome profiles") + + # Create reader + reader = ChromeHistoryReader() + + # Process each profile + all_documents = [] + total_processed = 0 + + for i, profile_dir in enumerate(profile_dirs): + print(f"\nProcessing profile {i + 1}/{len(profile_dirs)}: {profile_dir.name}") + + try: + # Apply max_items limit per profile + max_per_profile = -1 + if args.max_items > 0: + remaining = 
args.max_items - total_processed + if remaining <= 0: + break + max_per_profile = remaining + + # Load history + documents = reader.load_data( + chrome_profile_path=str(profile_dir), + max_count=max_per_profile, + ) + + if documents: + all_documents.extend(documents) + total_processed += len(documents) + print(f"Processed {len(documents)} history entries from this profile") + + except Exception as e: + print(f"Error processing {profile_dir}: {e}") + continue + + if not all_documents: + print("No browser history found to process!") + return [] + + print(f"\nTotal history entries processed: {len(all_documents)}") + + # Convert to text chunks + all_texts = create_text_chunks( + all_documents, chunk_size=args.chunk_size, chunk_overlap=args.chunk_overlap + ) + + return all_texts + + +if __name__ == "__main__": + import asyncio + + # Example queries for browser history RAG + print("\n๐ŸŒ Browser History RAG Example") + print("=" * 50) + print("\nExample queries you can try:") + print("- 'What websites did I visit about machine learning?'") + print("- 'Find my search history about programming'") + print("- 'What YouTube videos did I watch recently?'") + print("- 'Show me websites about travel planning'") + print("\nNote: Make sure Chrome is closed before running\n") + + rag = BrowserRAG() + asyncio.run(rag.run()) diff --git a/apps/document_rag.py b/apps/document_rag.py new file mode 100644 index 0000000..02c954a --- /dev/null +++ b/apps/document_rag.py @@ -0,0 +1,106 @@ +""" +Document RAG example using the unified interface. +Supports PDF, TXT, MD, and other document formats. +""" + +import sys +from pathlib import Path + +# Add parent directory to path for imports +sys.path.insert(0, str(Path(__file__).parent)) + +from base_rag_example import BaseRAGExample, create_text_chunks +from llama_index.core import SimpleDirectoryReader + + +class DocumentRAG(BaseRAGExample): + """RAG example for document processing (PDF, TXT, MD, etc.).""" + + def __init__(self): + super().__init__( + name="Document", + description="Process and query documents (PDF, TXT, MD, etc.) with LEANN", + default_index_name="test_doc_files", + ) + + def _add_specific_arguments(self, parser): + """Add document-specific arguments.""" + doc_group = parser.add_argument_group("Document Parameters") + doc_group.add_argument( + "--data-dir", + type=str, + default="data", + help="Directory containing documents to index (default: data)", + ) + doc_group.add_argument( + "--file-types", + nargs="+", + default=None, + help="Filter by file types (e.g., .pdf .txt .md). 
If not specified, all supported types are processed", + ) + doc_group.add_argument( + "--chunk-size", type=int, default=256, help="Text chunk size (default: 256)" + ) + doc_group.add_argument( + "--chunk-overlap", type=int, default=128, help="Text chunk overlap (default: 128)" + ) + + async def load_data(self, args) -> list[str]: + """Load documents and convert to text chunks.""" + print(f"Loading documents from: {args.data_dir}") + if args.file_types: + print(f"Filtering by file types: {args.file_types}") + else: + print("Processing all supported file types") + + # Check if data directory exists + data_path = Path(args.data_dir) + if not data_path.exists(): + raise ValueError(f"Data directory not found: {args.data_dir}") + + # Load documents + reader_kwargs = { + "recursive": True, + "encoding": "utf-8", + } + if args.file_types: + reader_kwargs["required_exts"] = args.file_types + + documents = SimpleDirectoryReader(args.data_dir, **reader_kwargs).load_data( + show_progress=True + ) + + if not documents: + print(f"No documents found in {args.data_dir} with extensions {args.file_types}") + return [] + + print(f"Loaded {len(documents)} documents") + + # Convert to text chunks + all_texts = create_text_chunks( + documents, chunk_size=args.chunk_size, chunk_overlap=args.chunk_overlap + ) + + # Apply max_items limit if specified + if args.max_items > 0 and len(all_texts) > args.max_items: + print(f"Limiting to {args.max_items} chunks (from {len(all_texts)})") + all_texts = all_texts[: args.max_items] + + return all_texts + + +if __name__ == "__main__": + import asyncio + + # Example queries for document RAG + print("\n๐Ÿ“„ Document RAG Example") + print("=" * 50) + print("\nExample queries you can try:") + print("- 'What are the main techniques LEANN uses?'") + print("- 'What is the technique DLPM?'") + print("- 'Who does Elizabeth Bennet marry?'") + print("- 'What is the problem of developing pan gu model? 
(็›˜ๅคๅคงๆจกๅž‹ๅผ€ๅ‘ไธญ้‡ๅˆฐไป€ไนˆ้—ฎ้ข˜?)'") + print("\nOr run without --query for interactive mode\n") + + rag = DocumentRAG() + asyncio.run(rag.run()) diff --git a/examples/email_data/LEANN_email_reader.py b/apps/email_data/LEANN_email_reader.py similarity index 58% rename from examples/email_data/LEANN_email_reader.py rename to apps/email_data/LEANN_email_reader.py index 393daf6..407e2ae 100644 --- a/examples/email_data/LEANN_email_reader.py +++ b/apps/email_data/LEANN_email_reader.py @@ -52,6 +52,11 @@ class EmlxReader(BaseReader): docs: list[Document] = [] max_count = load_kwargs.get("max_count", 1000) count = 0 + total_files = 0 + successful_files = 0 + failed_files = 0 + + print(f"Starting to process directory: {input_dir}") # Walk through the directory recursively for dirpath, dirnames, filenames in os.walk(input_dir): @@ -59,10 +64,12 @@ class EmlxReader(BaseReader): dirnames[:] = [d for d in dirnames if not d.startswith(".")] for filename in filenames: - if count >= max_count: + # Check if we've reached the max count (skip if max_count == -1) + if max_count > 0 and count >= max_count: break if filename.endswith(".emlx"): + total_files += 1 filepath = os.path.join(dirpath, filename) try: # Read the .emlx file @@ -98,17 +105,26 @@ class EmlxReader(BaseReader): and not self.include_html ): continue - body += part.get_payload(decode=True).decode( - "utf-8", errors="ignore" - ) - # break + try: + payload = part.get_payload(decode=True) + if payload: + body += payload.decode("utf-8", errors="ignore") + except Exception as e: + print(f"Error decoding payload: {e}") + continue else: - body = msg.get_payload(decode=True).decode( - "utf-8", errors="ignore" - ) + try: + payload = msg.get_payload(decode=True) + if payload: + body = payload.decode("utf-8", errors="ignore") + except Exception as e: + print(f"Error decoding single part payload: {e}") + body = "" - # Create document content with metadata embedded in text - doc_content = f""" + # Only create document if we have some content + if body.strip() or subject != "No Subject": + # Create document content with metadata embedded in text + doc_content = f""" [File]: {filename} [From]: {from_addr} [To]: {to_addr} @@ -118,18 +134,34 @@ class EmlxReader(BaseReader): {body} """ - # No separate metadata - everything is in the text - doc = Document(text=doc_content, metadata={}) - docs.append(doc) - count += 1 + # No separate metadata - everything is in the text + doc = Document(text=doc_content, metadata={}) + docs.append(doc) + count += 1 + successful_files += 1 + + # Print first few successful files for debugging + if successful_files <= 3: + print( + f"Successfully loaded: {filename} - Subject: {subject[:50]}..." 
+ ) except Exception as e: - print(f"Error parsing email from {filepath}: {e}") + failed_files += 1 + if failed_files <= 5: # Only print first few errors + print(f"Error parsing email from {filepath}: {e}") continue except Exception as e: - print(f"Error reading file {filepath}: {e}") + failed_files += 1 + if failed_files <= 5: # Only print first few errors + print(f"Error reading file {filepath}: {e}") continue - print(f"Loaded {len(docs)} email documents") + print("Processing summary:") + print(f" Total .emlx files found: {total_files}") + print(f" Successfully loaded: {successful_files}") + print(f" Failed to load: {failed_files}") + print(f" Final documents: {len(docs)}") + return docs diff --git a/examples/email_data/email.py b/apps/email_data/email.py similarity index 100% rename from examples/email_data/email.py rename to apps/email_data/email.py diff --git a/apps/email_rag.py b/apps/email_rag.py new file mode 100644 index 0000000..75a6cc8 --- /dev/null +++ b/apps/email_rag.py @@ -0,0 +1,156 @@ +""" +Email RAG example using the unified interface. +Supports Apple Mail on macOS. +""" + +import sys +from pathlib import Path + +# Add parent directory to path for imports +sys.path.insert(0, str(Path(__file__).parent)) + +from base_rag_example import BaseRAGExample, create_text_chunks + +from .email_data.LEANN_email_reader import EmlxReader + + +class EmailRAG(BaseRAGExample): + """RAG example for Apple Mail processing.""" + + def __init__(self): + # Set default values BEFORE calling super().__init__ + self.max_items_default = -1 # Process all emails by default + self.embedding_model_default = ( + "sentence-transformers/all-MiniLM-L6-v2" # Fast 384-dim model + ) + + super().__init__( + name="Email", + description="Process and query Apple Mail emails with LEANN", + default_index_name="mail_index", + ) + + def _add_specific_arguments(self, parser): + """Add email-specific arguments.""" + email_group = parser.add_argument_group("Email Parameters") + email_group.add_argument( + "--mail-path", + type=str, + default=None, + help="Path to Apple Mail directory (auto-detected if not specified)", + ) + email_group.add_argument( + "--include-html", action="store_true", help="Include HTML content in email processing" + ) + email_group.add_argument( + "--chunk-size", type=int, default=256, help="Text chunk size (default: 256)" + ) + email_group.add_argument( + "--chunk-overlap", type=int, default=25, help="Text chunk overlap (default: 25)" + ) + + def _find_mail_directories(self) -> list[Path]: + """Auto-detect all Apple Mail directories.""" + mail_base = Path.home() / "Library" / "Mail" + if not mail_base.exists(): + return [] + + # Find all Messages directories + messages_dirs = [] + for item in mail_base.rglob("Messages"): + if item.is_dir(): + messages_dirs.append(item) + + return messages_dirs + + async def load_data(self, args) -> list[str]: + """Load emails and convert to text chunks.""" + # Determine mail directories + if args.mail_path: + messages_dirs = [Path(args.mail_path)] + else: + print("Auto-detecting Apple Mail directories...") + messages_dirs = self._find_mail_directories() + + if not messages_dirs: + print("No Apple Mail directories found!") + print("Please specify --mail-path manually") + return [] + + print(f"Found {len(messages_dirs)} mail directories") + + # Create reader + reader = EmlxReader(include_html=args.include_html) + + # Process each directory + all_documents = [] + total_processed = 0 + + for i, messages_dir in enumerate(messages_dirs): + print(f"\nProcessing 
directory {i + 1}/{len(messages_dirs)}: {messages_dir}") + + try: + # Count emlx files + emlx_files = list(messages_dir.glob("*.emlx")) + print(f"Found {len(emlx_files)} email files") + + # Apply max_items limit per directory + max_per_dir = -1 # Default to process all + if args.max_items > 0: + remaining = args.max_items - total_processed + if remaining <= 0: + break + max_per_dir = remaining + # If args.max_items == -1, max_per_dir stays -1 (process all) + + # Load emails - fix the parameter passing + documents = reader.load_data( + input_dir=str(messages_dir), + max_count=max_per_dir, + ) + + if documents: + all_documents.extend(documents) + total_processed += len(documents) + print(f"Processed {len(documents)} emails from this directory") + + except Exception as e: + print(f"Error processing {messages_dir}: {e}") + continue + + if not all_documents: + print("No emails found to process!") + return [] + + print(f"\nTotal emails processed: {len(all_documents)}") + print("now starting to split into text chunks ... take some time") + + # Convert to text chunks + # Email reader uses chunk_overlap=25 as in original + all_texts = create_text_chunks( + all_documents, chunk_size=args.chunk_size, chunk_overlap=args.chunk_overlap + ) + + return all_texts + + +if __name__ == "__main__": + import asyncio + + # Check platform + if sys.platform != "darwin": + print("\nโš ๏ธ Warning: This example is designed for macOS (Apple Mail)") + print(" Windows/Linux support coming soon!\n") + + # Example queries for email RAG + print("\n๐Ÿ“ง Email RAG Example") + print("=" * 50) + print("\nExample queries you can try:") + print("- 'What did my boss say about deadlines?'") + print("- 'Find emails about travel expenses'") + print("- 'Show me emails from last month about the project'") + print("- 'What food did I order from DoorDash?'") + print("\nNote: You may need to grant Full Disk Access to your terminal\n") + + rag = EmailRAG() + asyncio.run(rag.run()) diff --git a/examples/history_data/__init__.py b/apps/history_data/__init__.py similarity index 100% rename from examples/history_data/__init__.py rename to apps/history_data/__init__.py diff --git a/examples/history_data/history.py b/apps/history_data/history.py similarity index 95% rename from examples/history_data/history.py rename to apps/history_data/history.py index 4125244..bb2eac1 100644 --- a/examples/history_data/history.py +++ b/apps/history_data/history.py @@ -97,6 +97,11 @@ class ChromeHistoryReader(BaseReader): except Exception as e: print(f"Error reading Chrome history: {e}") + # add you may need to close your browser to make the database file available + # also highlight in red + print( + "\033[91mYou may need to close your browser to make the database file available\033[0m" + ) return docs return docs diff --git a/examples/history_data/wechat_history.py b/apps/history_data/wechat_history.py similarity index 98% rename from examples/history_data/wechat_history.py rename to apps/history_data/wechat_history.py index 4106321..e985bd4 100644 --- a/examples/history_data/wechat_history.py +++ b/apps/history_data/wechat_history.py @@ -411,8 +411,8 @@ Messages ({len(messages)} messages, {message_group["total_length"]} chars): wechat_export_dir = load_kwargs.get("wechat_export_dir", None) include_non_text = load_kwargs.get("include_non_text", False) concatenate_messages = load_kwargs.get("concatenate_messages", False) - load_kwargs.get("max_length", 1000) - load_kwargs.get("time_window_minutes", 30) + max_length = load_kwargs.get("max_length", 1000) + 
time_window_minutes = load_kwargs.get("time_window_minutes", 30) # Default WeChat export path if wechat_export_dir is None: @@ -460,9 +460,9 @@ Messages ({len(messages)} messages, {message_group["total_length"]} chars): # Concatenate messages based on rules message_groups = self._concatenate_messages( readable_messages, - max_length=-1, - time_window_minutes=-1, - overlap_messages=0, # Keep 2 messages overlap between groups + max_length=max_length, + time_window_minutes=time_window_minutes, + overlap_messages=0, # No overlap between groups ) # Create documents from concatenated groups @@ -532,7 +532,9 @@ Message: {readable_text if readable_text else message_text} """ # Create document with embedded metadata - doc = Document(text=doc_content, metadata={}) + doc = Document( + text=doc_content, metadata={"contact_name": contact_name} + ) docs.append(doc) count += 1 @@ -560,8 +562,8 @@ Message: {readable_text if readable_text else message_text} # Look for common export directory names possible_dirs = [ - Path("./wechat_export_test"), Path("./wechat_export"), + Path("./wechat_export_direct"), Path("./wechat_chat_history"), Path("./chat_export"), ] diff --git a/apps/wechat_rag.py b/apps/wechat_rag.py new file mode 100644 index 0000000..7355c6f --- /dev/null +++ b/apps/wechat_rag.py @@ -0,0 +1,189 @@ +""" +WeChat History RAG example using the unified interface. +Supports WeChat chat history export and search. +""" + +import subprocess +import sys +from pathlib import Path + +# Add parent directory to path for imports +sys.path.insert(0, str(Path(__file__).parent)) + +from base_rag_example import BaseRAGExample + +from .history_data.wechat_history import WeChatHistoryReader + + +class WeChatRAG(BaseRAGExample): + """RAG example for WeChat chat history.""" + + def __init__(self): + # Set default values BEFORE calling super().__init__ + self.max_items_default = -1 # Match original default + self.embedding_model_default = ( + "sentence-transformers/all-MiniLM-L6-v2" # Fast 384-dim model + ) + + super().__init__( + name="WeChat History", + description="Process and query WeChat chat history with LEANN", + default_index_name="wechat_history_magic_test_11Debug_new", + ) + + def _add_specific_arguments(self, parser): + """Add WeChat-specific arguments.""" + wechat_group = parser.add_argument_group("WeChat Parameters") + wechat_group.add_argument( + "--export-dir", + type=str, + default="./wechat_export", + help="Directory to store WeChat exports (default: ./wechat_export)", + ) + wechat_group.add_argument( + "--force-export", + action="store_true", + help="Force re-export of WeChat data even if exports exist", + ) + wechat_group.add_argument( + "--chunk-size", type=int, default=192, help="Text chunk size (default: 192)" + ) + wechat_group.add_argument( + "--chunk-overlap", type=int, default=64, help="Text chunk overlap (default: 64)" + ) + + def _export_wechat_data(self, export_dir: Path) -> bool: + """Export WeChat data using wechattweak-cli.""" + print("Exporting WeChat data...") + + # Check if WeChat is running + try: + result = subprocess.run(["pgrep", "WeChat"], capture_output=True, text=True) + if result.returncode != 0: + print("WeChat is not running. 
+
+ def _export_wechat_data(self, export_dir: Path) -> bool:
+ """Export WeChat data using wechattweak-cli."""
+ print("Exporting WeChat data...")
+
+ # Check if WeChat is running
+ try: + result = subprocess.run(["pgrep", "WeChat"], capture_output=True, text=True) + if result.returncode != 0: + print("WeChat is not running. Please start WeChat first.") + return False + except Exception: + pass # pgrep might not be available on all systems
+
+ # Create export directory
+ export_dir.mkdir(parents=True, exist_ok=True)
+
+ # Run export command
+ cmd = ["packages/wechat-exporter/wechattweak-cli", "export", str(export_dir)]
+
+ try: + print(f"Running: {' '.join(cmd)}") + result = subprocess.run(cmd, capture_output=True, text=True)
+ + if result.returncode == 0: + print("WeChat data exported successfully!") + return True + else: + print(f"Export failed: {result.stderr}") + return False
+ + except FileNotFoundError: + print("\nError: wechattweak-cli not found!") + print("Please install it first:") + print(" sudo packages/wechat-exporter/wechattweak-cli install") + return False + except Exception as e: + print(f"Export error: {e}") + return False
+
+ async def load_data(self, args) -> list[str]:
+ """Load WeChat history and convert to text chunks."""
+ # Initialize WeChat reader with export capabilities
+ reader = WeChatHistoryReader()
+
+ # Find existing exports or create new ones using the centralized method
+ export_dirs = reader.find_or_export_wechat_data(args.export_dir)
+ if not export_dirs: + print("Failed to find or export WeChat data. Trying to find any existing exports...") + # Try to find any existing exports in common locations + export_dirs = reader.find_wechat_export_dirs() + if not export_dirs: + print("No WeChat data found. Please ensure WeChat exports exist.") + return []
+
+ # Load documents from all found export directories
+ all_documents = []
+ total_processed = 0
+
+ for i, export_dir in enumerate(export_dirs):
+ print(f"\nProcessing WeChat export {i + 1}/{len(export_dirs)}: {export_dir}")
+
+ try: + # Apply max_items limit per export + max_per_export = -1 + if args.max_items > 0: + remaining = args.max_items - total_processed + if remaining <= 0: + break + max_per_export = remaining
+
+ documents = reader.load_data( + wechat_export_dir=str(export_dir), + max_count=max_per_export, + concatenate_messages=True, # Enable message concatenation for better context + )
+
+ if documents: + print(f"Loaded {len(documents)} chat documents from {export_dir}") + all_documents.extend(documents) + total_processed += len(documents) + else: + print(f"No documents loaded from {export_dir}")
+
+ except Exception as e: + print(f"Error processing {export_dir}: {e}") + continue
+
+ if not all_documents: + print("No documents loaded from any source. Exiting.") + return []
+
+ print(f"\nTotal loaded {len(all_documents)} chat documents from {len(export_dirs)} exports")
+ print("Now splitting into text chunks... this may take some time")
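+
+ # The loop below re-splits each Document and prefixes every chunk with the
+ # contact_name that WeChatHistoryReader stored in doc.metadata, so a
+ # retrieved chunk still identifies which chat it came from, e.g. a chunk
+ # beginning "[Contact] means the message is from: Alice" ("Alice" is an
+ # illustrative name, not from the source data).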
take some time") + + # Convert to text chunks with contact information + all_texts = [] + for doc in all_documents: + # Split the document into chunks + from llama_index.core.node_parser import SentenceSplitter + + text_splitter = SentenceSplitter( + chunk_size=args.chunk_size, chunk_overlap=args.chunk_overlap + ) + nodes = text_splitter.get_nodes_from_documents([doc]) + + for node in nodes: + # Add contact information to each chunk + contact_name = doc.metadata.get("contact_name", "Unknown") + text = f"[Contact] means the message is from: {contact_name}\n" + node.get_content() + all_texts.append(text) + + print(f"Created {len(all_texts)} text chunks from {len(all_documents)} documents") + return all_texts + + +if __name__ == "__main__": + import asyncio + + # Check platform + if sys.platform != "darwin": + print("\nโš ๏ธ Warning: WeChat export is only supported on macOS") + print(" You can still query existing exports on other platforms\n") + + # Example queries for WeChat RAG + print("\n๐Ÿ’ฌ WeChat History RAG Example") + print("=" * 50) + print("\nExample queries you can try:") + print("- 'Show me conversations about travel plans'") + print("- 'Find group chats about weekend activities'") + print("- 'ๆˆ‘ๆƒณไนฐ้ญ”ๆœฏๅธˆ็บฆ็ฟฐ้€Š็š„็ƒ่กฃ,็ป™ๆˆ‘ไธ€ไบ›ๅฏนๅบ”่Šๅคฉ่ฎฐๅฝ•?'") + print("- 'What did we discuss about the project last month?'") + print("\nNote: WeChat must be running for export to work\n") + + rag = WeChatRAG() + asyncio.run(rag.run()) diff --git a/test/sanity_checks/README.md b/benchmarks/README.md similarity index 100% rename from test/sanity_checks/README.md rename to benchmarks/README.md diff --git a/test/sanity_checks/benchmark_embeddings.py b/benchmarks/benchmark_embeddings.py similarity index 100% rename from test/sanity_checks/benchmark_embeddings.py rename to benchmarks/benchmark_embeddings.py diff --git a/examples/compare_faiss_vs_leann.py b/benchmarks/compare_faiss_vs_leann.py similarity index 99% rename from examples/compare_faiss_vs_leann.py rename to benchmarks/compare_faiss_vs_leann.py index ea714f1..03cf508 100644 --- a/examples/compare_faiss_vs_leann.py +++ b/benchmarks/compare_faiss_vs_leann.py @@ -62,7 +62,7 @@ def test_faiss_hnsw(): try: result = subprocess.run( - [sys.executable, "examples/faiss_only.py"], + [sys.executable, "benchmarks/faiss_only.py"], capture_output=True, text=True, timeout=300, @@ -115,7 +115,7 @@ def test_leann_hnsw(): # Load and parse documents documents = SimpleDirectoryReader( - "examples/data", + "data", recursive=True, encoding="utf-8", required_exts=[".pdf", ".txt", ".md"], diff --git a/examples/faiss_only.py b/benchmarks/faiss_only.py similarity index 99% rename from examples/faiss_only.py rename to benchmarks/faiss_only.py index 227e14a..c501c10 100644 --- a/examples/faiss_only.py +++ b/benchmarks/faiss_only.py @@ -65,7 +65,7 @@ def main(): tracker.checkpoint("After Faiss index creation") documents = SimpleDirectoryReader( - "examples/data", + "data", recursive=True, encoding="utf-8", required_exts=[".pdf", ".txt", ".md"], diff --git a/test/micro_tpt.py b/benchmarks/micro_tpt.py similarity index 100% rename from test/micro_tpt.py rename to benchmarks/micro_tpt.py diff --git a/examples/run_evaluation.py b/benchmarks/run_evaluation.py similarity index 97% rename from examples/run_evaluation.py rename to benchmarks/run_evaluation.py index 0e55178..2fae210 100644 --- a/examples/run_evaluation.py +++ b/benchmarks/run_evaluation.py @@ -200,10 +200,10 @@ def main(): args = parser.parse_args() # --- Path Configuration --- - # Assumes a 
project structure where the script is in 'examples/' - # and data is in 'data/' at the project root. - project_root = Path(__file__).resolve().parent.parent - data_root = project_root / "data" + # Assumes a project structure where the script is in 'benchmarks/' + # and evaluation data is in 'benchmarks/data/'. + script_dir = Path(__file__).resolve().parent + data_root = script_dir / "data" # Download data based on mode if args.mode == "build": @@ -279,7 +279,9 @@ def main(): if not args.index_path: print("No indices found. The data download should have included pre-built indices.") - print("Please check the data/indices/ directory or provide --index-path manually.") + print( + "Please check the benchmarks/data/indices/ directory or provide --index-path manually." + ) sys.exit(1) # Detect dataset type from index path to select the correct ground truth diff --git a/test/simple_mac_tpt_test.py b/benchmarks/simple_mac_tpt_test.py similarity index 100% rename from test/simple_mac_tpt_test.py rename to benchmarks/simple_mac_tpt_test.py diff --git a/examples/data/2501.14312v1 (1).pdf b/data/2501.14312v1 (1).pdf similarity index 100% rename from examples/data/2501.14312v1 (1).pdf rename to data/2501.14312v1 (1).pdf diff --git a/examples/data/2506.08276v1.pdf b/data/2506.08276v1.pdf similarity index 100% rename from examples/data/2506.08276v1.pdf rename to data/2506.08276v1.pdf diff --git a/examples/data/PrideandPrejudice.txt b/data/PrideandPrejudice.txt similarity index 100% rename from examples/data/PrideandPrejudice.txt rename to data/PrideandPrejudice.txt diff --git a/data/README.md b/data/README.md index bf4af69..0dd35da 100644 --- a/data/README.md +++ b/data/README.md @@ -1,44 +1,82 @@ ---- -license: mit +# ็›˜ๅคไน‹ๆฎ‡๏ผšๅŽไธบ่ฏบไบš็›˜ๅคๅคงๆจกๅž‹็ ”ๅ‘ๅކ็จ‹็š„ๅฟƒ้…ธไธŽ้ป‘ๆš— + +ๅ„ไฝๅฅฝ๏ผŒ + +ๆˆ‘ๆ˜ฏไธ€ๅ็›˜ๅคๅคงๆจกๅž‹ๅ›ข้˜Ÿ๏ผŒๅŽไธบ่ฏบไบšๆ–น่ˆŸๅฎž้ชŒๅฎค็š„ๅ‘˜ๅทฅใ€‚ + +้ฆ–ๅ…ˆไธบ่‡ช่ฏ่บซไปฝ๏ผŒๅˆ—ไธพไธ€ไบ›็ป†่Š‚๏ผš + +1. ็Žฐ่ฏบไบšไธปไปป๏ผŒๅ‰็ฎ—ๆณ•ๅบ”็”จ้ƒจ้ƒจ้•ฟ๏ผŒๅŽๆ”นๅไธบๅฐๆจกๅž‹ๅฎž้ชŒๅฎค็š„ไธปไปป็Ž‹ไบ‘้นคใ€‚ๅ‰่ฏบไบšไธปไปป๏ผšๅงš้ช๏ผˆๅคงๅฎถ็งฐๅงš่€ๅธˆ๏ผ‰ใ€‚ๅ‡ ไธชๅฎž้ชŒๅฎคไธปไปป๏ผšๅ”็ฟๆ˜Ž๏ผˆๆ˜Žๅ“ฅ๏ผŒๆ˜Ž้˜Ÿ๏ผŒๅทฒ็ฆป่Œ๏ผ‰๏ผŒๅฐšๅˆฉๅณฐ๏ผŒๅผ ็ปด๏ผˆ็ปดๅ“ฅ๏ผ‰๏ผŒ้ƒๅปบไธš๏ผˆ้ƒ่€ๅธˆ๏ผ‰๏ผŒๅˆ˜ๆญฆ้พ™๏ผˆ็งฐๅ‘ผไธบๆญฆ้พ™ๆ‰€๏ผ‰็ญ‰ใ€‚ๅ…ถไป–้ชจๅนฒๆˆๅ‘˜ๅ’Œไธ“ๅฎถ้™†็ปญๆœ‰ๅพˆๅคšไบบ็ฆป่Œใ€‚ +2. ๆˆ‘ไปฌ้šถๅฑžไบŽโ€œๅ››้‡Žโ€่ฟ™ไธช็ป„็ป‡ใ€‚ๅ››้‡Žไธ‹ๅฑžๆœ‰่ฎธๅคš็บต้˜Ÿ๏ผŒๅŸบ็ก€่ฏญ่จ€ๅคงๆจกๅž‹ๆ˜ฏๅ››็บตใ€‚็Ž‹ไบ‘้นค็š„ๅฐๆจกๅž‹ๆ˜ฏๅๅ…ญ็บต้˜Ÿใ€‚ๆˆ‘ไปฌๅ‚ๅŠ ่ฟ‡่‹ๅทž็š„้›†็ป“๏ผŒๆœ‰ๅ„็งๆœˆไปฝ็š„ๆ—ถ้—ด่Š‚็‚นใ€‚ๅœจ่‹ๅทžๆ”ปๅ…ณไผš้ขๅ‘ไปปๅŠกไปค๏ผŒ้œ€่ฆๅœจ่Š‚็‚นๅ‰่พพๆˆ็›ฎๆ ‡ใ€‚่‹ๅทž้›†็ป“ไผšๆŠŠๅ„ๅœฐ็š„ไบบๅ‘˜้ƒฝ้›†ไธญๅœจ่‹ๅทž็ ”็ฉถๆ‰€๏ผŒๅนณๅธธไฝๅฎพ้ฆ†๏ผŒๆฏ”ๅฆ‚ๅœจ็”ช็›ด็š„้…’ๅบ—๏ผŒไธŽๅฎถไบบๅญฉๅญๅคฉๅ„ไธ€ๆ–นใ€‚ +3. ๅœจ่‹ๅทž้›†็ป“็š„ๆ—ถๅ€™ๅ‘จๅ…ญ้ป˜่ฎคไธŠ็ญ๏ผŒ้žๅธธ่พ›่‹ฆ๏ผŒไธ่ฟ‡ๅ‘จๅ…ญๆœ‰ไธ‹ๅˆ่Œถ๏ผŒๆœ‰ไธ€ๆฌก่ฟ˜ๆœ‰ๅฐ้พ™่™พใ€‚ๅœจ่‹ๅทž็ ”็ฉถๆ‰€็š„ๅทฅไฝๆฌ่ฟ่ฟ‡ไธ€ๆฌก๏ผŒไปŽไธ€ๆ ‹ๆฅผๆขๅˆฐไบ†ๅฆไธ€ๆ ‹ใ€‚่‹ๅทž็ ”็ฉถๆ‰€ๆฅผๆ ‹้ƒฝๆ˜ฏๆฌงๅผ่ฃ…ไฟฎ๏ผŒ้—จๅฃๆœ‰ๅคงๅก๏ผŒ้‡Œ้ขๆ™ฏ่‰ฒๅพˆไธ้”™ใ€‚ๅŽป่‹ๅทž้›†็ป“ไธ€่ˆฌ่‡ณๅฐ‘่ฆๅŽปไธ€ๅ‘จ๏ผŒ็”š่‡ณๆ›ดไน…๏ผŒๅคš็š„ไบบ็”š่‡ณไธ€ไธคไธชๆœˆ้ƒฝๅ›žไธไบ†ๅฎถใ€‚ +4. ่ฏบไบšๆ›พ็ปไผ ่ฏดๆ˜ฏ็ ”็ฉถๅž‹็š„๏ผŒไฝ†ๆ˜ฏๆฅไบ†ไน‹ๅŽๅ› ไธบๅœจๅ››้‡Žๅšๅคงๆจกๅž‹้กน็›ฎ๏ผŒ้กน็›ฎๆˆๅ‘˜ๅฎŒๅ…จๅ˜ๆˆไบ†ไบคไป˜ๅž‹็š„๏ผŒไธ”ๅ……ๆปกไบ†ไพ‹ไผš๏ผŒ่ฏ„ๅฎก๏ผŒๆฑ‡ๆŠฅใ€‚ๅพˆๅคšๆ—ถๅ€™ๅšๅฎž้ชŒ้ƒฝ่ฆ็”ณ่ฏทใ€‚ๅ›ข้˜Ÿ้œ€่ฆๅฏนๆŽฅ็ปˆ็ซฏๅฐ่‰บ๏ผŒๅŽไธบไบ‘๏ผŒICT็ญ‰่ฏธๅคšไธšๅŠก็บฟ๏ผŒไบคไป˜ๅŽ‹ๅŠ›ไธๅฐใ€‚ +5. 
่ฏบไบš็ ”ๅ‘็š„็›˜ๅคๆจกๅž‹ๆ—ฉๆœŸๅ†…้ƒจไปฃๅทๅซๅšโ€œ็›˜ๅคๆ™บๅญโ€๏ผŒไธ€ๅผ€ๅง‹ๅชๆœ‰ๅ†…้ƒจ้œ€่ฆ็”ณ่ฏท่ฏ•็”จ็š„็ฝ‘้กต็‰ˆ๏ผŒๅˆฐๅŽ็ปญ่ฟซไบŽๅŽ‹ๅŠ›ๅœจwelinkไธŠๆŽฅๅ…ฅๅ’Œๅ…ฌๆต‹ๅผ€ๆ”พใ€‚ + +่ฟ™ไบ›ๅคฉๅ‘็”Ÿๅ…ณไบŽ่ดจ็–‘็›˜ๅคๅคงๆจกๅž‹ๆŠ„่ขญๅƒ้—ฎ็š„ไบ‹ๆƒ…้—น็š„ๆฒธๆฒธๆ‰ฌๆ‰ฌใ€‚ไฝœไธบไธ€ไธช็›˜ๅคๅ›ข้˜Ÿ็š„ๆˆๅ‘˜๏ผŒๆˆ‘ๆœ€่ฟ‘ๅคœๅคœ่พ—่ฝฌๅไพง๏ผŒ้šพไปฅๅ…ฅ็œ ใ€‚็›˜ๅค็š„ๅ“็‰Œๅ—ๅˆฐๅฆ‚ๆญคๅคง็š„ๅฝฑๅ“๏ผŒไธ€ๆ–น้ข๏ผŒๆˆ‘่‡ช็ง็š„ไธบๆˆ‘็š„่Œไธšๅ‘ๅฑ•ๆ‹…ๅฟง๏ผŒไนŸไธบ่‡ชๅทฑ่ฟ‡ๅŽป็š„ๅŠชๅŠ›ๅทฅไฝœๆ„Ÿๅˆฐไธๅ€ผใ€‚ๅฆไธ€ๆ–น้ข๏ผŒ็”ฑไบŽๆœ‰ไบบๅผ€ๅง‹ๆญ้œฒ่ฟ™ไบ›ไบ‹ๆƒ…ๆˆ‘ๅ†…ๅฟƒๅˆๆ„Ÿๅˆฐๅคงๅฟซไบบๅฟƒใ€‚ๅœจๅคšๅฐ‘ไธชๆ—ฅๆ—ฅๅคœๅคœ๏ผŒๆˆ‘ไปฌๅฏนๅ†…้ƒจๆŸไบ›ไบบไธ€ๆฌกๆฌก้ ็€้€ ๅ‡่€Œๅˆ่Žทๅพ—ไบ†ๆ— ๆ•ฐๅˆฉ็›Š็š„่กŒไธบๅ’ฌ็‰™ๅˆ‡้ฝฟ่€Œๅˆๆ— ่ƒฝไธบๅŠ›ใ€‚่ฟ™็งๅŽ‹ๆŠ‘ๅ’Œ็พž่พฑไนŸ้€ๆธๆถˆ็ฃจไบ†ๆˆ‘ๅฏนๅŽไธบ็š„ๆ„Ÿๆƒ…๏ผŒ่ฎฉๆˆ‘ๅœจ่ฟ™้‡Œ็š„ๆ—ถๆ—ฅ้€ๆธๆต‘ๆต‘ๅ™ฉๅ™ฉ๏ผŒ่ฟท่Œซๆ— ๆŽช๏ผŒๆ—ถๅธธๆ€€็–‘่‡ชๅทฑ็š„ไบบ็”Ÿๅ’Œ่‡ชๆˆ‘ไปทๅ€ผใ€‚ + +ๆˆ‘ๆ‰ฟ่ฎคๆˆ‘ๆ˜ฏไธ€ไธชๆ‡ฆๅผฑ็š„ไบบ๏ผŒไฝœไธบไธ€ไธชๅฐๅฐ็š„ๆ‰“ๅทฅไบบ๏ผŒๆˆ‘ไธไป…ไธๆ•ขๅ’Œ็Ž‹ไบ‘้นค็ญ‰ๅ†…้ƒจๆ‰‹็œผ้€šๅคฉ็š„ไบบๅšๅฏน๏ผŒๆ›ดไธๆ•ขๅ’ŒๅŽไธบ่ฟ™ๆ ท็š„ๅบž็„ถๅคง็‰ฉๅšๅฏนใ€‚ๆˆ‘ๅพˆๆ€•ๅคฑๅŽปๆˆ‘็š„ๅทฅไฝœ๏ผŒๆฏ•็ซŸๆˆ‘ไนŸๆœ‰ๅฎถไบบๅ’Œๅญฉๅญ๏ผŒๆ‰€ไปฅๆˆ‘ๆ‰“ๅฟƒ็œผ้‡Œๅพˆไฝฉๆœๆญ้œฒ่€…ใ€‚ไฝ†ๆ˜ฏ๏ผŒ็œ‹ๅˆฐๅ†…้ƒจ่ฟ˜ๅœจ่ฏ•ๅ›พๆด—ๅœฐๆŽฉ็›–ไบ‹ๅฎž๏ผŒ่’™่”ฝๅ…ฌไผ—็š„ๆ—ถๅ€™๏ผŒๆˆ‘ๅฎžๅœจไธ่ƒฝๅฎนๅฟไบ†ใ€‚ๆˆ‘ไนŸๅธŒๆœ›ๅ‹‡ๆ•ขไธ€ๆฌก๏ผŒ้กบไปŽ่‡ชๅทฑๆœฌๅฟƒใ€‚ๅฐฑ็ฎ—่‡ชๆŸๅ…ซ็™พ๏ผŒๆˆ‘ไนŸๅธŒๆœ›่ƒฝไผคๆ•Œไธ€ๅƒใ€‚ๆˆ‘ๅ†ณๅฎšๆŠŠๆˆ‘ๅœจ่ฟ™้‡Œ็š„ๆ‰€่งๆ‰€้—ป๏ผˆ้ƒจๅˆ†ๆฅ่‡ชไบŽๅŒไบ‹ๅฃ่ฟฐ๏ผ‰ๅ…ฌๅธƒๅ‡บๆฅ๏ผŒๅ…ณไบŽ็›˜ๅคๅคงๆจกๅž‹็š„โ€œไผ ๅฅ‡ๆ•…ไบ‹โ€๏ผš + +ๅŽไธบ็กฎๅฎžไธป่ฆๅœจๆ˜‡่…พๅกไธŠ่ฎญ็ปƒๅคงๆจกๅž‹๏ผˆๅฐๆจกๅž‹ๅฎž้ชŒๅฎคๆœ‰ไธๅฐ‘่‹ฑไผŸ่พพ็š„ๅก๏ผŒไป–ไปฌไน‹ๅ‰ไนŸไผš็”จๆฅ่ฎญ็ปƒ๏ผŒๅŽ้ข่ฝฌ็งปๅˆฐๆ˜‡่…พ๏ผ‰ใ€‚ๆ›พ็ปๆˆ‘่ขซๅŽไธบโ€œๆ‰“้€ ไธ–็•Œ็ฌฌไบŒ้€‰ๆ‹ฉโ€็š„ๅ†ณๅฟƒ่€ŒๆŠ˜ๆœ๏ผŒๆˆ‘ๆœฌ่บซไนŸๆ›พ็ปๅฏนๅŽไธบๆœ‰ๆทฑๅŽš็š„ๆ„Ÿๆƒ…ใ€‚ๆˆ‘ไปฌ้™ช็€ๆ˜‡่…พไธ€ๆญฅๆญฅๆ‘ธ็ˆฌๆปšๆ‰“๏ผŒไปŽๅ……ๆปกbugๅˆฐ็Žฐๅœจ่ƒฝ่ฎญๅ‡บๆจกๅž‹๏ผŒไป˜ๅ‡บไบ†ๅทจๅคง็š„ๅฟƒ่ก€ๅ’Œไปฃไปทใ€‚ + +ๆœ€ๅˆๆˆ‘ไปฌ็š„็ฎ—ๅŠ›้žๅธธๆœ‰้™๏ผŒๅœจ910AไธŠ่ฎญ็ปƒๆจกๅž‹ใ€‚้‚ฃไผšๅชๆ”ฏๆŒfp16๏ผŒ่ฎญ็ปƒ็š„็จณๅฎšๆ€ง่ฟœไธๅฆ‚bf16ใ€‚็›˜ๅค็š„moeๅผ€ๅง‹ๅพˆๆ—ฉ๏ผŒ23ๅนดๅฐฑไธป่ฆๆ˜ฏ่ฎญ็ปƒ38Bmoeๆจกๅž‹ๅ’ŒๅŽ็ปญ็š„71B denseๆจกๅž‹ใ€‚71B็š„denseๆจกๅž‹้€š่ฟ‡ๆ‰ฉๅขžๅ˜ๆˆไบ†็ฌฌไธ€ไปฃ็š„135Bdenseๆจกๅž‹๏ผŒๅŽ้ขไธปๅŠ›ๆจกๅž‹ไนŸ้€ๆธๅœจ910BไธŠ่ฎญ็ปƒใ€‚ + +71Bๅ’Œ135Bๆจกๅž‹้ƒฝๆœ‰ไธ€ไธชๅทจๅคง็š„็กฌไผคๅฐฑๆ˜ฏtokenizerใ€‚ๅฝ“ๆ—ถไฝฟ็”จ็š„tokenizer็ผ–็ ๆ•ˆ็އๆžไฝŽ๏ผŒๆฏไธชๅ•ไธช็š„็ฌฆๅท๏ผŒๆ•ฐๅญ—๏ผŒ็ฉบๆ ผ๏ผŒไนƒ่‡ณๆฑ‰ๅญ—้ƒฝไผšๅ ็”จไธ€ไธชtokenใ€‚ๅฏๆƒณ่€Œ็Ÿฅ่ฟ™ไผš้žๅธธๆตช่ดน็ฎ—ๅŠ›๏ผŒไธ”ไฝฟๅพ—ๆจกๅž‹็š„ๆ•ˆๆžœๅพˆๅทฎใ€‚่ฟ™ๆ—ถๅ€™ๅฐๆจกๅž‹ๅฎž้ชŒๅฎคๆญฃๅฅฝๆœ‰ไธช่‡ชๅทฑ่ฎญ็š„่ฏ่กจใ€‚ๅงš่€ๅธˆๅฝ“ๆ—ถๆ€€็–‘ๆ˜ฏไธๆ˜ฏๆจกๅž‹็š„tokenizerไธๅฅฝ๏ผˆ่™ฝ็„ถไบ‹ๅŽๆฅ็œ‹๏ผŒไป–็š„ๆ€€็–‘ๆ˜ฏๆ— ็–‘ๆญฃ็กฎ็š„๏ผ‰๏ผŒไบŽๆ˜ฏๅฐฑๅ†ณๅฎš๏ผŒ่ฎฉ71Bๅ’Œ135Bๆขtokenizer๏ผŒๅ› ไธบๅฐๆจกๅž‹ๅฎž้ชŒๅฎคๆ›พ็ปๅฐ่ฏ•่ฟ‡ใ€‚ๅ›ข้˜Ÿ็ผๅˆไบ†ไธคไธชtokenizer๏ผŒๅผ€ๅง‹ไบ†tokenizer็š„ๆ›ดๆขใ€‚71Bๆจกๅž‹็š„ๆ›ดๆขๅคฑ่ดฅไบ†๏ผŒ่€Œ135Bๅ› ไธบ้‡‡็”จไบ†ๆ›ด็ฒพ็ป†็š„embeddingๅˆๅง‹ๅŒ–็ญ–็•ฅ๏ผŒ็ปญ่ฎญไบ†่‡ณๅฐ‘1T็š„ๆ•ฐๆฎๅŽ่ฏ่กจๆ€ป็ฎ—ๆ›ดๆขๆˆๅŠŸ๏ผŒไฝ†ๅฏๆƒณ่€Œ็Ÿฅ๏ผŒๆ•ˆๆžœๅนถไธไผšๅ˜ๅฅฝใ€‚ + +ไบŽๆญคๅŒๆœŸ๏ผŒ้˜ฟ้‡Œๅ’Œๆ™บ่ฐฑ็ญ‰ๅ›ฝๅ†…ๅ…ถไป–ๅ…ฌๅธๅœจGPUไธŠ่ฎญ็ปƒ๏ผŒไธ”ๅทฒ็ปๆ‘ธ็ดขๅ‡บไบ†ๆญฃ็กฎ็š„ๆ–นๆณ•๏ผŒ็›˜ๅคๅ’Œ็ซžๅ“็š„ๅทฎ่ท่ถŠๆฅ่ถŠๅคงใ€‚ๅ†…้ƒจไธ€ไธช230BไปŽๅคด่ฎญ็ปƒ็š„denseๆจกๅž‹ๅˆๅ› ไธบๅ„็งๅŽŸๅ› ่ฎญ็ปƒๅคฑ่ดฅ๏ผŒๅฏผ่‡ด้กน็›ฎ็š„็Šถๅ†ตๅ‡ ไนŽ้™ทๅ…ฅ็ปๅขƒใ€‚้ขไธดๅ‡ ไธช่Š‚็‚น็š„ๅŽ‹ๅŠ›ไปฅๅŠๅ†…้ƒจๅฏน็›˜ๅค็š„ๅผบ็ƒˆ่ดจ็–‘ๆ—ถ๏ผŒๅ›ข้˜Ÿ็š„ๅฃซๆฐ”ไฝŽ่ฟทๅˆฐไบ†ๆž็‚นใ€‚ๅ›ข้˜Ÿๅœจ็ฎ—ๅŠ›ๆžๅ…ถๆœ‰้™็š„ๆ—ถๅ€™๏ผŒๅšๅ‡บไบ†ๅพˆๅคšๅŠชๅŠ›ๅ’ŒๆŒฃๆ‰Žใ€‚ๆฏ”ๅฆ‚๏ผŒๅ›ข้˜Ÿๅถ็„ถๅ‘็Žฐๅฝ“ๆ—ถ็š„38B 
moeๅนถๆฒกๆœ‰้ข„ๆœŸmoe็š„ๆ•ˆๆžœใ€‚ไบŽๆ˜ฏๅŽปๆމไบ†moeๅ‚ๆ•ฐ๏ผŒ่ฟ˜ๅŽŸไธบไบ†13B็š„denseๆจกๅž‹ใ€‚็”ฑไบŽ38B็š„moeๆบ่‡ชๅพˆๆ—ฉ็š„pangu alpha 13B๏ผŒๆžถๆž„็›ธๅฏน่ฝๅŽ๏ผŒๅ›ข้˜Ÿ่ฟ›่กŒไบ†ไธ€็ณปๅˆ—็š„ๆ“ไฝœ๏ผŒๆฏ”ๅฆ‚ๅˆ‡ๆข็ปๅฏนไฝ็ฝฎ็ผ–็ ๅˆฐrope๏ผŒๅŽปๆމbias๏ผŒๅˆ‡ๆขไธบrmsnormใ€‚ๅŒๆ—ถ้‰ดไบŽtokenizer็š„ไธ€ไบ›ๅคฑ่ดฅๅ’Œๆข่ฏ่กจ็š„็ป้ชŒ๏ผŒ่ฟ™ไธชๆจกๅž‹็š„่ฏ่กจไนŸๆ›ดๆขไธบไบ†็Ž‹ไบ‘้นค็š„ๅฐๆจกๅž‹ๅฎž้ชŒๅฎค7Bๆจกๅž‹ๆ‰€ไฝฟ็”จ็š„่ฏ่กจใ€‚ๅŽ้ข่ฟ™ไธช13Bๆจกๅž‹่ฟ›่กŒไบ†ๆ‰ฉๅขž็ปญ่ฎญ๏ผŒๅ˜ๆˆไบ†็ฌฌไบŒไปฃ38B denseๆจกๅž‹๏ผˆๅœจๅ‡ ไธชๆœˆๅ†…่ฟ™ไธชๆจกๅž‹้ƒฝๆ˜ฏไธป่ฆ็š„็›˜ๅคไธญๆกฃไฝๆจกๅž‹๏ผ‰๏ผŒๆ›พ็ปๅ…ทๆœ‰ไธ€ๅฎš็š„็ซžไบ‰ๅŠ›ใ€‚ไฝ†ๆ˜ฏ๏ผŒ็”ฑไบŽๆ›ดๅคง็š„135Bๆจกๅž‹ๆžถๆž„่ฝๅŽ๏ผŒไธ”ๆ›ดๆข่ฏ่กจๆจกๅž‹ๆŸไผคๅทจๅคง๏ผˆๅŽ็ปญๅˆ†ๆžๅ‘็Žฐๅฝ“ๆ—ถๆ›ดๆข็š„็ผๅˆ่ฏ่กจๆœ‰ๆ›ดไธฅ้‡็š„bug๏ผ‰๏ผŒ็ปญ่ฎญๅŽไนŸไธŽๅƒ้—ฎ็ญ‰ๅฝ“ๆ—ถๅ›ฝๅ†…้ข†ๅ…ˆๆจกๅž‹ๅญ˜ๅœจๅพˆๅคงๅทฎ่ทใ€‚่ฟ™ๆ—ถ็”ฑไบŽๅ†…้ƒจ็š„่ดจ็–‘ๅฃฐๅ’Œ้ข†ๅฏผ็š„ๅŽ‹ๅŠ›ไนŸ่ถŠๆฅ่ถŠๅคงใ€‚ๅ›ข้˜Ÿ็š„็Šถๆ€ๅ‡ ไนŽ้™ทๅ…ฅไบ†็ปๅขƒใ€‚ + +ๅœจ่ฟ™็งๆƒ…ๅ†ตไธ‹๏ผŒ็Ž‹ไบ‘้นคๅ’Œไป–็š„ๅฐๆจกๅž‹ๅฎž้ชŒๅฎคๅ‡บๆ‰‹ไบ†ใ€‚ไป–ไปฌๅฃฐ็งฐๆ˜ฏไปŽๆ—ง็š„135Bๅ‚ๆ•ฐ็ปงๆ‰ฟๆ”น้€ ่€Œๆฅ๏ผŒ้€š่ฟ‡่ฎญ็ปƒ็Ÿญ็Ÿญ็š„ๅ‡ ็™พBๆ•ฐๆฎ๏ผŒๅ„้กนๆŒ‡ๆ ‡ๅนณๅ‡ๆๅ‡ไบ†ๅไธช็‚นๅทฆๅณใ€‚ๅฎž้™…ไธŠ๏ผŒ่ฟ™ๅฐฑๆ˜ฏไป–ไปฌๅฅ—ๅฃณๅบ”็”จๅˆฐๅคงๆจกๅž‹็š„็ฌฌไธ€ๆฌกๆฐไฝœใ€‚ๅŽไธบ็š„ๅค–่กŒ้ข†ๅฏผๅ†…่กŒ๏ผŒไฝฟๅพ—้ข†ๅฏผๅฎŒๅ…จๅฏนไบŽ่ฟ™็งๆ‰ฏๆทก็š„ไบ‹ๆƒ…ๆฒกๆœ‰ๆฆ‚ๅฟต๏ผŒไป–ไปฌๅชไผš่ง‰ๅพ—่‚ฏๅฎšๆ˜ฏๆœ‰ไป€ไนˆ็ฎ—ๆณ•ๅˆ›ๆ–ฐใ€‚็ป่ฟ‡ๅ†…้ƒจ็š„ๅˆ†ๆž๏ผŒไป–ไปฌๅฎž้™…ไธŠๆ˜ฏไฝฟ็”จQwen 1.5 110B็ปญ่ฎญ่€Œๆฅ๏ผŒ้€š่ฟ‡ๅŠ ๅฑ‚๏ผŒๆ‰ฉๅขžffn็ปดๅบฆ๏ผŒๆทปๅŠ ็›˜ๅคpi่ฎบๆ–‡็š„ไธ€ไบ›ๆœบๅˆถๅพ—ๆฅ๏ผŒๅ‡‘ๅคŸไบ†ๅคงๆฆ‚135B็š„ๅ‚ๆ•ฐใ€‚ๅฎž้™…ไธŠ๏ผŒๆ—ง็š„135Bๆœ‰107ๅฑ‚๏ผŒ่€Œ่ฟ™ไธชๆจกๅž‹ๅชๆœ‰82ๅฑ‚๏ผŒๅ„็ง้…็ฝฎไนŸ้ƒฝไธไธ€ๆ ทใ€‚ๆ–ฐ็š„ๆฅ่ทฏไธๆ˜Ž็š„135B่ฎญ็ปƒๅฎŒๅพˆๅคšๅ‚ๆ•ฐ็š„ๅˆ†ๅธƒไนŸๅ’ŒQwen 110Bๅ‡ ไนŽไธ€ๆจกไธ€ๆ ทใ€‚่ฟžๆจกๅž‹ไปฃ็ ็š„็ฑปๅๅฝ“ๆ—ถ้ƒฝๆ˜ฏQwen๏ผŒ็”š่‡ณๆ‡’ๅพ—ๆ”นๅใ€‚ๅŽ็ปญ่ฟ™ไธชๆจกๅž‹ๅฐฑๆ˜ฏๆ‰€่ฐ“็š„135B V2ใ€‚่€Œ่ฟ™ไธชๆจกๅž‹ๅฝ“ๆ—ถไนŸๆไพ›็ป™ไบ†ๅพˆๅคšไธ‹ๆธธ๏ผŒ็”š่‡ณๅŒ…ๆ‹ฌๅค–้ƒจๅฎขๆˆทใ€‚ + +่ฟ™ไปถไบ‹ๅฏนไบŽๆˆ‘ไปฌ่ฟ™ไบ›่ฎค็œŸ่ฏšๅฎžๅšไบ‹็š„ๅŒไบ‹ไปฌๅธฆๆฅไบ†ๅทจๅคง็š„ๅ†ฒๅ‡ป๏ผŒๅ†…้ƒจๅพˆๅคšไบบๅ…ถๅฎž้ƒฝ็Ÿฅ้“่ฟ™ไปถไบ‹๏ผŒ็”š่‡ณๅŒ…ๆ‹ฌ็ปˆ็ซฏๅ’ŒๅŽไธบไบ‘ใ€‚ๆˆ‘ไปฌ้ƒฝๆˆ็งฐไปฅๅŽๅˆซๅซ็›˜ๅคๆจกๅž‹ไบ†๏ผŒๅซๅƒๅคๅงใ€‚ๅฝ“ๆ—ถๅ›ข้˜Ÿๆˆๅ‘˜ๅฐฑๆƒณๅ‘bcgไธพๆŠฅไบ†๏ผŒๆฏ•็ซŸ่ฟ™ๅทฒ็ปๆ˜ฏ้‡ๅคง็š„ไธšๅŠก้€ ๅ‡ไบ†ใ€‚ไฝ†ๆ˜ฏๅŽ้ขๆฎ่ฏด่ขซ้ข†ๅฏผๆ‹ฆไบ†ไธ‹ๆฅ๏ผŒๅ› ไธบๆ›ด้ซ˜็บงๅˆซ็š„้ข†ๅฏผ๏ผˆๆฏ”ๅฆ‚ๅงš่€ๅธˆ๏ผŒไปฅๅŠๅฏ่ƒฝ็†Šๆ€ปๅ’ŒๆŸฅ่€๏ผ‰ๅ…ถๅฎžๅŽ้ขไนŸ็Ÿฅ้“ไบ†๏ผŒไฝ†ๆ˜ฏๅนถไธ็ฎก๏ผŒๅ› ไธบ้€š่ฟ‡ๅฅ—ๅฃณๆ‹ฟๅ‡บๅฅฝ็š„็ป“ๆžœ๏ผŒๅฏนไป–ไปฌไนŸๆ˜ฏๆœ‰ๅˆฉ็š„ใ€‚่ฟ™ไปถไบ‹ไฝฟๅพ—ๅฝ“ๆ—ถๅ›ข้˜Ÿๅ‡ ไฝๆœ€ๅผบ็š„ๅŒไบ‹ๅผ€ๅง‹ๅฟƒ็ฐๆ„ๅ†ท๏ผŒ็ฆป่Œ่ท‘่ทฏไนŸ้€ๆธๆˆไธบๆŒ‚ๅœจๅ˜ด่พน็š„ไบ‹ใ€‚ + +ๆญคๆ—ถ๏ผŒ็›˜ๅคไผผไนŽ่ฟŽๆฅไบ†่ฝฌๆœบใ€‚็”ฑไบŽๅ‰้ขๆ‰€่ฟฐ็š„่ฟ™ไบ›็›˜ๅคๆจกๅž‹ๅŸบๆœฌ้ƒฝๆ˜ฏ็ปญ่ฎญๅ’Œๆ”น้€ ่€Œๆฅ๏ผŒๅฝ“ๆ—ถ่ฏบไบšๅฎŒๅ…จๆฒกๆœ‰ๆŽŒๆกไปŽๅคด่ฎญ็ปƒ็š„ๆŠ€ๆœฏ๏ผŒไฝ•ๅ†ต่ฟ˜ๆ˜ฏๅœจๆ˜‡่…พ็š„NPUไธŠ่ฟ›่กŒ่ฎญ็ปƒใ€‚ๅœจๅฝ“ๆ—ถๅ›ข้˜Ÿ็š„ๆ ธๅฟƒๆˆๅ‘˜็š„ๆžๅŠ›ไบ‰ๅ–ไธ‹๏ผŒ็›˜ๅคๅผ€ๅง‹ไบ†็ฌฌไธ‰ไปฃๆจกๅž‹็š„่ฎญ็ปƒ๏ผŒไป˜ๅ‡บไบ†ๅทจๅคง็š„ๅŠชๅŠ›ๅŽ๏ผŒๅœจๆ•ฐๆฎๆžถๆž„ๅ’Œ่ฎญ็ปƒ็ฎ—ๆณ•ๆ–น้ข้ƒฝไธŽไธš็•Œ้€ๆธๆŽฅ่ฝจ๏ผŒ่€Œ่ฟ™ๅ…ถไธญ็š„่‰ฐ่พ›ๅ’Œๅฐๆจกๅž‹ๅฎž้ชŒๅฎค็š„ไบบไธ€็‚นๅ…ณ็ณป้ƒฝๆฒกๆœ‰ใ€‚ + +ไธ€ๅผ€ๅง‹ๅ›ข้˜Ÿๆˆๅ‘˜ๆฏซๆ— ไฟกๅฟƒ๏ผŒๅชไปŽไธ€ไธช13B็š„ๆจกๅž‹ๅผ€ๅง‹่ฎญ็ปƒ๏ผŒไฝ†ๆ˜ฏๅŽ้ขๅ‘็Žฐๆ•ˆๆžœ่ฟ˜ไธ้”™๏ผŒไบŽๆ˜ฏ่ฟ™ไธชๆจกๅž‹ๅŽ็ปญๅ†ๆฌก่ฟ›่กŒไบ†ไธ€ๆฌกๅ‚ๆ•ฐๆ‰ฉๅขž๏ผŒๅ˜ๆˆไบ†็ฌฌไธ‰ไปฃ็š„38B๏ผŒไปฃๅท38B 
V3ใ€‚ๆƒณๅฟ…ๅพˆๅคšไบงๅ“็บฟ็š„ๅ…„ๅผŸ้ƒฝๅฏน่ฟ™ไธชๆจกๅž‹ๅพˆ็†Ÿๆ‚‰ใ€‚ๅฝ“ๆ—ถ่ฟ™ไธชๆจกๅž‹็š„tokenizerๆ˜ฏๅŸบไบŽllama็š„่ฏ่กจ่ฟ›่กŒๆ‰ฉๅฑ•็š„๏ผˆไนŸๆ˜ฏไธš็•Œๅธธ่ง็š„ๅšๆณ•๏ผ‰ใ€‚่€Œๅฝ“ๆ—ถ็Ž‹ไบ‘้นค็š„ๅฎž้ชŒๅฎคๅšๅ‡บๆฅไบ†ๅฆไธ€ไธช่ฏ่กจ๏ผˆไนŸๅฐฑๆ˜ฏๅŽ็ปญpangu็ณปๅˆ—็š„่ฏ่กจ๏ผ‰ใ€‚ๅฝ“ๆ—ถไธคไธช่ฏ่กจ่ฟ˜่ขซ่ฟซ่ฟ›่กŒไบ†ไธ€ๆฌก่ต›้ฉฌ๏ผŒๆœ€็ปˆๆฒกๆœ‰ๆ˜Žๆ˜พ็š„ๅฅฝๅ็ป“่ฎบใ€‚ไบŽๆ˜ฏ๏ผŒ้ข†ๅฏผๅฝ“ๅณๅ†ณๅฎš๏ผŒๅบ”่ฏฅ็ปŸไธ€่ฏ่กจ๏ผŒไฝฟ็”จ็Ž‹ไบ‘้นคไป–ไปฌ็š„ใ€‚ไบŽๆ˜ฏ๏ผŒๅœจๅŽ็ปญไปŽๅคด่ฎญ็ปƒ็š„135B V3๏ผˆไนŸๅฐฑๆ˜ฏๅฏนๅค–็š„Pangu Ultra๏ผ‰๏ผŒไพฟๆ˜ฏ้‡‡็”จไบ†่ฟ™ไธชtokenizerใ€‚่ฟ™ไนŸ่งฃ้‡Šไบ†ๅพˆๅคšไฝฟ็”จๆˆ‘ไปฌๆจกๅž‹็š„ๅ…„ๅผŸ็š„็–‘ๆƒ‘๏ผŒไธบไป€ไนˆๅฝ“ๆ—ถๅŒไธบV3ไปฃ็š„ไธคไธชไธๅŒๆกฃไฝ็š„ๆจกๅž‹๏ผŒไผšไฝฟ็”จไธๅŒ็š„tokenizerใ€‚ + + +ๆˆ‘ไปฌๆ‰“ๅฟƒ็œผ้‡Œ่ง‰ๅพ—๏ผŒ135B V3ๆ˜ฏๆˆ‘ไปฌๅ››็บตๅ›ข้˜Ÿๅฝ“ๆ—ถ็š„้ช„ๅ‚ฒใ€‚่ฟ™ๆ˜ฏ็ฌฌไธ€ไธช็œŸๆญฃๆ„ไน‰ไธŠ็š„๏ผŒๅŽไธบๅ…จๆ ˆ่‡ช็ ”๏ผŒๆญฃ็ปไปŽๅคด่ฎญ็ปƒ็š„ๅƒไบฟ็บงๅˆซ็š„ๆจกๅž‹๏ผŒไธ”ๆ•ˆๆžœไธŽ24ๅนดๅŒๆœŸ็ซžๅ“ๅฏๆฏ”็š„ใ€‚ๅ†™ๅˆฐ่ฟ™้‡Œๆˆ‘ๅทฒ็ป็ƒญๆณช็›ˆ็œถ๏ผŒๅคชไธๅฎนๆ˜“ไบ†ใ€‚ๅฝ“ๆ—ถไธบไบ†็จณๅฎš่ฎญ็ปƒ๏ผŒๅ›ข้˜Ÿๅšไบ†ๅคง้‡ๅฎž้ชŒๅฏนๆฏ”๏ผŒๅนถไธ”ๅคšๆฌกๅœจๆจกๅž‹ๆขฏๅบฆๅ‡บ็Žฐๅผ‚ๅธธ็š„ๆ—ถๅ€™่ฟ›่กŒๅŠๆ—ถๅ›ž้€€้‡ๅฏใ€‚่ฟ™ไธชๆจกๅž‹็œŸๆญฃๅšๅˆฐไบ†ๅŽ้ขๆŠ€ๆœฏๆŠฅๅ‘Šๆ‰€่ฏด็š„่ฎญ็ปƒๅ…จ็จ‹ๆฒกๆœ‰ไธ€ไธชloss spikeใ€‚ๆˆ‘ไปฌๅ…‹ๆœไบ†ไธ็Ÿฅ้“ๅคšๅฐ‘ๅ›ฐ้šพ๏ผŒๆˆ‘ไปฌๅšๅˆฐไบ†๏ผŒๆˆ‘ไปฌๆ„ฟ็”จ็”Ÿๅ‘ฝๅ’Œ่ฃ่ช‰ไฟ่ฏ่ฟ™ไธชๆจกๅž‹่ฎญ็ปƒ็š„็œŸๅฎžๆ€งใ€‚ๅคšๅฐ‘ไธชๅ‡Œๆ™จ๏ผŒๆˆ‘ไปฌไธบไบ†ๅฎƒ็š„่ฎญ็ปƒ่€Œไธ็œ ใ€‚ๅœจ่ขซๅ†…้ƒจๅฟƒๅฃฐ้ช‚็š„ไธ€ๆ–‡ไธๅ€ผ็š„ๆ—ถๅ€™๏ผŒๆˆ‘ไปฌๆœ‰ๅคšไนˆไธ็”˜๏ผŒๆœ‰ๅคšๅฐ‘็š„ๅง”ๅฑˆ๏ผŒๆˆ‘ไปฌๆŒบไฝไบ†ใ€‚ + +ๆˆ‘ไปฌ่ฟ™ๅธฎไบบๆ˜ฏ็œŸ็š„ๅœจไธบๆ‰“็ฃจๅ›ฝไบง็ฎ—ๅŠ›ๅบ•ๅบง็‡ƒ็ƒง่‡ชๅทฑ็š„้’ๆ˜ฅๅ•Šโ€ฆโ€ฆๅฎขๅฑ…ไป–ไนก๏ผŒๆˆ‘ไปฌๆ”พๅผƒไบ†ๅฎถๅบญ๏ผŒๆ”พๅผƒไบ†ๅ‡ๆœŸ๏ผŒๆ”พๅผƒไบ†ๅฅๅบท๏ผŒๆ”พๅผƒไบ†ๅจฑไน๏ผŒๆŠ›ๅคด้ข…ๆด’็ƒญ่ก€๏ผŒๅ…ถไธญ็š„่‰ฐ่พ›ไธŽๅ›ฐ่‹ฆ๏ผŒๅฏฅๅฏฅๆ•ฐ็ฌ”ไธ่ถณไปฅๆฆ‚ๆ‹ฌๅ…ถไธ‡ไธ€ใ€‚ๅœจๅ„็งๅŠจๅ‘˜ๅคงไผšไธŠ๏ผŒๅฝ“ๆ—ถๅฃๅทไธญๅ–Šๅ‡บ็š„็›˜ๅคๅฟ…่ƒœ๏ผŒๅŽไธบๅฟ…่ƒœ๏ผŒๆˆ‘ไปฌๅฟƒ้‡Œๆ˜ฏ็œŸ็š„ๆทฑๆทฑ่ขซๆ„ŸๅŠจใ€‚ + +็„ถ่€Œ๏ผŒๆˆ‘ไปฌ็š„ๆ‰€ๆœ‰่พ›่‹ฆ็š„ๆˆๆžœ๏ผŒ็ปๅธธ่ขซๅฐๆจกๅž‹ๅฎž้ชŒๅฎค่ฝป้ฃ˜้ฃ˜็š„ๆ‹ฟ่ตฐไบ†ใ€‚ๆ•ฐๆฎ๏ผŒ็›ดๆŽฅ่ฆ่ตฐใ€‚ไปฃ็ ๏ผŒ็›ดๆŽฅ่ฆ่ตฐ๏ผŒ่ฟ˜่ฆๆฑ‚ๆˆ‘ไปฌ้…ๅˆ้€‚้…ๅˆฐ่ƒฝไธ€้”ฎ่ฟ่กŒใ€‚ๆˆ‘ไปฌๅฝ“ๆ—ถๆˆ็งฐๅฐๆจกๅž‹ๅฎž้ชŒๅฎคไธบ็‚น้ผ ๆ ‡ๅฎž้ชŒๅฎคใ€‚ๆˆ‘ไปฌไป˜ๅ‡บ่พ›่‹ฆ๏ผŒไป–ไปฌๅ–ๅพ—่ฃ่€€ใ€‚ๆžœ็„ถๅบ”ไบ†้‚ฃๅฅ่ฏ๏ผŒไฝ ๅœจ่ดŸ้‡ๅ‰่กŒๆ˜ฏๅ› ไธบๆœ‰ไบบๆ›ฟไฝ ๅฒๆœˆ้™ๅฅฝใ€‚ๅœจ่ฟ™็งๆƒ…ๅ†ตไธ‹๏ผŒ่ถŠๆฅ่ถŠๅคš็š„ๆˆ˜ๅ‹ๅ†ไนŸๅšๆŒไธไธ‹ๅŽปไบ†๏ผŒ้€‰ๆ‹ฉไบ†็ฆปๅผ€ใ€‚็œ‹ๅˆฐ่บซ่พน้‚ฃไบ›ไผ˜็ง€็š„ๅŒไบ‹ไธ€ไธชไธช็ฆป่Œ๏ผŒๆˆ‘็š„ๅ†…ๅฟƒๅˆๆ„Ÿๅนๅˆ้šพ่ฟ‡ใ€‚ๅœจ่ฟ™็งไฝœๆˆ˜ไธ€ๆ ท็š„็Žฏๅขƒไธ‹๏ผŒๆˆ‘ไปฌๆฏ”่ตทๅŒไบ‹ๆฅ่ฏดๆ›ดๅƒๆ˜ฏๆˆ˜ๅ‹ใ€‚ไป–ไปฌๅœจๆŠ€ๆœฏไธŠไนŸๆœ‰ๆ— ๆ•ฐๅ€ผๅพ—ๆˆ‘ๅญฆไน ็š„ๅœฐๆ–น๏ผŒๅ ช็งฐ่‰ฏๅธˆใ€‚็œ‹ๅˆฐไป–ไปฌๅŽปไบ†่ฏธๅฆ‚ๅญ—่Š‚Seed๏ผŒDeepseek๏ผŒๆœˆไน‹ๆš—้ข๏ผŒ่…พ่ฎฏๅ’Œๅฟซๆ‰‹็ญ‰็ญ‰ๅพˆๅคšๅ‡บ่‰ฒ็š„ๅ›ข้˜Ÿ๏ผŒๆˆ‘ๆ‰“ๅฟƒ็œผ้‡Œไธบไป–ไปฌ้ซ˜ๅ…ดๅ’Œ็ฅ็ฆ๏ผŒ่„ฑ็ฆปไบ†่ฟ™ไธช่พ›่‹ฆๅด่‚ฎ่„็š„ๅœฐๆ–นใ€‚ๆˆ‘่‡ณไปŠ่ฟ˜ๅฏนไธ€ไฝ็ฆป่ŒๅŒไบ‹็š„่ฏ่ฎฐๅฟ†็Šนๆ–ฐ๏ผŒta่ฏด๏ผšโ€œๆฅ่ฟ™้‡Œๆ˜ฏๆˆ‘ๆŠ€ๆœฏ็”Ÿๆถฏไธญ็š„่€ป่พฑ๏ผŒๅœจ่ฟ™้‡Œๅ†ๅ‘†ๆฏไธ€ๅคฉ้ƒฝๆ˜ฏๆตช่ดน็”Ÿๅ‘ฝโ€ใ€‚่ฏ่™ฝ้šพๅฌๅด่ฎฉๆˆ‘ๆ— ่จ€ไปฅๅฏนใ€‚ๆˆ‘ๆ‹…ๅฟƒๆˆ‘่‡ชๅทฑๆŠ€ๆœฏๆ–น้ข็š„็งฏ็ดฏไธ่ถณ๏ผŒไปฅๅŠๆฒกๆณ•้€‚ๅบ”ไบ’่”็ฝ‘ๅ…ฌๅธ้ซ˜ๆท˜ๆฑฐ็š„็Žฏๅขƒ๏ผŒ่ฎฉๆˆ‘ๅคšๆฌกๆƒณ็ฆป่Œ็š„ๅฟƒๅง‹็ปˆๆฒกๆœ‰่ฟˆๅ‡บ่ฟ™ไธ€ๆญฅใ€‚ + +็›˜ๅค้™คไบ†denseๆจกๅž‹๏ผŒๅŽ็ปญไนŸๅฏๅŠจไบ†moe็š„ๆŽข็ดขใ€‚ไธ€ๅผ€ๅง‹่ฎญ็ปƒ็š„ๆ˜ฏไธ€ไธช224B็š„moeๆจกๅž‹ใ€‚่€ŒไธŽไน‹ๅนณ่กŒ็š„๏ผŒๅฐๆจกๅž‹ๅฎž้ชŒๅฎคไนŸๅผ€ๅฏไบ†็ฌฌไบŒๆฌกไธป่ฆ็š„ๅฅ—ๅฃณ่กŒๅŠจ๏ผˆๆฌก่ฆ็š„ๆ’ๆ›ฒๅฏ่ƒฝ่ฟ˜ๅŒ…ๆ‹ฌไธ€ไบ›ๅˆซ็š„ๆจกๅž‹๏ผŒๆฏ”ๅฆ‚mathๆจกๅž‹๏ผ‰๏ผŒๅณ่ฟ™ๆฌกๆตไผ ็”šๅนฟ็š„pangu pro moe 72Bใ€‚่ฟ™ไธชๆจกๅž‹ๅ†…้ƒจ่‡ช็งฐๆ˜ฏไปŽๅฐๆจกๅž‹ๅฎž้ชŒๅฎค็š„7Bๆ‰ฉๅขžไธŠๆฅ็š„๏ผˆๅฐฑ็ฎ—ๅฆ‚ๆญค๏ผŒ่ฟ™ไนŸไธŽๆŠ€ๆœฏๆŠฅๅ‘Šไธ็ฌฆ๏ผŒไฝ•ๅ†ตๆ˜ฏๅฅ—ๅฃณqwen 
2.5็š„14b็ปญ่ฎญ๏ผ‰ใ€‚่ฟ˜่ฎฐๅพ—ไป–ไปฌ่ฎญไบ†ๆฒกๅ‡ ๅคฉ๏ผŒๅ†…้ƒจ็š„่ฏ„ๆต‹ๅฐฑ็ซ‹ๅˆป่ฟฝไธŠไบ†ๅฝ“ๆ—ถ็š„38B V3ใ€‚AI็ณป็ปŸๅฎž้ชŒๅฎคๅพˆๅคšๅ…„ๅผŸๅ› ไธบ้œ€่ฆ้€‚้…ๆจกๅž‹๏ผŒ้ƒฝ็Ÿฅ้“ไป–ไปฌ็š„ๅฅ—ๅฃณ่กŒๅŠจ๏ผŒๅชๆ˜ฏ่ฟซไบŽๅ„็งๅŽŸๅ› ๏ผŒๆ— ๆณ•ไผธๅผ ๆญฃไน‰ใ€‚ๅฎž้™…ไธŠ๏ผŒๅฏนไบŽๅŽ็ปญ่ฎญไบ†ๅพˆไน…ๅพˆไน…็š„่ฟ™ไธชๆจกๅž‹๏ผŒHonestagi่ƒฝๅคŸๅˆ†ๆžๅ‡บ่ฟ™ไธช้‡็บง็š„็›ธไผผๆ€งๆˆ‘ๅทฒ็ปๅพˆ่ฏงๅผ‚ไบ†๏ผŒๅ› ไธบ่ฟ™ไธชๆจกๅž‹ไธบไบ†็ปญ่ฎญๆด—ๅ‚ๆ•ฐ๏ผŒๆ‰€ไป˜ๅ‡บ็š„็ฎ—ๅŠ›็”š่‡ณๆ—ฉๅฐฑ่ถณๅคŸไปŽๅคด่ฎญไธ€ไธชๅŒๆกฃไฝ็š„ๆจกๅž‹ไบ†ใ€‚ๅฌๅŒไบ‹่ฏดไป–ไปฌไธบไบ†ๆด—ๆމๅƒ้—ฎ็š„ๆฐดๅฐ๏ผŒ้‡‡ๅ–ไบ†ไธๅฐ‘ๅŠžๆณ•๏ผŒ็”š่‡ณๅŒ…ๆ‹ฌๆ•…ๆ„่ฎญไบ†่„ๆ•ฐๆฎใ€‚่ฟ™ไนŸไธบๅญฆๆœฏ็•Œ็ ”็ฉถๆจกๅž‹่ก€็ผ˜ๆไพ›ไบ†ไธ€ไธชๅ‰ๆ‰€ๆœชๆœ‰็š„็‰นๆฎŠๆจก่Œƒๅงใ€‚ไปฅๅŽๆ–ฐ็š„่ก€็ผ˜ๆ–นๆณ•ๆๅ‡บๅฏไปฅๆ‹ฟๅ‡บๆฅๆบœๆบœใ€‚ + +24ๅนดๅบ•ๅ’Œ25ๅนดๅˆ๏ผŒๅœจDeepseek v3ๅ’Œr1ๅ‘ๅธƒไน‹ๅŽ๏ผŒ็”ฑไบŽๅ…ถๆƒŠ่‰ณ็š„ๆŠ€ๆœฏๆฐดๅนณ๏ผŒๅ›ข้˜Ÿๅ—ๅˆฐไบ†ๅทจๅคง็š„ๅ†ฒๅ‡ป๏ผŒไนŸๅ—ๅˆฐไบ†ๆ›ดๅคง็š„่ดจ็–‘ใ€‚ไบŽๆ˜ฏไธบไบ†็ดง่ทŸๆฝฎๆต๏ผŒ็›˜ๅคๆจกไปฟDeepseek็š„ๆจกๅž‹ๅฐบๅฏธ๏ผŒๅผ€ๅฏไบ†718B moe็š„่ฎญ็ปƒใ€‚่ฟ™ไธชๆ—ถๅ€™๏ผŒๅฐๆจกๅž‹ๅฎž้ชŒๅฎคๅ†ๆฌกๅ‡บๆ‰‹ไบ†ใ€‚ไป–ไปฌ้€‰ๆ‹ฉไบ†ๅฅ—ๅฃณDeepseekv3็ปญ่ฎญใ€‚ไป–ไปฌ้€š่ฟ‡ๅ†ปไฝDeepseekๅŠ ่ฝฝ็š„ๅ‚ๆ•ฐ๏ผŒ่ฟ›่กŒ่ฎญ็ปƒใ€‚่ฟžไปปๅŠกๅŠ ่ฝฝckpt็š„็›ฎๅฝ•้ƒฝๆ˜ฏdeepseekv3๏ผŒๆ”น้ƒฝไธๆ”น๏ผŒไฝ•ๅ…ถๅšฃๅผ ๏ผŸไธŽไน‹็›ธๅ๏ผŒไธ€ไบ›ๆœ‰็œŸๆญฃๆŠ€ๆœฏไฟกไปฐ็š„ๅŒไบ‹๏ผŒๅœจไปŽๅคด่ฎญ็ปƒๅฆไธ€ไธช718B็š„moeใ€‚ไฝ†ๅ…ถไธญๅ‡บ็Žฐไบ†ๅ„็งๅ„ๆ ท็š„้—ฎ้ข˜ใ€‚ไฝ†ๆ˜ฏๅพˆๆ˜พ็„ถ๏ผŒ่ฟ™ไธชๆจกๅž‹ๆ€Žไนˆๅฏ่ƒฝๆฏ”็›ดๆŽฅๅฅ—ๅฃณ็š„ๅฅฝๅ‘ข๏ผŸๅฆ‚ๆžœไธๆ˜ฏๅ›ข้˜ŸleaderๅšๆŒ๏ผŒๆ—ฉๅฐฑ่ขซๅซๅœไบ†ใ€‚ + +ๅŽไธบ็š„ๆต็จ‹็ฎก็†ไน‹็น้‡๏ผŒไธฅ้‡ๆ‹–็ดฏไบ†ๅคงๆจกๅž‹็š„็ ”ๅ‘่Š‚ๅฅ๏ผŒไพ‹ๅฆ‚็‰ˆๆœฌ็ฎก็†๏ผŒๆจกๅž‹่ก€็ผ˜๏ผŒๅ„็งๆต็จ‹ๅŒ–๏ผŒๅ„็งๅฏ่ฟฝๆบฏใ€‚่ฎฝๅˆบ็š„ๆ˜ฏ๏ผŒๅฐๆจกๅž‹ๅฎž้ชŒๅฎค็š„ๆจกๅž‹ไผผไนŽไปŽๆฅไธๅ—่ฟ™ไบ›ๆต็จ‹็š„็บฆๆŸ๏ผŒๆƒณๅฅ—ๅฃณๅฐฑๅฅ—ๅฃณ๏ผŒๆƒณ็ปญ่ฎญๅฐฑ็ปญ่ฎญ๏ผŒ็ฎ—ๅŠ›ๆบๆบไธๆ–ญ็š„ไผธๆ‰‹ๆ‹ฟ่ตฐใ€‚่ฟ™็งๅผบ็ƒˆๅˆฐ่ฟ‘ไนŽ้ญ”ๅนป็š„ๅฏนๆฏ”๏ผŒ่ฏดๆ˜Žไบ†ๅฝ“ๅ‰ๆต็จ‹็ฎก็†็š„ๆƒ…ๅ†ต๏ผšๅช่ฎธๅทžๅฎ˜ๆ”พ็ซ๏ผŒไธ่ฎธ็™พๅง“็‚น็ฏใ€‚ไฝ•ๅ…ถๅฏ็ฌ‘๏ผŸไฝ•ๅ…ถๅฏๆ‚ฒ๏ผŸไฝ•ๅ…ถๅฏๆถ๏ผŸไฝ•ๅ…ถๅฏ่€ป๏ผ + +HonestAGI็š„ไบ‹ๆƒ…ๅ‡บๆฅๅŽ๏ผŒๅ†…้ƒจ่ฎฉๅคงๅฎถไธๅœ็š„็ ”่ฎจๅˆ†ๆž๏ผŒๅฆ‚ไฝ•ๅ…ฌๅ…ณๅ’Œโ€œๅ›žๅบ”โ€ใ€‚่ฏš็„ถ๏ผŒ่ฟ™ไธชๅŽŸๆ–‡็š„ๅˆ†ๆžไนŸ่ฎธไธๅคŸๆœ‰ๅŠ›๏ผŒ็ป™ไบ†็Ž‹ไบ‘้นคไธŽๅฐๆจกๅž‹ๅฎž้ชŒๅฎคไป–ไปฌ็‹ก่พฉๅ’Œ้ข ๅ€’้ป‘็™ฝ็š„ๆœบไผšใ€‚ไธบๆญค๏ผŒ่ฟ™ไธคๅคฉๆˆ‘ๅ†…ๅฟƒๆ„Ÿๅˆฐไฝœๅ‘•๏ผŒๆ—ถๆ—ถๆ€€็–‘่‡ชๅทฑ็š„ไบบ็”Ÿๆ„ไน‰ไปฅๅŠ่‹ๅคฉๆ— ็œผใ€‚ๆˆ‘ไธๅฅ‰้™ชไบ†๏ผŒๆˆ‘่ฆ็ฆป่Œไบ†๏ผŒๅŒๆ—ถๆˆ‘ไนŸๅœจ็”ณ่ฏทไปŽ็›˜ๅค้ƒจๅˆ†ๆŠ€ๆœฏๆŠฅๅ‘Š็š„ไฝœ่€…ๅๅ•ไธญ็งป้™คใ€‚ๆ›พ็ปๅœจ่ฟ™ไบ›ๆŠ€ๆœฏๆŠฅๅ‘ŠไธŠ็ฝฒๅๆ˜ฏๆˆ‘ไธ€็”Ÿ้ƒฝๆ— ๆณ•ๆŠน้™ค็š„ๆฑก็‚นใ€‚ๅฝ“ๆ—ถๆˆ‘ๆฒกๆƒณๅˆฐ๏ผŒไป–ไปฌ็ซŸ็„ถ็Œ–็‹‚ๅˆฐๆ•ขๅผ€ๆบใ€‚ๆˆ‘ๆฒกๆƒณๅˆฐ๏ผŒไป–ไปฌๆ•ขๅฆ‚ๆญคๆ„šๅผ„ไธ–ไบบ๏ผŒๅคง่‚†ๅฎฃๅ‘ใ€‚ๅฝ“ๆ—ถ๏ผŒๆˆ‘ไนŸ่ฎธๆ˜ฏๅญ˜ไบ†ไพฅๅนธๅฟƒ็†๏ผŒๆฒกๆœ‰ๆ‹’็ป็ฝฒๅใ€‚ๆˆ‘็›ธไฟกๅพˆๅคšๆ‰Žๅฎžๅšไบ‹็š„ๆˆ˜ๅ‹๏ผŒไนŸๅชๆ˜ฏ่ขซ่ฟซไธŠไบ†่ดผ่ˆน๏ผŒๆˆ–่€…ไธ็Ÿฅๆƒ…ใ€‚ไฝ†่ฟ™ไปถไบ‹ๅทฒ็ปๆ— ๆณ•ๆŒฝๅ›ž๏ผŒๆˆ‘ๅธŒๆœ›ๆˆ‘็š„ไฝ™็”Ÿ่ƒฝๅคŸๅšๆŒๆ‰Žๅฎžๅš็œŸๆญฃๆœ‰ๆ„ไน‰็š„ไบ‹๏ผŒไธบๆˆ‘ๅฝ“ๆ—ถ็š„่ฝฏๅผฑๅ’Œไธๅšๅฎš่ตŽ็ฝชใ€‚ + +ๆทฑๅคœๅ†™ๅˆฐ่ฟ™้‡Œ๏ผŒๆˆ‘ๅทฒ็ปๆณชๆตๆปก้ข๏ผŒๆณฃไธๆˆๅฃฐใ€‚่ฟ˜่ฎฐๅพ—ไธ€ไบ›ๅ‡บ่‰ฒ็š„ๅŒไบ‹็ฆป่Œๆ—ถ๏ผŒๆˆ‘่‹ฆ็ฌ‘้—ฎไป–ไปฌ่ฆไธ่ฆๅ‘ไธช้•ฟ้•ฟ็š„ๅฟƒๅฃฐๆƒฏไพ‹ๅธ–๏ผŒๆญ้œฒไธ€ไธ‹็Žฐ็Šถใ€‚ๅฏนๆ–น่ฏด๏ผšไธไบ†๏ผŒๆตช่ดนๆ—ถ้—ด๏ผŒ่€Œไธ”ๆˆ‘ไนŸๆ€•ๆญ้œฒๅ‡บๆฅไฝ ไปฌ่ฟ‡็š„ๆ›ด็ณŸใ€‚ๆˆ‘ๅฝ“ๆ—ถไธ€ไธ‹้ปฏ็„ถ็ฅžไผค๏ผŒๅ› ไธบๆ›พ็ปๅ…ฑๅŒไธบไบ†็†ๆƒณๅฅ‹ๆ–—่ฟ‡็š„ๆˆ˜ๅ‹ๅทฒ็ปๅฝปๅบ•ๅฏนๅŽไธบๅฝปๅบ•็ฐๅฟƒไบ†ใ€‚ๅฝ“ๆ—ถๅคงๅฎถ่ฐƒไพƒ๏ผŒๆˆ‘ไปฌ็”จ็€ๅฝ“ๅนดๅ…ฑไบงๅ…š็š„ๅฐ็ฑณๅŠ ๆญฅๆžช๏ผŒ็ป„็ป‡ๅดๆœ‰็€ๅ ชๆฏ”ๅฝ“ๅนดๅ›ฝๆฐ‘ๅ…š็š„ไฝœ้ฃŽใ€‚ + +ๆ›พๅ‡ ไฝ•ๆ—ถ๏ผŒๆˆ‘ไธบๆˆ‘ไปฌ็”จ็€ๅฐ็ฑณๅŠ ๆญฅๆžชๆ‰“่ดฅๆด‹ๆžชๆด‹็‚ฎ่€Œ่‡ช่ฑชใ€‚ + +็Žฐๅœจ๏ผŒๆˆ‘็ดฏไบ†๏ผŒๆˆ‘ๆƒณๆŠ•้™ใ€‚ + 
+ๅ…ถๅฎžๆ—ถ่‡ณไปŠๆ—ฅ๏ผŒๆˆ‘่ฟ˜ๆ˜ฏ็œŸๅฟƒๅธŒๆœ›ๅŽไธบ่ƒฝ่ฎค็œŸๅธๅ–ๆ•™่ฎญ๏ผŒ่ƒฝๅšๅฅฝ็›˜ๅค๏ผŒๆŠŠ็›˜ๅคๅšๅˆฐไธ–็•Œไธ€ๆต๏ผŒๆŠŠๆ˜‡่…พๅ˜ๆˆ่‹ฑไผŸ่พพ็š„ๆฐดๅนณใ€‚ๅ†…้ƒจ็š„ๅŠฃๅธ้ฉฑ้€่‰ฏๅธ๏ผŒไฝฟๅพ—่ฏบไบšไนƒ่‡ณๅŽไธบๅœจ็Ÿญๆ—ถ้—ดๅ†…ๆ€ฅๅ‰งๆตๅคฑไบ†ๅคง้‡ๅ‡บ่‰ฒ็š„ๅคงๆจกๅž‹ไบบๆ‰ใ€‚็›ธไฟกไป–ไปฌไนŸๆญฃๅœจๅฆ‚Deepseek็ญ‰ๅ„ไธชๅ›ข้˜Ÿ้—ช่€€็€๏ผŒๆ–ฝๅฑ•็€ไป–ไปฌ็š„ๆŠฑ่ดŸๆ‰ๅŽ๏ผŒไธบไธญ็พŽๅœจAI็š„ๆฟ€็ƒˆ็ซž่ต›ไธญๅฅ‰็ŒฎๅŠ›้‡ใ€‚ๆˆ‘ๆ—ถๅธธๆ„Ÿๅน๏ผŒๅŽไธบไธๆ˜ฏๆฒกๆœ‰ไบบๆ‰๏ผŒ่€Œๆ˜ฏๆ นๆœฌไธ็Ÿฅ้“ๆ€Žไนˆ็•™ไฝไบบๆ‰ใ€‚ๅฆ‚ๆžœ็ป™่ฟ™ไบ›ไบบๅˆ้€‚็š„็Žฏๅขƒ๏ผŒๅˆ้€‚็š„่ต„ๆบ๏ผŒๆ›ดๅฐ‘็š„ๆžท้”๏ผŒๆ›ดๅฐ‘็š„ๆ”ฟๆฒปๆ–—ไบ‰๏ผŒ็›˜ๅคไฝ•ๆ„ไธๆˆ๏ผŸ + +ๆœ€ๅŽ๏ผšๆˆ‘ไปฅ็”Ÿๅ‘ฝ๏ผŒไบบๆ ผๅ’Œ่ฃ่ช‰ๅ‘่ช“๏ผŒๆˆ‘ๅ†™็š„ไปฅไธŠๆ‰€ๆœ‰ๅ†…ๅฎนๅ‡ไธบ็œŸๅฎž๏ผˆ่‡ณๅฐ‘ๅœจๆˆ‘ๆœ‰้™็š„่ฎค็Ÿฅ่Œƒๅ›ดๅ†…๏ผ‰ใ€‚ๆˆ‘ๆฒกๆœ‰้‚ฃไนˆ้ซ˜็š„ๆŠ€ๆœฏๆฐดๅนณไปฅๅŠๆœบไผšๅŽปๅš่ฏฆๅฐฝๆ‰Žๅฎž็š„ๅˆ†ๆž๏ผŒไนŸไธๆ•ข็›ดๆŽฅ็”จๅ†…้ƒจ่ฎฐๅฝ•ไธพ่ฏ๏ผŒๆ€•ๅ› ไธบไฟกๆฏๅฎ‰ๅ…จๆŠ“ๅˆฐใ€‚ไฝ†ๆ˜ฏๆˆ‘็›ธไฟกๆˆ‘ๅพˆๅคšๆ›พ็ป็š„ๆˆ˜ๅ‹๏ผŒไผšไธบๆˆ‘ไฝœ่ฏใ€‚ๅœจๅŽไธบๅ†…้ƒจ็š„ๅ…„ๅผŸ๏ผŒๅŒ…ๆ‹ฌๆˆ‘ไปฌๆ›พ็ปๆœๅŠก่ฟ‡็š„ไบงๅ“็บฟๅ…„ๅผŸไปฌ๏ผŒ็›ธไฟกๆœฌๆ–‡็š„ๆ— ๆ•ฐ็ป†่Š‚่ƒฝๅ’Œไฝ ไปฌ็š„ๅฐ่ฑกๅฏน็…ง๏ผŒๅฐ่ฏๆˆ‘็š„่ฏดๆณ•ใ€‚ไฝ ไปฌๅฏ่ƒฝไนŸๆ›พ็ป่ขซ่’™้ช—๏ผŒไฝ†่ฟ™ไบ›ๆฎ‹้…ท็š„็œŸ็›ธไธไผš่ขซๅฐ˜ๅฐใ€‚ๆˆ‘ไปฌๅฅ‹ๆˆ˜่ฟ‡็š„็—•่ฟน๏ผŒไนŸไธๅบ”่ฏฅ่ขซๆ‰ญๆ›ฒๅ’ŒๅŸ‹่‘ฌใ€‚ + +ๅ†™ไบ†่ฟ™ไนˆๅคš๏ผŒๆŸไบ›ไบบ่‚ฏๅฎšๆƒณๆŠŠๆˆ‘ๆ‰พๅ‡บๆฅ๏ผŒๆŠนๆ€ๆމใ€‚ๅ…ฌๅธๆžไธๅฅฝไนŸๆƒณ่ฎฉๆˆ‘ๅ™คๅฃฐไนƒ่‡ณ่ฟฝ่ดฃใ€‚ๅฆ‚ๆžœ็œŸ็š„่ฟ™ๆ ท๏ผŒๆˆ‘๏ผŒไนƒ่‡ณๆˆ‘็š„ๅฎถไบบ็š„ไบบ่บซไนƒ่‡ณ็”Ÿๅ‘ฝๅฎ‰ๅ…จๅฏ่ƒฝ้ƒฝไผšๅ—ๅˆฐๅจ่ƒใ€‚ไธบไบ†่‡ชๆˆ‘ไฟๆŠค๏ผŒๆˆ‘่ฟ‘ๆœŸๆฏๅคฉไผš่ทŸๅคงๅฎถๆŠฅๅนณๅฎ‰ใ€‚ + +ๅฆ‚ๆžœๆˆ‘ๆถˆๅคฑไบ†๏ผŒๅฐฑๅฝ“ๆ˜ฏๆˆ‘ไธบไบ†็œŸ็†ๅ’Œ็†ๆƒณ๏ผŒไธบไบ†ๅŽไธบไนƒ่‡ณไธญๅ›ฝ่ƒฝๅคŸๆ›ดๅฅฝๅœฐๅ‘ๅฑ•็ฎ—ๅŠ›ๅ’ŒAI่€Œ็‰บ็‰ฒไบ†ๅง๏ผŒๆˆ‘ๆ„ฟๅŸ‹่‘ฌไบŽ้‚ฃ็‰‡ๆ›พ็ปๅฅ‹ๆ–—่ฟ‡็š„ๅœฐๆ–นใ€‚ + +่ฏบไบš๏ผŒๅ†่ง + +2025ๅนด7ๆœˆ6ๆ—ฅๅ‡Œๆ™จ ๅ†™ไบŽๆทฑๅœณ + --- -# LEANN-RAG Evaluation Data +ๅ„ไฝๅฅฝ๏ผŒ -This repository contains the necessary data to run the recall evaluation scripts for the [LEANN-RAG](https://huggingface.co/LEANN-RAG) project. +ๆ„Ÿ่ฐขๅคงๅฎถ็š„ๅ…ณๅฟƒไธŽ็ฅ็ฆใ€‚ๆˆ‘็›ฎๅ‰ๆš‚ๆ—ถๅฎ‰ๅ…จ๏ผŒไฝ†ๅ…ฌๅธๅบ”่ฏฅๅœจ่ฟ›่กŒๆŽ’ๆŸฅไธŽๆŸไบ›ๅๅ•ๆ”ถ้›†๏ผŒๅŽ็ปญๆƒ…ๅ†ตๆœช็Ÿฅใ€‚ -## Dataset Components +ๆˆ‘่กฅๅ……ไธ€ไบ›็ป†่Š‚๏ผŒไปฅๅ…ๆŸไบ›ไบบ็ปง็ปญ้ข ๅ€’้ป‘็™ฝใ€‚ -This dataset is structured into three main parts: +ๅ…ณไบŽ135B V2๏ผŒๅฐๆจกๅž‹ๅฎž้ชŒๅฎคๅœจ่ฟ…้€ŸๅœฐๅฎŒๆˆๅฅ—ๅฃณๅนถๆ‹ฟๅฎŒๆ‰€ๆœ‰ๅฅ—ๅฃณๅธฆๆฅ็š„ๅฅฝๅค„ๅŽ๏ผˆๆฏ”ๅฆ‚ไปปๅŠกไปค่กจๅฝฐๅ’ŒๅŠๆ—ถๆฟ€ๅŠฑ๏ผ‰๏ผŒๅ› ไธบไธๆƒณ็ปง็ปญๆ”ฏๆ’‘ไธ‹ๆธธๅบ”็”จๅ’Œๆจกๅž‹่ฟญไปฃ๏ผŒๅˆๆŠŠ่ฟ™ไธช็ƒซๆ‰‹ๅฑฑ่Š‹็”ฉ็ป™ไบ†ๅ››็บตใ€‚็กฎๅฎžๆŠ€้ซ˜ไธ€็ญน๏ผŒ็›ดๆŽฅๆŠŠๅ››็บต็š„ๅ…„ๅผŸไปฌๆ‹‰ไธ‹ๆฐดใ€‚ๅŒไบ‹ๆไพ›่ฟ‡ๅŽปไธ€ไธช่€ๆ—ง็š„ๆจกๅž‹๏ผŒๆœ€็ปˆๆ‹ฟๅ›žไบ†ไธ€ไธชๅฝ“ๆ—ถไธ€ไธช้ญ”ๆ”น็š„ๅ…ˆ่ฟ›็š„ๅƒ้—ฎใ€‚ๅšๅคงๆจกๅž‹็š„ไบบ๏ผŒ่‡ชๅทฑๅš็š„ๆจกๅž‹ๅฐฑๅƒ่‡ชๅทฑๅญฉๅญไธ€ๆ ท็†Ÿๆ‚‰๏ผŒไธ่ฆๆŠŠๅˆซไบบ้ƒฝๅฝ“ๅ‚ปๅญใ€‚ๅฐฑๅƒ่‡ชๅฎถๅ„ฟๅญๅ‡บ้—จไธ€่ถŸ๏ผŒๅ›žๆฅไธชๅˆซไบบๅฎถๅญฉๅญใ€‚ -1. **Pre-built LEANN Indices**: - * `dpr/`: A pre-built index for the DPR dataset. - * `rpj_wiki/`: A pre-built index for the RPJ-Wiki dataset. - These indices were created using the `leann-core` library and are required by the `LeannSearcher`. +็›˜ๅคreport็š„็ฝฒๅๆ˜ฏไธ็ฌฆๅˆๅญฆๆœฏ่ง„่Œƒ็š„ใ€‚ไพ‹ๅฆ‚๏ผŒ135B V3ๆœ‰ไธๅฐ‘ๆœ‰ๆŠ€ๆœฏ่ดก็Œฎ็š„ไบบ๏ผŒๅ› ไธบไฝœ่€…ๅ้ขๆ•ฐ้‡้™ๅˆถ๏ผŒๅŠณๅŠจๆˆๆžœๆฒกๆœ‰ๅพ—ๅˆฐๅบ”ๆœ‰็š„ๅ›žๆŠฅ๏ผŒๅ›ข้˜Ÿๅ†…ๆ›พ็ปๆœ‰ไธๅฐ็š„ๆ„่งใ€‚่ฟ™ไธชๆจกๅž‹ๅฝ“ๆ—ถๆ˜ฏๅคงๅฎถๆ™บๆ…งๅ’Œๆฑ—ๆฐด็š„็ป“ๆ™ถ๏ผŒ็”š่‡ณๆ˜ฏๅ›ข้˜Ÿๅฝ“ๆ—ถ็š„็ฒพ็ฅžๆ”ฏๆŸฑ๏ผŒๆ”ฏๆ’‘็€ไธๅฐ‘ๅ…„ๅผŸไปฌ็ปง็ปญ็•™ๅœจ่ฏบไบšใ€‚ๆ‰€่ฐ“็š„ๅ้ข้™ๅˆถ๏ผŒไปฅๅŠๆŒ‚ๅไบ†ไธ€ไบ›ๆฏซๆ— ๆŠ€ๆœฏ่ดก็Œฎ็š„ไบบ๏ผˆๅฆ‚ไธ€ไบ›ๅฐๆจกๅž‹ๅฎž้ชŒๅฎค็š„ไบบ๏ผ‰๏ผŒ่ฎฉๅ…„ๅผŸไปฌไฝ•ๅ…ถๅฟƒๅฏ’ใ€‚ -2. 
**Ground Truth Data**: - * `ground_truth/`: Contains the ground truth files (`flat_results_nq_k3.json`) for both the DPR and RPJ-Wiki datasets. These files map queries to the original passage IDs from the Natural Questions benchmark, evaluated using the Contriever model. +--- -3. **Queries**: - * `queries/`: Contains the `nq_open.jsonl` file with the Natural Questions queries used for the evaluation. - -## Usage - -To use this data, you can download it locally using the `huggingface-hub` library. First, install the library: - -```bash -pip install huggingface-hub -``` - -Then, you can download the entire dataset to a local directory (e.g., `data/`) with the following Python script: - -```python -from huggingface_hub import snapshot_download - -snapshot_download( - repo_id="LEANN-RAG/leann-rag-evaluation-data", - repo_type="dataset", - local_dir="data" -) -``` - -This will download all the necessary files into a local `data` folder, preserving the repository structure. The evaluation scripts in the main [LEANN-RAG Space](https://huggingface.co/LEANN-RAG) are configured to work with this data structure. +ๆš‚ๆ—ถๅนณๅฎ‰ใ€‚ๅฆๅค–๏ผŒๆ”ฏๆŒๆˆ‘ๅ‹‡ไบŽ่ฏดๅ‡บ็œŸ็›ธ็š„ๆˆ˜ๅ‹ไปฌ https://github.com/HW-whistleblower/True-Story-of-Pangu/issues/317
diff --git a/docs/features.md b/docs/features.md index 875f9cf..51c0c4f 100644 --- a/docs/features.md +++ b/docs/features.md @@ -13,7 +13,7 @@ - **🚀 High-throughput Embedding Pipeline** - Optimized batched processing for maximum efficiency - **🎯 Two-level Search** - Novel coarse-to-fine search overlap for accelerated query processing (optional) - **💾 Memory-mapped Indices** - Fast startup with raw text mapping to reduce memory overhead -- **🚀 MLX Support** - Ultra-fast recompute/build with quantized embedding models, accelerating building and search ([minimal example](test/build_mlx_index.py)) +- **🚀 MLX Support** - Ultra-fast recompute/build with quantized embedding models, accelerating building and search ([minimal example](../examples/mlx_demo.py)) ## 🎨 Developer Experience
diff --git a/docs/normalized_embeddings.md b/docs/normalized_embeddings.md index 46213e5..e873489 100644 --- a/docs/normalized_embeddings.md +++ b/docs/normalized_embeddings.md @@ -72,4 +72,4 @@ Using the wrong distance metric with normalized embeddings can lead to: - **Incorrect ranking** of search results - **Suboptimal performance** compared to using the correct metric -For more details on why this happens, see our analysis of [OpenAI embeddings with MIPS](../examples/main_cli_example.py). +For more details on why this happens, see our analysis in the [embedding detection code](../packages/leann-core/src/leann/api.py), which automatically detects normalized embeddings and handles the MIPS distance metric issues.
diff --git a/examples/simple_demo.py b/examples/basic_demo.py similarity index 96% rename from examples/simple_demo.py rename to examples/basic_demo.py index 0d16776..05b2651 100644 --- a/examples/simple_demo.py +++ b/examples/basic_demo.py @@ -1,6 +1,6 @@ """ Simple demo showing basic leann usage -Run: uv run python examples/simple_demo.py +Run: uv run python examples/basic_demo.py """ import argparse @@ -81,7 +81,7 @@ def main(): print() print("Demo completed! 
Try running:") - print(" uv run python examples/document_search.py") + print(" uv run python apps/document_rag.py") if __name__ == "__main__": diff --git a/examples/data/README.md b/examples/data/README.md deleted file mode 100644 index 0dd35da..0000000 --- a/examples/data/README.md +++ /dev/null @@ -1,82 +0,0 @@ -# ็›˜ๅคไน‹ๆฎ‡๏ผšๅŽไธบ่ฏบไบš็›˜ๅคๅคงๆจกๅž‹็ ”ๅ‘ๅކ็จ‹็š„ๅฟƒ้…ธไธŽ้ป‘ๆš— - -ๅ„ไฝๅฅฝ๏ผŒ - -ๆˆ‘ๆ˜ฏไธ€ๅ็›˜ๅคๅคงๆจกๅž‹ๅ›ข้˜Ÿ๏ผŒๅŽไธบ่ฏบไบšๆ–น่ˆŸๅฎž้ชŒๅฎค็š„ๅ‘˜ๅทฅใ€‚ - -้ฆ–ๅ…ˆไธบ่‡ช่ฏ่บซไปฝ๏ผŒๅˆ—ไธพไธ€ไบ›็ป†่Š‚๏ผš - -1. ็Žฐ่ฏบไบšไธปไปป๏ผŒๅ‰็ฎ—ๆณ•ๅบ”็”จ้ƒจ้ƒจ้•ฟ๏ผŒๅŽๆ”นๅไธบๅฐๆจกๅž‹ๅฎž้ชŒๅฎค็š„ไธปไปป็Ž‹ไบ‘้นคใ€‚ๅ‰่ฏบไบšไธปไปป๏ผšๅงš้ช๏ผˆๅคงๅฎถ็งฐๅงš่€ๅธˆ๏ผ‰ใ€‚ๅ‡ ไธชๅฎž้ชŒๅฎคไธปไปป๏ผšๅ”็ฟๆ˜Ž๏ผˆๆ˜Žๅ“ฅ๏ผŒๆ˜Ž้˜Ÿ๏ผŒๅทฒ็ฆป่Œ๏ผ‰๏ผŒๅฐšๅˆฉๅณฐ๏ผŒๅผ ็ปด๏ผˆ็ปดๅ“ฅ๏ผ‰๏ผŒ้ƒๅปบไธš๏ผˆ้ƒ่€ๅธˆ๏ผ‰๏ผŒๅˆ˜ๆญฆ้พ™๏ผˆ็งฐๅ‘ผไธบๆญฆ้พ™ๆ‰€๏ผ‰็ญ‰ใ€‚ๅ…ถไป–้ชจๅนฒๆˆๅ‘˜ๅ’Œไธ“ๅฎถ้™†็ปญๆœ‰ๅพˆๅคšไบบ็ฆป่Œใ€‚ -2. ๆˆ‘ไปฌ้šถๅฑžไบŽโ€œๅ››้‡Žโ€่ฟ™ไธช็ป„็ป‡ใ€‚ๅ››้‡Žไธ‹ๅฑžๆœ‰่ฎธๅคš็บต้˜Ÿ๏ผŒๅŸบ็ก€่ฏญ่จ€ๅคงๆจกๅž‹ๆ˜ฏๅ››็บตใ€‚็Ž‹ไบ‘้นค็š„ๅฐๆจกๅž‹ๆ˜ฏๅๅ…ญ็บต้˜Ÿใ€‚ๆˆ‘ไปฌๅ‚ๅŠ ่ฟ‡่‹ๅทž็š„้›†็ป“๏ผŒๆœ‰ๅ„็งๆœˆไปฝ็š„ๆ—ถ้—ด่Š‚็‚นใ€‚ๅœจ่‹ๅทžๆ”ปๅ…ณไผš้ขๅ‘ไปปๅŠกไปค๏ผŒ้œ€่ฆๅœจ่Š‚็‚นๅ‰่พพๆˆ็›ฎๆ ‡ใ€‚่‹ๅทž้›†็ป“ไผšๆŠŠๅ„ๅœฐ็š„ไบบๅ‘˜้ƒฝ้›†ไธญๅœจ่‹ๅทž็ ”็ฉถๆ‰€๏ผŒๅนณๅธธไฝๅฎพ้ฆ†๏ผŒๆฏ”ๅฆ‚ๅœจ็”ช็›ด็š„้…’ๅบ—๏ผŒไธŽๅฎถไบบๅญฉๅญๅคฉๅ„ไธ€ๆ–นใ€‚ -3. ๅœจ่‹ๅทž้›†็ป“็š„ๆ—ถๅ€™ๅ‘จๅ…ญ้ป˜่ฎคไธŠ็ญ๏ผŒ้žๅธธ่พ›่‹ฆ๏ผŒไธ่ฟ‡ๅ‘จๅ…ญๆœ‰ไธ‹ๅˆ่Œถ๏ผŒๆœ‰ไธ€ๆฌก่ฟ˜ๆœ‰ๅฐ้พ™่™พใ€‚ๅœจ่‹ๅทž็ ”็ฉถๆ‰€็š„ๅทฅไฝๆฌ่ฟ่ฟ‡ไธ€ๆฌก๏ผŒไปŽไธ€ๆ ‹ๆฅผๆขๅˆฐไบ†ๅฆไธ€ๆ ‹ใ€‚่‹ๅทž็ ”็ฉถๆ‰€ๆฅผๆ ‹้ƒฝๆ˜ฏๆฌงๅผ่ฃ…ไฟฎ๏ผŒ้—จๅฃๆœ‰ๅคงๅก๏ผŒ้‡Œ้ขๆ™ฏ่‰ฒๅพˆไธ้”™ใ€‚ๅŽป่‹ๅทž้›†็ป“ไธ€่ˆฌ่‡ณๅฐ‘่ฆๅŽปไธ€ๅ‘จ๏ผŒ็”š่‡ณๆ›ดไน…๏ผŒๅคš็š„ไบบ็”š่‡ณไธ€ไธคไธชๆœˆ้ƒฝๅ›žไธไบ†ๅฎถใ€‚ -4. ่ฏบไบšๆ›พ็ปไผ ่ฏดๆ˜ฏ็ ”็ฉถๅž‹็š„๏ผŒไฝ†ๆ˜ฏๆฅไบ†ไน‹ๅŽๅ› ไธบๅœจๅ››้‡Žๅšๅคงๆจกๅž‹้กน็›ฎ๏ผŒ้กน็›ฎๆˆๅ‘˜ๅฎŒๅ…จๅ˜ๆˆไบ†ไบคไป˜ๅž‹็š„๏ผŒไธ”ๅ……ๆปกไบ†ไพ‹ไผš๏ผŒ่ฏ„ๅฎก๏ผŒๆฑ‡ๆŠฅใ€‚ๅพˆๅคšๆ—ถๅ€™ๅšๅฎž้ชŒ้ƒฝ่ฆ็”ณ่ฏทใ€‚ๅ›ข้˜Ÿ้œ€่ฆๅฏนๆŽฅ็ปˆ็ซฏๅฐ่‰บ๏ผŒๅŽไธบไบ‘๏ผŒICT็ญ‰่ฏธๅคšไธšๅŠก็บฟ๏ผŒไบคไป˜ๅŽ‹ๅŠ›ไธๅฐใ€‚ -5. 
่ฏบไบš็ ”ๅ‘็š„็›˜ๅคๆจกๅž‹ๆ—ฉๆœŸๅ†…้ƒจไปฃๅทๅซๅšโ€œ็›˜ๅคๆ™บๅญโ€๏ผŒไธ€ๅผ€ๅง‹ๅชๆœ‰ๅ†…้ƒจ้œ€่ฆ็”ณ่ฏท่ฏ•็”จ็š„็ฝ‘้กต็‰ˆ๏ผŒๅˆฐๅŽ็ปญ่ฟซไบŽๅŽ‹ๅŠ›ๅœจwelinkไธŠๆŽฅๅ…ฅๅ’Œๅ…ฌๆต‹ๅผ€ๆ”พใ€‚ - -่ฟ™ไบ›ๅคฉๅ‘็”Ÿๅ…ณไบŽ่ดจ็–‘็›˜ๅคๅคงๆจกๅž‹ๆŠ„่ขญๅƒ้—ฎ็š„ไบ‹ๆƒ…้—น็š„ๆฒธๆฒธๆ‰ฌๆ‰ฌใ€‚ไฝœไธบไธ€ไธช็›˜ๅคๅ›ข้˜Ÿ็š„ๆˆๅ‘˜๏ผŒๆˆ‘ๆœ€่ฟ‘ๅคœๅคœ่พ—่ฝฌๅไพง๏ผŒ้šพไปฅๅ…ฅ็œ ใ€‚็›˜ๅค็š„ๅ“็‰Œๅ—ๅˆฐๅฆ‚ๆญคๅคง็š„ๅฝฑๅ“๏ผŒไธ€ๆ–น้ข๏ผŒๆˆ‘่‡ช็ง็š„ไธบๆˆ‘็š„่Œไธšๅ‘ๅฑ•ๆ‹…ๅฟง๏ผŒไนŸไธบ่‡ชๅทฑ่ฟ‡ๅŽป็š„ๅŠชๅŠ›ๅทฅไฝœๆ„Ÿๅˆฐไธๅ€ผใ€‚ๅฆไธ€ๆ–น้ข๏ผŒ็”ฑไบŽๆœ‰ไบบๅผ€ๅง‹ๆญ้œฒ่ฟ™ไบ›ไบ‹ๆƒ…ๆˆ‘ๅ†…ๅฟƒๅˆๆ„Ÿๅˆฐๅคงๅฟซไบบๅฟƒใ€‚ๅœจๅคšๅฐ‘ไธชๆ—ฅๆ—ฅๅคœๅคœ๏ผŒๆˆ‘ไปฌๅฏนๅ†…้ƒจๆŸไบ›ไบบไธ€ๆฌกๆฌก้ ็€้€ ๅ‡่€Œๅˆ่Žทๅพ—ไบ†ๆ— ๆ•ฐๅˆฉ็›Š็š„่กŒไธบๅ’ฌ็‰™ๅˆ‡้ฝฟ่€Œๅˆๆ— ่ƒฝไธบๅŠ›ใ€‚่ฟ™็งๅŽ‹ๆŠ‘ๅ’Œ็พž่พฑไนŸ้€ๆธๆถˆ็ฃจไบ†ๆˆ‘ๅฏนๅŽไธบ็š„ๆ„Ÿๆƒ…๏ผŒ่ฎฉๆˆ‘ๅœจ่ฟ™้‡Œ็š„ๆ—ถๆ—ฅ้€ๆธๆต‘ๆต‘ๅ™ฉๅ™ฉ๏ผŒ่ฟท่Œซๆ— ๆŽช๏ผŒๆ—ถๅธธๆ€€็–‘่‡ชๅทฑ็š„ไบบ็”Ÿๅ’Œ่‡ชๆˆ‘ไปทๅ€ผใ€‚ - -ๆˆ‘ๆ‰ฟ่ฎคๆˆ‘ๆ˜ฏไธ€ไธชๆ‡ฆๅผฑ็š„ไบบ๏ผŒไฝœไธบไธ€ไธชๅฐๅฐ็š„ๆ‰“ๅทฅไบบ๏ผŒๆˆ‘ไธไป…ไธๆ•ขๅ’Œ็Ž‹ไบ‘้นค็ญ‰ๅ†…้ƒจๆ‰‹็œผ้€šๅคฉ็š„ไบบๅšๅฏน๏ผŒๆ›ดไธๆ•ขๅ’ŒๅŽไธบ่ฟ™ๆ ท็š„ๅบž็„ถๅคง็‰ฉๅšๅฏนใ€‚ๆˆ‘ๅพˆๆ€•ๅคฑๅŽปๆˆ‘็š„ๅทฅไฝœ๏ผŒๆฏ•็ซŸๆˆ‘ไนŸๆœ‰ๅฎถไบบๅ’Œๅญฉๅญ๏ผŒๆ‰€ไปฅๆˆ‘ๆ‰“ๅฟƒ็œผ้‡Œๅพˆไฝฉๆœๆญ้œฒ่€…ใ€‚ไฝ†ๆ˜ฏ๏ผŒ็œ‹ๅˆฐๅ†…้ƒจ่ฟ˜ๅœจ่ฏ•ๅ›พๆด—ๅœฐๆŽฉ็›–ไบ‹ๅฎž๏ผŒ่’™่”ฝๅ…ฌไผ—็š„ๆ—ถๅ€™๏ผŒๆˆ‘ๅฎžๅœจไธ่ƒฝๅฎนๅฟไบ†ใ€‚ๆˆ‘ไนŸๅธŒๆœ›ๅ‹‡ๆ•ขไธ€ๆฌก๏ผŒ้กบไปŽ่‡ชๅทฑๆœฌๅฟƒใ€‚ๅฐฑ็ฎ—่‡ชๆŸๅ…ซ็™พ๏ผŒๆˆ‘ไนŸๅธŒๆœ›่ƒฝไผคๆ•Œไธ€ๅƒใ€‚ๆˆ‘ๅ†ณๅฎšๆŠŠๆˆ‘ๅœจ่ฟ™้‡Œ็š„ๆ‰€่งๆ‰€้—ป๏ผˆ้ƒจๅˆ†ๆฅ่‡ชไบŽๅŒไบ‹ๅฃ่ฟฐ๏ผ‰ๅ…ฌๅธƒๅ‡บๆฅ๏ผŒๅ…ณไบŽ็›˜ๅคๅคงๆจกๅž‹็š„โ€œไผ ๅฅ‡ๆ•…ไบ‹โ€๏ผš - -ๅŽไธบ็กฎๅฎžไธป่ฆๅœจๆ˜‡่…พๅกไธŠ่ฎญ็ปƒๅคงๆจกๅž‹๏ผˆๅฐๆจกๅž‹ๅฎž้ชŒๅฎคๆœ‰ไธๅฐ‘่‹ฑไผŸ่พพ็š„ๅก๏ผŒไป–ไปฌไน‹ๅ‰ไนŸไผš็”จๆฅ่ฎญ็ปƒ๏ผŒๅŽ้ข่ฝฌ็งปๅˆฐๆ˜‡่…พ๏ผ‰ใ€‚ๆ›พ็ปๆˆ‘่ขซๅŽไธบโ€œๆ‰“้€ ไธ–็•Œ็ฌฌไบŒ้€‰ๆ‹ฉโ€็š„ๅ†ณๅฟƒ่€ŒๆŠ˜ๆœ๏ผŒๆˆ‘ๆœฌ่บซไนŸๆ›พ็ปๅฏนๅŽไธบๆœ‰ๆทฑๅŽš็š„ๆ„Ÿๆƒ…ใ€‚ๆˆ‘ไปฌ้™ช็€ๆ˜‡่…พไธ€ๆญฅๆญฅๆ‘ธ็ˆฌๆปšๆ‰“๏ผŒไปŽๅ……ๆปกbugๅˆฐ็Žฐๅœจ่ƒฝ่ฎญๅ‡บๆจกๅž‹๏ผŒไป˜ๅ‡บไบ†ๅทจๅคง็š„ๅฟƒ่ก€ๅ’Œไปฃไปทใ€‚ - -ๆœ€ๅˆๆˆ‘ไปฌ็š„็ฎ—ๅŠ›้žๅธธๆœ‰้™๏ผŒๅœจ910AไธŠ่ฎญ็ปƒๆจกๅž‹ใ€‚้‚ฃไผšๅชๆ”ฏๆŒfp16๏ผŒ่ฎญ็ปƒ็š„็จณๅฎšๆ€ง่ฟœไธๅฆ‚bf16ใ€‚็›˜ๅค็š„moeๅผ€ๅง‹ๅพˆๆ—ฉ๏ผŒ23ๅนดๅฐฑไธป่ฆๆ˜ฏ่ฎญ็ปƒ38Bmoeๆจกๅž‹ๅ’ŒๅŽ็ปญ็š„71B denseๆจกๅž‹ใ€‚71B็š„denseๆจกๅž‹้€š่ฟ‡ๆ‰ฉๅขžๅ˜ๆˆไบ†็ฌฌไธ€ไปฃ็š„135Bdenseๆจกๅž‹๏ผŒๅŽ้ขไธปๅŠ›ๆจกๅž‹ไนŸ้€ๆธๅœจ910BไธŠ่ฎญ็ปƒใ€‚ - -71Bๅ’Œ135Bๆจกๅž‹้ƒฝๆœ‰ไธ€ไธชๅทจๅคง็š„็กฌไผคๅฐฑๆ˜ฏtokenizerใ€‚ๅฝ“ๆ—ถไฝฟ็”จ็š„tokenizer็ผ–็ ๆ•ˆ็އๆžไฝŽ๏ผŒๆฏไธชๅ•ไธช็š„็ฌฆๅท๏ผŒๆ•ฐๅญ—๏ผŒ็ฉบๆ ผ๏ผŒไนƒ่‡ณๆฑ‰ๅญ—้ƒฝไผšๅ ็”จไธ€ไธชtokenใ€‚ๅฏๆƒณ่€Œ็Ÿฅ่ฟ™ไผš้žๅธธๆตช่ดน็ฎ—ๅŠ›๏ผŒไธ”ไฝฟๅพ—ๆจกๅž‹็š„ๆ•ˆๆžœๅพˆๅทฎใ€‚่ฟ™ๆ—ถๅ€™ๅฐๆจกๅž‹ๅฎž้ชŒๅฎคๆญฃๅฅฝๆœ‰ไธช่‡ชๅทฑ่ฎญ็š„่ฏ่กจใ€‚ๅงš่€ๅธˆๅฝ“ๆ—ถๆ€€็–‘ๆ˜ฏไธๆ˜ฏๆจกๅž‹็š„tokenizerไธๅฅฝ๏ผˆ่™ฝ็„ถไบ‹ๅŽๆฅ็œ‹๏ผŒไป–็š„ๆ€€็–‘ๆ˜ฏๆ— ็–‘ๆญฃ็กฎ็š„๏ผ‰๏ผŒไบŽๆ˜ฏๅฐฑๅ†ณๅฎš๏ผŒ่ฎฉ71Bๅ’Œ135Bๆขtokenizer๏ผŒๅ› ไธบๅฐๆจกๅž‹ๅฎž้ชŒๅฎคๆ›พ็ปๅฐ่ฏ•่ฟ‡ใ€‚ๅ›ข้˜Ÿ็ผๅˆไบ†ไธคไธชtokenizer๏ผŒๅผ€ๅง‹ไบ†tokenizer็š„ๆ›ดๆขใ€‚71Bๆจกๅž‹็š„ๆ›ดๆขๅคฑ่ดฅไบ†๏ผŒ่€Œ135Bๅ› ไธบ้‡‡็”จไบ†ๆ›ด็ฒพ็ป†็š„embeddingๅˆๅง‹ๅŒ–็ญ–็•ฅ๏ผŒ็ปญ่ฎญไบ†่‡ณๅฐ‘1T็š„ๆ•ฐๆฎๅŽ่ฏ่กจๆ€ป็ฎ—ๆ›ดๆขๆˆๅŠŸ๏ผŒไฝ†ๅฏๆƒณ่€Œ็Ÿฅ๏ผŒๆ•ˆๆžœๅนถไธไผšๅ˜ๅฅฝใ€‚ - -ไบŽๆญคๅŒๆœŸ๏ผŒ้˜ฟ้‡Œๅ’Œๆ™บ่ฐฑ็ญ‰ๅ›ฝๅ†…ๅ…ถไป–ๅ…ฌๅธๅœจGPUไธŠ่ฎญ็ปƒ๏ผŒไธ”ๅทฒ็ปๆ‘ธ็ดขๅ‡บไบ†ๆญฃ็กฎ็š„ๆ–นๆณ•๏ผŒ็›˜ๅคๅ’Œ็ซžๅ“็š„ๅทฎ่ท่ถŠๆฅ่ถŠๅคงใ€‚ๅ†…้ƒจไธ€ไธช230BไปŽๅคด่ฎญ็ปƒ็š„denseๆจกๅž‹ๅˆๅ› ไธบๅ„็งๅŽŸๅ› ่ฎญ็ปƒๅคฑ่ดฅ๏ผŒๅฏผ่‡ด้กน็›ฎ็š„็Šถๅ†ตๅ‡ ไนŽ้™ทๅ…ฅ็ปๅขƒใ€‚้ขไธดๅ‡ ไธช่Š‚็‚น็š„ๅŽ‹ๅŠ›ไปฅๅŠๅ†…้ƒจๅฏน็›˜ๅค็š„ๅผบ็ƒˆ่ดจ็–‘ๆ—ถ๏ผŒๅ›ข้˜Ÿ็š„ๅฃซๆฐ”ไฝŽ่ฟทๅˆฐไบ†ๆž็‚นใ€‚ๅ›ข้˜Ÿๅœจ็ฎ—ๅŠ›ๆžๅ…ถๆœ‰้™็š„ๆ—ถๅ€™๏ผŒๅšๅ‡บไบ†ๅพˆๅคšๅŠชๅŠ›ๅ’ŒๆŒฃๆ‰Žใ€‚ๆฏ”ๅฆ‚๏ผŒๅ›ข้˜Ÿๅถ็„ถๅ‘็Žฐๅฝ“ๆ—ถ็š„38B 
moeๅนถๆฒกๆœ‰้ข„ๆœŸmoe็š„ๆ•ˆๆžœใ€‚ไบŽๆ˜ฏๅŽปๆމไบ†moeๅ‚ๆ•ฐ๏ผŒ่ฟ˜ๅŽŸไธบไบ†13B็š„denseๆจกๅž‹ใ€‚็”ฑไบŽ38B็š„moeๆบ่‡ชๅพˆๆ—ฉ็š„pangu alpha 13B๏ผŒๆžถๆž„็›ธๅฏน่ฝๅŽ๏ผŒๅ›ข้˜Ÿ่ฟ›่กŒไบ†ไธ€็ณปๅˆ—็š„ๆ“ไฝœ๏ผŒๆฏ”ๅฆ‚ๅˆ‡ๆข็ปๅฏนไฝ็ฝฎ็ผ–็ ๅˆฐrope๏ผŒๅŽปๆމbias๏ผŒๅˆ‡ๆขไธบrmsnormใ€‚ๅŒๆ—ถ้‰ดไบŽtokenizer็š„ไธ€ไบ›ๅคฑ่ดฅๅ’Œๆข่ฏ่กจ็š„็ป้ชŒ๏ผŒ่ฟ™ไธชๆจกๅž‹็š„่ฏ่กจไนŸๆ›ดๆขไธบไบ†็Ž‹ไบ‘้นค็š„ๅฐๆจกๅž‹ๅฎž้ชŒๅฎค7Bๆจกๅž‹ๆ‰€ไฝฟ็”จ็š„่ฏ่กจใ€‚ๅŽ้ข่ฟ™ไธช13Bๆจกๅž‹่ฟ›่กŒไบ†ๆ‰ฉๅขž็ปญ่ฎญ๏ผŒๅ˜ๆˆไบ†็ฌฌไบŒไปฃ38B denseๆจกๅž‹๏ผˆๅœจๅ‡ ไธชๆœˆๅ†…่ฟ™ไธชๆจกๅž‹้ƒฝๆ˜ฏไธป่ฆ็š„็›˜ๅคไธญๆกฃไฝๆจกๅž‹๏ผ‰๏ผŒๆ›พ็ปๅ…ทๆœ‰ไธ€ๅฎš็š„็ซžไบ‰ๅŠ›ใ€‚ไฝ†ๆ˜ฏ๏ผŒ็”ฑไบŽๆ›ดๅคง็š„135Bๆจกๅž‹ๆžถๆž„่ฝๅŽ๏ผŒไธ”ๆ›ดๆข่ฏ่กจๆจกๅž‹ๆŸไผคๅทจๅคง๏ผˆๅŽ็ปญๅˆ†ๆžๅ‘็Žฐๅฝ“ๆ—ถๆ›ดๆข็š„็ผๅˆ่ฏ่กจๆœ‰ๆ›ดไธฅ้‡็š„bug๏ผ‰๏ผŒ็ปญ่ฎญๅŽไนŸไธŽๅƒ้—ฎ็ญ‰ๅฝ“ๆ—ถๅ›ฝๅ†…้ข†ๅ…ˆๆจกๅž‹ๅญ˜ๅœจๅพˆๅคงๅทฎ่ทใ€‚่ฟ™ๆ—ถ็”ฑไบŽๅ†…้ƒจ็š„่ดจ็–‘ๅฃฐๅ’Œ้ข†ๅฏผ็š„ๅŽ‹ๅŠ›ไนŸ่ถŠๆฅ่ถŠๅคงใ€‚ๅ›ข้˜Ÿ็š„็Šถๆ€ๅ‡ ไนŽ้™ทๅ…ฅไบ†็ปๅขƒใ€‚ - -ๅœจ่ฟ™็งๆƒ…ๅ†ตไธ‹๏ผŒ็Ž‹ไบ‘้นคๅ’Œไป–็š„ๅฐๆจกๅž‹ๅฎž้ชŒๅฎคๅ‡บๆ‰‹ไบ†ใ€‚ไป–ไปฌๅฃฐ็งฐๆ˜ฏไปŽๆ—ง็š„135Bๅ‚ๆ•ฐ็ปงๆ‰ฟๆ”น้€ ่€Œๆฅ๏ผŒ้€š่ฟ‡่ฎญ็ปƒ็Ÿญ็Ÿญ็š„ๅ‡ ็™พBๆ•ฐๆฎ๏ผŒๅ„้กนๆŒ‡ๆ ‡ๅนณๅ‡ๆๅ‡ไบ†ๅไธช็‚นๅทฆๅณใ€‚ๅฎž้™…ไธŠ๏ผŒ่ฟ™ๅฐฑๆ˜ฏไป–ไปฌๅฅ—ๅฃณๅบ”็”จๅˆฐๅคงๆจกๅž‹็š„็ฌฌไธ€ๆฌกๆฐไฝœใ€‚ๅŽไธบ็š„ๅค–่กŒ้ข†ๅฏผๅ†…่กŒ๏ผŒไฝฟๅพ—้ข†ๅฏผๅฎŒๅ…จๅฏนไบŽ่ฟ™็งๆ‰ฏๆทก็š„ไบ‹ๆƒ…ๆฒกๆœ‰ๆฆ‚ๅฟต๏ผŒไป–ไปฌๅชไผš่ง‰ๅพ—่‚ฏๅฎšๆ˜ฏๆœ‰ไป€ไนˆ็ฎ—ๆณ•ๅˆ›ๆ–ฐใ€‚็ป่ฟ‡ๅ†…้ƒจ็š„ๅˆ†ๆž๏ผŒไป–ไปฌๅฎž้™…ไธŠๆ˜ฏไฝฟ็”จQwen 1.5 110B็ปญ่ฎญ่€Œๆฅ๏ผŒ้€š่ฟ‡ๅŠ ๅฑ‚๏ผŒๆ‰ฉๅขžffn็ปดๅบฆ๏ผŒๆทปๅŠ ็›˜ๅคpi่ฎบๆ–‡็š„ไธ€ไบ›ๆœบๅˆถๅพ—ๆฅ๏ผŒๅ‡‘ๅคŸไบ†ๅคงๆฆ‚135B็š„ๅ‚ๆ•ฐใ€‚ๅฎž้™…ไธŠ๏ผŒๆ—ง็š„135Bๆœ‰107ๅฑ‚๏ผŒ่€Œ่ฟ™ไธชๆจกๅž‹ๅชๆœ‰82ๅฑ‚๏ผŒๅ„็ง้…็ฝฎไนŸ้ƒฝไธไธ€ๆ ทใ€‚ๆ–ฐ็š„ๆฅ่ทฏไธๆ˜Ž็š„135B่ฎญ็ปƒๅฎŒๅพˆๅคšๅ‚ๆ•ฐ็š„ๅˆ†ๅธƒไนŸๅ’ŒQwen 110Bๅ‡ ไนŽไธ€ๆจกไธ€ๆ ทใ€‚่ฟžๆจกๅž‹ไปฃ็ ็š„็ฑปๅๅฝ“ๆ—ถ้ƒฝๆ˜ฏQwen๏ผŒ็”š่‡ณๆ‡’ๅพ—ๆ”นๅใ€‚ๅŽ็ปญ่ฟ™ไธชๆจกๅž‹ๅฐฑๆ˜ฏๆ‰€่ฐ“็š„135B V2ใ€‚่€Œ่ฟ™ไธชๆจกๅž‹ๅฝ“ๆ—ถไนŸๆไพ›็ป™ไบ†ๅพˆๅคšไธ‹ๆธธ๏ผŒ็”š่‡ณๅŒ…ๆ‹ฌๅค–้ƒจๅฎขๆˆทใ€‚ - -่ฟ™ไปถไบ‹ๅฏนไบŽๆˆ‘ไปฌ่ฟ™ไบ›่ฎค็œŸ่ฏšๅฎžๅšไบ‹็š„ๅŒไบ‹ไปฌๅธฆๆฅไบ†ๅทจๅคง็š„ๅ†ฒๅ‡ป๏ผŒๅ†…้ƒจๅพˆๅคšไบบๅ…ถๅฎž้ƒฝ็Ÿฅ้“่ฟ™ไปถไบ‹๏ผŒ็”š่‡ณๅŒ…ๆ‹ฌ็ปˆ็ซฏๅ’ŒๅŽไธบไบ‘ใ€‚ๆˆ‘ไปฌ้ƒฝๆˆ็งฐไปฅๅŽๅˆซๅซ็›˜ๅคๆจกๅž‹ไบ†๏ผŒๅซๅƒๅคๅงใ€‚ๅฝ“ๆ—ถๅ›ข้˜Ÿๆˆๅ‘˜ๅฐฑๆƒณๅ‘bcgไธพๆŠฅไบ†๏ผŒๆฏ•็ซŸ่ฟ™ๅทฒ็ปๆ˜ฏ้‡ๅคง็š„ไธšๅŠก้€ ๅ‡ไบ†ใ€‚ไฝ†ๆ˜ฏๅŽ้ขๆฎ่ฏด่ขซ้ข†ๅฏผๆ‹ฆไบ†ไธ‹ๆฅ๏ผŒๅ› ไธบๆ›ด้ซ˜็บงๅˆซ็š„้ข†ๅฏผ๏ผˆๆฏ”ๅฆ‚ๅงš่€ๅธˆ๏ผŒไปฅๅŠๅฏ่ƒฝ็†Šๆ€ปๅ’ŒๆŸฅ่€๏ผ‰ๅ…ถๅฎžๅŽ้ขไนŸ็Ÿฅ้“ไบ†๏ผŒไฝ†ๆ˜ฏๅนถไธ็ฎก๏ผŒๅ› ไธบ้€š่ฟ‡ๅฅ—ๅฃณๆ‹ฟๅ‡บๅฅฝ็š„็ป“ๆžœ๏ผŒๅฏนไป–ไปฌไนŸๆ˜ฏๆœ‰ๅˆฉ็š„ใ€‚่ฟ™ไปถไบ‹ไฝฟๅพ—ๅฝ“ๆ—ถๅ›ข้˜Ÿๅ‡ ไฝๆœ€ๅผบ็š„ๅŒไบ‹ๅผ€ๅง‹ๅฟƒ็ฐๆ„ๅ†ท๏ผŒ็ฆป่Œ่ท‘่ทฏไนŸ้€ๆธๆˆไธบๆŒ‚ๅœจๅ˜ด่พน็š„ไบ‹ใ€‚ - -ๆญคๆ—ถ๏ผŒ็›˜ๅคไผผไนŽ่ฟŽๆฅไบ†่ฝฌๆœบใ€‚็”ฑไบŽๅ‰้ขๆ‰€่ฟฐ็š„่ฟ™ไบ›็›˜ๅคๆจกๅž‹ๅŸบๆœฌ้ƒฝๆ˜ฏ็ปญ่ฎญๅ’Œๆ”น้€ ่€Œๆฅ๏ผŒๅฝ“ๆ—ถ่ฏบไบšๅฎŒๅ…จๆฒกๆœ‰ๆŽŒๆกไปŽๅคด่ฎญ็ปƒ็š„ๆŠ€ๆœฏ๏ผŒไฝ•ๅ†ต่ฟ˜ๆ˜ฏๅœจๆ˜‡่…พ็š„NPUไธŠ่ฟ›่กŒ่ฎญ็ปƒใ€‚ๅœจๅฝ“ๆ—ถๅ›ข้˜Ÿ็š„ๆ ธๅฟƒๆˆๅ‘˜็š„ๆžๅŠ›ไบ‰ๅ–ไธ‹๏ผŒ็›˜ๅคๅผ€ๅง‹ไบ†็ฌฌไธ‰ไปฃๆจกๅž‹็š„่ฎญ็ปƒ๏ผŒไป˜ๅ‡บไบ†ๅทจๅคง็š„ๅŠชๅŠ›ๅŽ๏ผŒๅœจๆ•ฐๆฎๆžถๆž„ๅ’Œ่ฎญ็ปƒ็ฎ—ๆณ•ๆ–น้ข้ƒฝไธŽไธš็•Œ้€ๆธๆŽฅ่ฝจ๏ผŒ่€Œ่ฟ™ๅ…ถไธญ็š„่‰ฐ่พ›ๅ’Œๅฐๆจกๅž‹ๅฎž้ชŒๅฎค็š„ไบบไธ€็‚นๅ…ณ็ณป้ƒฝๆฒกๆœ‰ใ€‚ - -ไธ€ๅผ€ๅง‹ๅ›ข้˜Ÿๆˆๅ‘˜ๆฏซๆ— ไฟกๅฟƒ๏ผŒๅชไปŽไธ€ไธช13B็š„ๆจกๅž‹ๅผ€ๅง‹่ฎญ็ปƒ๏ผŒไฝ†ๆ˜ฏๅŽ้ขๅ‘็Žฐๆ•ˆๆžœ่ฟ˜ไธ้”™๏ผŒไบŽๆ˜ฏ่ฟ™ไธชๆจกๅž‹ๅŽ็ปญๅ†ๆฌก่ฟ›่กŒไบ†ไธ€ๆฌกๅ‚ๆ•ฐๆ‰ฉๅขž๏ผŒๅ˜ๆˆไบ†็ฌฌไธ‰ไปฃ็š„38B๏ผŒไปฃๅท38B 
V3ใ€‚ๆƒณๅฟ…ๅพˆๅคšไบงๅ“็บฟ็š„ๅ…„ๅผŸ้ƒฝๅฏน่ฟ™ไธชๆจกๅž‹ๅพˆ็†Ÿๆ‚‰ใ€‚ๅฝ“ๆ—ถ่ฟ™ไธชๆจกๅž‹็š„tokenizerๆ˜ฏๅŸบไบŽllama็š„่ฏ่กจ่ฟ›่กŒๆ‰ฉๅฑ•็š„๏ผˆไนŸๆ˜ฏไธš็•Œๅธธ่ง็š„ๅšๆณ•๏ผ‰ใ€‚่€Œๅฝ“ๆ—ถ็Ž‹ไบ‘้นค็š„ๅฎž้ชŒๅฎคๅšๅ‡บๆฅไบ†ๅฆไธ€ไธช่ฏ่กจ๏ผˆไนŸๅฐฑๆ˜ฏๅŽ็ปญpangu็ณปๅˆ—็š„่ฏ่กจ๏ผ‰ใ€‚ๅฝ“ๆ—ถไธคไธช่ฏ่กจ่ฟ˜่ขซ่ฟซ่ฟ›่กŒไบ†ไธ€ๆฌก่ต›้ฉฌ๏ผŒๆœ€็ปˆๆฒกๆœ‰ๆ˜Žๆ˜พ็š„ๅฅฝๅ็ป“่ฎบใ€‚ไบŽๆ˜ฏ๏ผŒ้ข†ๅฏผๅฝ“ๅณๅ†ณๅฎš๏ผŒๅบ”่ฏฅ็ปŸไธ€่ฏ่กจ๏ผŒไฝฟ็”จ็Ž‹ไบ‘้นคไป–ไปฌ็š„ใ€‚ไบŽๆ˜ฏ๏ผŒๅœจๅŽ็ปญไปŽๅคด่ฎญ็ปƒ็š„135B V3๏ผˆไนŸๅฐฑๆ˜ฏๅฏนๅค–็š„Pangu Ultra๏ผ‰๏ผŒไพฟๆ˜ฏ้‡‡็”จไบ†่ฟ™ไธชtokenizerใ€‚่ฟ™ไนŸ่งฃ้‡Šไบ†ๅพˆๅคšไฝฟ็”จๆˆ‘ไปฌๆจกๅž‹็š„ๅ…„ๅผŸ็š„็–‘ๆƒ‘๏ผŒไธบไป€ไนˆๅฝ“ๆ—ถๅŒไธบV3ไปฃ็š„ไธคไธชไธๅŒๆกฃไฝ็š„ๆจกๅž‹๏ผŒไผšไฝฟ็”จไธๅŒ็š„tokenizerใ€‚ - - -ๆˆ‘ไปฌๆ‰“ๅฟƒ็œผ้‡Œ่ง‰ๅพ—๏ผŒ135B V3ๆ˜ฏๆˆ‘ไปฌๅ››็บตๅ›ข้˜Ÿๅฝ“ๆ—ถ็š„้ช„ๅ‚ฒใ€‚่ฟ™ๆ˜ฏ็ฌฌไธ€ไธช็œŸๆญฃๆ„ไน‰ไธŠ็š„๏ผŒๅŽไธบๅ…จๆ ˆ่‡ช็ ”๏ผŒๆญฃ็ปไปŽๅคด่ฎญ็ปƒ็š„ๅƒไบฟ็บงๅˆซ็š„ๆจกๅž‹๏ผŒไธ”ๆ•ˆๆžœไธŽ24ๅนดๅŒๆœŸ็ซžๅ“ๅฏๆฏ”็š„ใ€‚ๅ†™ๅˆฐ่ฟ™้‡Œๆˆ‘ๅทฒ็ป็ƒญๆณช็›ˆ็œถ๏ผŒๅคชไธๅฎนๆ˜“ไบ†ใ€‚ๅฝ“ๆ—ถไธบไบ†็จณๅฎš่ฎญ็ปƒ๏ผŒๅ›ข้˜Ÿๅšไบ†ๅคง้‡ๅฎž้ชŒๅฏนๆฏ”๏ผŒๅนถไธ”ๅคšๆฌกๅœจๆจกๅž‹ๆขฏๅบฆๅ‡บ็Žฐๅผ‚ๅธธ็š„ๆ—ถๅ€™่ฟ›่กŒๅŠๆ—ถๅ›ž้€€้‡ๅฏใ€‚่ฟ™ไธชๆจกๅž‹็œŸๆญฃๅšๅˆฐไบ†ๅŽ้ขๆŠ€ๆœฏๆŠฅๅ‘Šๆ‰€่ฏด็š„่ฎญ็ปƒๅ…จ็จ‹ๆฒกๆœ‰ไธ€ไธชloss spikeใ€‚ๆˆ‘ไปฌๅ…‹ๆœไบ†ไธ็Ÿฅ้“ๅคšๅฐ‘ๅ›ฐ้šพ๏ผŒๆˆ‘ไปฌๅšๅˆฐไบ†๏ผŒๆˆ‘ไปฌๆ„ฟ็”จ็”Ÿๅ‘ฝๅ’Œ่ฃ่ช‰ไฟ่ฏ่ฟ™ไธชๆจกๅž‹่ฎญ็ปƒ็š„็œŸๅฎžๆ€งใ€‚ๅคšๅฐ‘ไธชๅ‡Œๆ™จ๏ผŒๆˆ‘ไปฌไธบไบ†ๅฎƒ็š„่ฎญ็ปƒ่€Œไธ็œ ใ€‚ๅœจ่ขซๅ†…้ƒจๅฟƒๅฃฐ้ช‚็š„ไธ€ๆ–‡ไธๅ€ผ็š„ๆ—ถๅ€™๏ผŒๆˆ‘ไปฌๆœ‰ๅคšไนˆไธ็”˜๏ผŒๆœ‰ๅคšๅฐ‘็š„ๅง”ๅฑˆ๏ผŒๆˆ‘ไปฌๆŒบไฝไบ†ใ€‚ - -ๆˆ‘ไปฌ่ฟ™ๅธฎไบบๆ˜ฏ็œŸ็š„ๅœจไธบๆ‰“็ฃจๅ›ฝไบง็ฎ—ๅŠ›ๅบ•ๅบง็‡ƒ็ƒง่‡ชๅทฑ็š„้’ๆ˜ฅๅ•Šโ€ฆโ€ฆๅฎขๅฑ…ไป–ไนก๏ผŒๆˆ‘ไปฌๆ”พๅผƒไบ†ๅฎถๅบญ๏ผŒๆ”พๅผƒไบ†ๅ‡ๆœŸ๏ผŒๆ”พๅผƒไบ†ๅฅๅบท๏ผŒๆ”พๅผƒไบ†ๅจฑไน๏ผŒๆŠ›ๅคด้ข…ๆด’็ƒญ่ก€๏ผŒๅ…ถไธญ็š„่‰ฐ่พ›ไธŽๅ›ฐ่‹ฆ๏ผŒๅฏฅๅฏฅๆ•ฐ็ฌ”ไธ่ถณไปฅๆฆ‚ๆ‹ฌๅ…ถไธ‡ไธ€ใ€‚ๅœจๅ„็งๅŠจๅ‘˜ๅคงไผšไธŠ๏ผŒๅฝ“ๆ—ถๅฃๅทไธญๅ–Šๅ‡บ็š„็›˜ๅคๅฟ…่ƒœ๏ผŒๅŽไธบๅฟ…่ƒœ๏ผŒๆˆ‘ไปฌๅฟƒ้‡Œๆ˜ฏ็œŸ็š„ๆทฑๆทฑ่ขซๆ„ŸๅŠจใ€‚ - -็„ถ่€Œ๏ผŒๆˆ‘ไปฌ็š„ๆ‰€ๆœ‰่พ›่‹ฆ็š„ๆˆๆžœ๏ผŒ็ปๅธธ่ขซๅฐๆจกๅž‹ๅฎž้ชŒๅฎค่ฝป้ฃ˜้ฃ˜็š„ๆ‹ฟ่ตฐไบ†ใ€‚ๆ•ฐๆฎ๏ผŒ็›ดๆŽฅ่ฆ่ตฐใ€‚ไปฃ็ ๏ผŒ็›ดๆŽฅ่ฆ่ตฐ๏ผŒ่ฟ˜่ฆๆฑ‚ๆˆ‘ไปฌ้…ๅˆ้€‚้…ๅˆฐ่ƒฝไธ€้”ฎ่ฟ่กŒใ€‚ๆˆ‘ไปฌๅฝ“ๆ—ถๆˆ็งฐๅฐๆจกๅž‹ๅฎž้ชŒๅฎคไธบ็‚น้ผ ๆ ‡ๅฎž้ชŒๅฎคใ€‚ๆˆ‘ไปฌไป˜ๅ‡บ่พ›่‹ฆ๏ผŒไป–ไปฌๅ–ๅพ—่ฃ่€€ใ€‚ๆžœ็„ถๅบ”ไบ†้‚ฃๅฅ่ฏ๏ผŒไฝ ๅœจ่ดŸ้‡ๅ‰่กŒๆ˜ฏๅ› ไธบๆœ‰ไบบๆ›ฟไฝ ๅฒๆœˆ้™ๅฅฝใ€‚ๅœจ่ฟ™็งๆƒ…ๅ†ตไธ‹๏ผŒ่ถŠๆฅ่ถŠๅคš็š„ๆˆ˜ๅ‹ๅ†ไนŸๅšๆŒไธไธ‹ๅŽปไบ†๏ผŒ้€‰ๆ‹ฉไบ†็ฆปๅผ€ใ€‚็œ‹ๅˆฐ่บซ่พน้‚ฃไบ›ไผ˜็ง€็š„ๅŒไบ‹ไธ€ไธชไธช็ฆป่Œ๏ผŒๆˆ‘็š„ๅ†…ๅฟƒๅˆๆ„Ÿๅนๅˆ้šพ่ฟ‡ใ€‚ๅœจ่ฟ™็งไฝœๆˆ˜ไธ€ๆ ท็š„็Žฏๅขƒไธ‹๏ผŒๆˆ‘ไปฌๆฏ”่ตทๅŒไบ‹ๆฅ่ฏดๆ›ดๅƒๆ˜ฏๆˆ˜ๅ‹ใ€‚ไป–ไปฌๅœจๆŠ€ๆœฏไธŠไนŸๆœ‰ๆ— ๆ•ฐๅ€ผๅพ—ๆˆ‘ๅญฆไน ็š„ๅœฐๆ–น๏ผŒๅ ช็งฐ่‰ฏๅธˆใ€‚็œ‹ๅˆฐไป–ไปฌๅŽปไบ†่ฏธๅฆ‚ๅญ—่Š‚Seed๏ผŒDeepseek๏ผŒๆœˆไน‹ๆš—้ข๏ผŒ่…พ่ฎฏๅ’Œๅฟซๆ‰‹็ญ‰็ญ‰ๅพˆๅคšๅ‡บ่‰ฒ็š„ๅ›ข้˜Ÿ๏ผŒๆˆ‘ๆ‰“ๅฟƒ็œผ้‡Œไธบไป–ไปฌ้ซ˜ๅ…ดๅ’Œ็ฅ็ฆ๏ผŒ่„ฑ็ฆปไบ†่ฟ™ไธช่พ›่‹ฆๅด่‚ฎ่„็š„ๅœฐๆ–นใ€‚ๆˆ‘่‡ณไปŠ่ฟ˜ๅฏนไธ€ไฝ็ฆป่ŒๅŒไบ‹็š„่ฏ่ฎฐๅฟ†็Šนๆ–ฐ๏ผŒta่ฏด๏ผšโ€œๆฅ่ฟ™้‡Œๆ˜ฏๆˆ‘ๆŠ€ๆœฏ็”Ÿๆถฏไธญ็š„่€ป่พฑ๏ผŒๅœจ่ฟ™้‡Œๅ†ๅ‘†ๆฏไธ€ๅคฉ้ƒฝๆ˜ฏๆตช่ดน็”Ÿๅ‘ฝโ€ใ€‚่ฏ่™ฝ้šพๅฌๅด่ฎฉๆˆ‘ๆ— ่จ€ไปฅๅฏนใ€‚ๆˆ‘ๆ‹…ๅฟƒๆˆ‘่‡ชๅทฑๆŠ€ๆœฏๆ–น้ข็š„็งฏ็ดฏไธ่ถณ๏ผŒไปฅๅŠๆฒกๆณ•้€‚ๅบ”ไบ’่”็ฝ‘ๅ…ฌๅธ้ซ˜ๆท˜ๆฑฐ็š„็Žฏๅขƒ๏ผŒ่ฎฉๆˆ‘ๅคšๆฌกๆƒณ็ฆป่Œ็š„ๅฟƒๅง‹็ปˆๆฒกๆœ‰่ฟˆๅ‡บ่ฟ™ไธ€ๆญฅใ€‚ - -็›˜ๅค้™คไบ†denseๆจกๅž‹๏ผŒๅŽ็ปญไนŸๅฏๅŠจไบ†moe็š„ๆŽข็ดขใ€‚ไธ€ๅผ€ๅง‹่ฎญ็ปƒ็š„ๆ˜ฏไธ€ไธช224B็š„moeๆจกๅž‹ใ€‚่€ŒไธŽไน‹ๅนณ่กŒ็š„๏ผŒๅฐๆจกๅž‹ๅฎž้ชŒๅฎคไนŸๅผ€ๅฏไบ†็ฌฌไบŒๆฌกไธป่ฆ็š„ๅฅ—ๅฃณ่กŒๅŠจ๏ผˆๆฌก่ฆ็š„ๆ’ๆ›ฒๅฏ่ƒฝ่ฟ˜ๅŒ…ๆ‹ฌไธ€ไบ›ๅˆซ็š„ๆจกๅž‹๏ผŒๆฏ”ๅฆ‚mathๆจกๅž‹๏ผ‰๏ผŒๅณ่ฟ™ๆฌกๆตไผ ็”šๅนฟ็š„pangu pro moe 72Bใ€‚่ฟ™ไธชๆจกๅž‹ๅ†…้ƒจ่‡ช็งฐๆ˜ฏไปŽๅฐๆจกๅž‹ๅฎž้ชŒๅฎค็š„7Bๆ‰ฉๅขžไธŠๆฅ็š„๏ผˆๅฐฑ็ฎ—ๅฆ‚ๆญค๏ผŒ่ฟ™ไนŸไธŽๆŠ€ๆœฏๆŠฅๅ‘Šไธ็ฌฆ๏ผŒไฝ•ๅ†ตๆ˜ฏๅฅ—ๅฃณqwen 
2.5็š„14b็ปญ่ฎญ๏ผ‰ใ€‚่ฟ˜่ฎฐๅพ—ไป–ไปฌ่ฎญไบ†ๆฒกๅ‡ ๅคฉ๏ผŒๅ†…้ƒจ็š„่ฏ„ๆต‹ๅฐฑ็ซ‹ๅˆป่ฟฝไธŠไบ†ๅฝ“ๆ—ถ็š„38B V3ใ€‚AI็ณป็ปŸๅฎž้ชŒๅฎคๅพˆๅคšๅ…„ๅผŸๅ› ไธบ้œ€่ฆ้€‚้…ๆจกๅž‹๏ผŒ้ƒฝ็Ÿฅ้“ไป–ไปฌ็š„ๅฅ—ๅฃณ่กŒๅŠจ๏ผŒๅชๆ˜ฏ่ฟซไบŽๅ„็งๅŽŸๅ› ๏ผŒๆ— ๆณ•ไผธๅผ ๆญฃไน‰ใ€‚ๅฎž้™…ไธŠ๏ผŒๅฏนไบŽๅŽ็ปญ่ฎญไบ†ๅพˆไน…ๅพˆไน…็š„่ฟ™ไธชๆจกๅž‹๏ผŒHonestagi่ƒฝๅคŸๅˆ†ๆžๅ‡บ่ฟ™ไธช้‡็บง็š„็›ธไผผๆ€งๆˆ‘ๅทฒ็ปๅพˆ่ฏงๅผ‚ไบ†๏ผŒๅ› ไธบ่ฟ™ไธชๆจกๅž‹ไธบไบ†็ปญ่ฎญๆด—ๅ‚ๆ•ฐ๏ผŒๆ‰€ไป˜ๅ‡บ็š„็ฎ—ๅŠ›็”š่‡ณๆ—ฉๅฐฑ่ถณๅคŸไปŽๅคด่ฎญไธ€ไธชๅŒๆกฃไฝ็š„ๆจกๅž‹ไบ†ใ€‚ๅฌๅŒไบ‹่ฏดไป–ไปฌไธบไบ†ๆด—ๆމๅƒ้—ฎ็š„ๆฐดๅฐ๏ผŒ้‡‡ๅ–ไบ†ไธๅฐ‘ๅŠžๆณ•๏ผŒ็”š่‡ณๅŒ…ๆ‹ฌๆ•…ๆ„่ฎญไบ†่„ๆ•ฐๆฎใ€‚่ฟ™ไนŸไธบๅญฆๆœฏ็•Œ็ ”็ฉถๆจกๅž‹่ก€็ผ˜ๆไพ›ไบ†ไธ€ไธชๅ‰ๆ‰€ๆœชๆœ‰็š„็‰นๆฎŠๆจก่Œƒๅงใ€‚ไปฅๅŽๆ–ฐ็š„่ก€็ผ˜ๆ–นๆณ•ๆๅ‡บๅฏไปฅๆ‹ฟๅ‡บๆฅๆบœๆบœใ€‚ - -24ๅนดๅบ•ๅ’Œ25ๅนดๅˆ๏ผŒๅœจDeepseek v3ๅ’Œr1ๅ‘ๅธƒไน‹ๅŽ๏ผŒ็”ฑไบŽๅ…ถๆƒŠ่‰ณ็š„ๆŠ€ๆœฏๆฐดๅนณ๏ผŒๅ›ข้˜Ÿๅ—ๅˆฐไบ†ๅทจๅคง็š„ๅ†ฒๅ‡ป๏ผŒไนŸๅ—ๅˆฐไบ†ๆ›ดๅคง็š„่ดจ็–‘ใ€‚ไบŽๆ˜ฏไธบไบ†็ดง่ทŸๆฝฎๆต๏ผŒ็›˜ๅคๆจกไปฟDeepseek็š„ๆจกๅž‹ๅฐบๅฏธ๏ผŒๅผ€ๅฏไบ†718B moe็š„่ฎญ็ปƒใ€‚่ฟ™ไธชๆ—ถๅ€™๏ผŒๅฐๆจกๅž‹ๅฎž้ชŒๅฎคๅ†ๆฌกๅ‡บๆ‰‹ไบ†ใ€‚ไป–ไปฌ้€‰ๆ‹ฉไบ†ๅฅ—ๅฃณDeepseekv3็ปญ่ฎญใ€‚ไป–ไปฌ้€š่ฟ‡ๅ†ปไฝDeepseekๅŠ ่ฝฝ็š„ๅ‚ๆ•ฐ๏ผŒ่ฟ›่กŒ่ฎญ็ปƒใ€‚่ฟžไปปๅŠกๅŠ ่ฝฝckpt็š„็›ฎๅฝ•้ƒฝๆ˜ฏdeepseekv3๏ผŒๆ”น้ƒฝไธๆ”น๏ผŒไฝ•ๅ…ถๅšฃๅผ ๏ผŸไธŽไน‹็›ธๅ๏ผŒไธ€ไบ›ๆœ‰็œŸๆญฃๆŠ€ๆœฏไฟกไปฐ็š„ๅŒไบ‹๏ผŒๅœจไปŽๅคด่ฎญ็ปƒๅฆไธ€ไธช718B็š„moeใ€‚ไฝ†ๅ…ถไธญๅ‡บ็Žฐไบ†ๅ„็งๅ„ๆ ท็š„้—ฎ้ข˜ใ€‚ไฝ†ๆ˜ฏๅพˆๆ˜พ็„ถ๏ผŒ่ฟ™ไธชๆจกๅž‹ๆ€Žไนˆๅฏ่ƒฝๆฏ”็›ดๆŽฅๅฅ—ๅฃณ็š„ๅฅฝๅ‘ข๏ผŸๅฆ‚ๆžœไธๆ˜ฏๅ›ข้˜ŸleaderๅšๆŒ๏ผŒๆ—ฉๅฐฑ่ขซๅซๅœไบ†ใ€‚ - -ๅŽไธบ็š„ๆต็จ‹็ฎก็†ไน‹็น้‡๏ผŒไธฅ้‡ๆ‹–็ดฏไบ†ๅคงๆจกๅž‹็š„็ ”ๅ‘่Š‚ๅฅ๏ผŒไพ‹ๅฆ‚็‰ˆๆœฌ็ฎก็†๏ผŒๆจกๅž‹่ก€็ผ˜๏ผŒๅ„็งๆต็จ‹ๅŒ–๏ผŒๅ„็งๅฏ่ฟฝๆบฏใ€‚่ฎฝๅˆบ็š„ๆ˜ฏ๏ผŒๅฐๆจกๅž‹ๅฎž้ชŒๅฎค็š„ๆจกๅž‹ไผผไนŽไปŽๆฅไธๅ—่ฟ™ไบ›ๆต็จ‹็š„็บฆๆŸ๏ผŒๆƒณๅฅ—ๅฃณๅฐฑๅฅ—ๅฃณ๏ผŒๆƒณ็ปญ่ฎญๅฐฑ็ปญ่ฎญ๏ผŒ็ฎ—ๅŠ›ๆบๆบไธๆ–ญ็š„ไผธๆ‰‹ๆ‹ฟ่ตฐใ€‚่ฟ™็งๅผบ็ƒˆๅˆฐ่ฟ‘ไนŽ้ญ”ๅนป็š„ๅฏนๆฏ”๏ผŒ่ฏดๆ˜Žไบ†ๅฝ“ๅ‰ๆต็จ‹็ฎก็†็š„ๆƒ…ๅ†ต๏ผšๅช่ฎธๅทžๅฎ˜ๆ”พ็ซ๏ผŒไธ่ฎธ็™พๅง“็‚น็ฏใ€‚ไฝ•ๅ…ถๅฏ็ฌ‘๏ผŸไฝ•ๅ…ถๅฏๆ‚ฒ๏ผŸไฝ•ๅ…ถๅฏๆถ๏ผŸไฝ•ๅ…ถๅฏ่€ป๏ผ - -HonestAGI็š„ไบ‹ๆƒ…ๅ‡บๆฅๅŽ๏ผŒๅ†…้ƒจ่ฎฉๅคงๅฎถไธๅœ็š„็ ”่ฎจๅˆ†ๆž๏ผŒๅฆ‚ไฝ•ๅ…ฌๅ…ณๅ’Œโ€œๅ›žๅบ”โ€ใ€‚่ฏš็„ถ๏ผŒ่ฟ™ไธชๅŽŸๆ–‡็š„ๅˆ†ๆžไนŸ่ฎธไธๅคŸๆœ‰ๅŠ›๏ผŒ็ป™ไบ†็Ž‹ไบ‘้นคไธŽๅฐๆจกๅž‹ๅฎž้ชŒๅฎคไป–ไปฌ็‹ก่พฉๅ’Œ้ข ๅ€’้ป‘็™ฝ็š„ๆœบไผšใ€‚ไธบๆญค๏ผŒ่ฟ™ไธคๅคฉๆˆ‘ๅ†…ๅฟƒๆ„Ÿๅˆฐไฝœๅ‘•๏ผŒๆ—ถๆ—ถๆ€€็–‘่‡ชๅทฑ็š„ไบบ็”Ÿๆ„ไน‰ไปฅๅŠ่‹ๅคฉๆ— ็œผใ€‚ๆˆ‘ไธๅฅ‰้™ชไบ†๏ผŒๆˆ‘่ฆ็ฆป่Œไบ†๏ผŒๅŒๆ—ถๆˆ‘ไนŸๅœจ็”ณ่ฏทไปŽ็›˜ๅค้ƒจๅˆ†ๆŠ€ๆœฏๆŠฅๅ‘Š็š„ไฝœ่€…ๅๅ•ไธญ็งป้™คใ€‚ๆ›พ็ปๅœจ่ฟ™ไบ›ๆŠ€ๆœฏๆŠฅๅ‘ŠไธŠ็ฝฒๅๆ˜ฏๆˆ‘ไธ€็”Ÿ้ƒฝๆ— ๆณ•ๆŠน้™ค็š„ๆฑก็‚นใ€‚ๅฝ“ๆ—ถๆˆ‘ๆฒกๆƒณๅˆฐ๏ผŒไป–ไปฌ็ซŸ็„ถ็Œ–็‹‚ๅˆฐๆ•ขๅผ€ๆบใ€‚ๆˆ‘ๆฒกๆƒณๅˆฐ๏ผŒไป–ไปฌๆ•ขๅฆ‚ๆญคๆ„šๅผ„ไธ–ไบบ๏ผŒๅคง่‚†ๅฎฃๅ‘ใ€‚ๅฝ“ๆ—ถ๏ผŒๆˆ‘ไนŸ่ฎธๆ˜ฏๅญ˜ไบ†ไพฅๅนธๅฟƒ็†๏ผŒๆฒกๆœ‰ๆ‹’็ป็ฝฒๅใ€‚ๆˆ‘็›ธไฟกๅพˆๅคšๆ‰Žๅฎžๅšไบ‹็š„ๆˆ˜ๅ‹๏ผŒไนŸๅชๆ˜ฏ่ขซ่ฟซไธŠไบ†่ดผ่ˆน๏ผŒๆˆ–่€…ไธ็Ÿฅๆƒ…ใ€‚ไฝ†่ฟ™ไปถไบ‹ๅทฒ็ปๆ— ๆณ•ๆŒฝๅ›ž๏ผŒๆˆ‘ๅธŒๆœ›ๆˆ‘็š„ไฝ™็”Ÿ่ƒฝๅคŸๅšๆŒๆ‰Žๅฎžๅš็œŸๆญฃๆœ‰ๆ„ไน‰็š„ไบ‹๏ผŒไธบๆˆ‘ๅฝ“ๆ—ถ็š„่ฝฏๅผฑๅ’Œไธๅšๅฎš่ตŽ็ฝชใ€‚ - -ๆทฑๅคœๅ†™ๅˆฐ่ฟ™้‡Œ๏ผŒๆˆ‘ๅทฒ็ปๆณชๆตๆปก้ข๏ผŒๆณฃไธๆˆๅฃฐใ€‚่ฟ˜่ฎฐๅพ—ไธ€ไบ›ๅ‡บ่‰ฒ็š„ๅŒไบ‹็ฆป่Œๆ—ถ๏ผŒๆˆ‘่‹ฆ็ฌ‘้—ฎไป–ไปฌ่ฆไธ่ฆๅ‘ไธช้•ฟ้•ฟ็š„ๅฟƒๅฃฐๆƒฏไพ‹ๅธ–๏ผŒๆญ้œฒไธ€ไธ‹็Žฐ็Šถใ€‚ๅฏนๆ–น่ฏด๏ผšไธไบ†๏ผŒๆตช่ดนๆ—ถ้—ด๏ผŒ่€Œไธ”ๆˆ‘ไนŸๆ€•ๆญ้œฒๅ‡บๆฅไฝ ไปฌ่ฟ‡็š„ๆ›ด็ณŸใ€‚ๆˆ‘ๅฝ“ๆ—ถไธ€ไธ‹้ปฏ็„ถ็ฅžไผค๏ผŒๅ› ไธบๆ›พ็ปๅ…ฑๅŒไธบไบ†็†ๆƒณๅฅ‹ๆ–—่ฟ‡็š„ๆˆ˜ๅ‹ๅทฒ็ปๅฝปๅบ•ๅฏนๅŽไธบๅฝปๅบ•็ฐๅฟƒไบ†ใ€‚ๅฝ“ๆ—ถๅคงๅฎถ่ฐƒไพƒ๏ผŒๆˆ‘ไปฌ็”จ็€ๅฝ“ๅนดๅ…ฑไบงๅ…š็š„ๅฐ็ฑณๅŠ ๆญฅๆžช๏ผŒ็ป„็ป‡ๅดๆœ‰็€ๅ ชๆฏ”ๅฝ“ๅนดๅ›ฝๆฐ‘ๅ…š็š„ไฝœ้ฃŽใ€‚ - -ๆ›พๅ‡ ไฝ•ๆ—ถ๏ผŒๆˆ‘ไธบๆˆ‘ไปฌ็”จ็€ๅฐ็ฑณๅŠ ๆญฅๆžชๆ‰“่ดฅๆด‹ๆžชๆด‹็‚ฎ่€Œ่‡ช่ฑชใ€‚ - -็Žฐๅœจ๏ผŒๆˆ‘็ดฏไบ†๏ผŒๆˆ‘ๆƒณๆŠ•้™ใ€‚ - 
-ๅ…ถๅฎžๆ—ถ่‡ณไปŠๆ—ฅ๏ผŒๆˆ‘่ฟ˜ๆ˜ฏ็œŸๅฟƒๅธŒๆœ›ๅŽไธบ่ƒฝ่ฎค็œŸๅธๅ–ๆ•™่ฎญ๏ผŒ่ƒฝๅšๅฅฝ็›˜ๅค๏ผŒๆŠŠ็›˜ๅคๅšๅˆฐไธ–็•Œไธ€ๆต๏ผŒๆŠŠๆ˜‡่…พๅ˜ๆˆ่‹ฑไผŸ่พพ็š„ๆฐดๅนณใ€‚ๅ†…้ƒจ็š„ๅŠฃๅธ้ฉฑ้€่‰ฏๅธ๏ผŒไฝฟๅพ—่ฏบไบšไนƒ่‡ณๅŽไธบๅœจ็Ÿญๆ—ถ้—ดๅ†…ๆ€ฅๅ‰งๆตๅคฑไบ†ๅคง้‡ๅ‡บ่‰ฒ็š„ๅคงๆจกๅž‹ไบบๆ‰ใ€‚็›ธไฟกไป–ไปฌไนŸๆญฃๅœจๅฆ‚Deepseek็ญ‰ๅ„ไธชๅ›ข้˜Ÿ้—ช่€€็€๏ผŒๆ–ฝๅฑ•็€ไป–ไปฌ็š„ๆŠฑ่ดŸๆ‰ๅŽ๏ผŒไธบไธญ็พŽๅœจAI็š„ๆฟ€็ƒˆ็ซž่ต›ไธญๅฅ‰็ŒฎๅŠ›้‡ใ€‚ๆˆ‘ๆ—ถๅธธๆ„Ÿๅน๏ผŒๅŽไธบไธๆ˜ฏๆฒกๆœ‰ไบบๆ‰๏ผŒ่€Œๆ˜ฏๆ นๆœฌไธ็Ÿฅ้“ๆ€Žไนˆ็•™ไฝไบบๆ‰ใ€‚ๅฆ‚ๆžœ็ป™่ฟ™ไบ›ไบบๅˆ้€‚็š„็Žฏๅขƒ๏ผŒๅˆ้€‚็š„่ต„ๆบ๏ผŒๆ›ดๅฐ‘็š„ๆžท้”๏ผŒๆ›ดๅฐ‘็š„ๆ”ฟๆฒปๆ–—ไบ‰๏ผŒ็›˜ๅคไฝ•ๆ„ไธๆˆ๏ผŸ - -ๆœ€ๅŽ๏ผšๆˆ‘ไปฅ็”Ÿๅ‘ฝ๏ผŒไบบๆ ผๅ’Œ่ฃ่ช‰ๅ‘่ช“๏ผŒๆˆ‘ๅ†™็š„ไปฅไธŠๆ‰€ๆœ‰ๅ†…ๅฎนๅ‡ไธบ็œŸๅฎž๏ผˆ่‡ณๅฐ‘ๅœจๆˆ‘ๆœ‰้™็š„่ฎค็Ÿฅ่Œƒๅ›ดๅ†…๏ผ‰ใ€‚ๆˆ‘ๆฒกๆœ‰้‚ฃไนˆ้ซ˜็š„ๆŠ€ๆœฏๆฐดๅนณไปฅๅŠๆœบไผšๅŽปๅš่ฏฆๅฐฝๆ‰Žๅฎž็š„ๅˆ†ๆž๏ผŒไนŸไธๆ•ข็›ดๆŽฅ็”จๅ†…้ƒจ่ฎฐๅฝ•ไธพ่ฏ๏ผŒๆ€•ๅ› ไธบไฟกๆฏๅฎ‰ๅ…จๆŠ“ๅˆฐใ€‚ไฝ†ๆ˜ฏๆˆ‘็›ธไฟกๆˆ‘ๅพˆๅคšๆ›พ็ป็š„ๆˆ˜ๅ‹๏ผŒไผšไธบๆˆ‘ไฝœ่ฏใ€‚ๅœจๅŽไธบๅ†…้ƒจ็š„ๅ…„ๅผŸ๏ผŒๅŒ…ๆ‹ฌๆˆ‘ไปฌๆ›พ็ปๆœๅŠก่ฟ‡็š„ไบงๅ“็บฟๅ…„ๅผŸไปฌ๏ผŒ็›ธไฟกๆœฌๆ–‡็š„ๆ— ๆ•ฐ็ป†่Š‚่ƒฝๅ’Œไฝ ไปฌ็š„ๅฐ่ฑกๅฏน็…ง๏ผŒๅฐ่ฏๆˆ‘็š„่ฏดๆณ•ใ€‚ไฝ ไปฌๅฏ่ƒฝไนŸๆ›พ็ป่ขซ่’™้ช—๏ผŒไฝ†่ฟ™ไบ›ๆฎ‹้…ท็š„็œŸ็›ธไธไผš่ขซๅฐ˜ๅฐใ€‚ๆˆ‘ไปฌๅฅ‹ๆˆ˜่ฟ‡็š„็—•่ฟน๏ผŒไนŸไธๅบ”่ฏฅ่ขซๆ‰ญๆ›ฒๅ’ŒๅŸ‹่‘ฌใ€‚ - -ๅ†™ไบ†่ฟ™ไนˆๅคš๏ผŒๆŸไบ›ไบบ่‚ฏๅฎšๆƒณๆŠŠๆˆ‘ๆ‰พๅ‡บๆฅ๏ผŒๆŠนๆ€ๆމใ€‚ๅ…ฌๅธๆžไธๅฅฝไนŸๆƒณ่ฎฉๆˆ‘ๅ™คๅฃฐไนƒ่‡ณ่ฟฝ่ดฃใ€‚ๅฆ‚ๆžœ็œŸ็š„่ฟ™ๆ ท๏ผŒๆˆ‘๏ผŒไนƒ่‡ณๆˆ‘็š„ๅฎถไบบ็š„ไบบ่บซไนƒ่‡ณ็”Ÿๅ‘ฝๅฎ‰ๅ…จๅฏ่ƒฝ้ƒฝไผšๅ—ๅˆฐๅจ่ƒใ€‚ไธบไบ†่‡ชๆˆ‘ไฟๆŠค๏ผŒๆˆ‘่ฟ‘ๆœŸๆฏๅคฉไผš่ทŸๅคงๅฎถๆŠฅๅนณๅฎ‰ใ€‚ - -ๅฆ‚ๆžœๆˆ‘ๆถˆๅคฑไบ†๏ผŒๅฐฑๅฝ“ๆ˜ฏๆˆ‘ไธบไบ†็œŸ็†ๅ’Œ็†ๆƒณ๏ผŒไธบไบ†ๅŽไธบไนƒ่‡ณไธญๅ›ฝ่ƒฝๅคŸๆ›ดๅฅฝๅœฐๅ‘ๅฑ•็ฎ—ๅŠ›ๅ’ŒAI่€Œ็‰บ็‰ฒไบ†ๅง๏ผŒๆˆ‘ๆ„ฟๅŸ‹่‘ฌไบŽ้‚ฃ็‰‡ๆ›พ็ปๅฅ‹ๆ–—่ฟ‡็š„ๅœฐๆ–นใ€‚ - -่ฏบไบš๏ผŒๅ†่ง - -2025ๅนด7ๆœˆ6ๆ—ฅๅ‡Œๆ™จ ๅ†™ไบŽๆทฑๅœณ - ---- - -ๅ„ไฝๅฅฝ๏ผŒ - -ๆ„Ÿ่ฐขๅคงๅฎถ็š„ๅ…ณๅฟƒไธŽ็ฅ็ฆใ€‚ๆˆ‘็›ฎๅ‰ๆš‚ๆ—ถๅฎ‰ๅ…จ๏ผŒไฝ†ๅ…ฌๅธๅบ”่ฏฅๅœจ่ฟ›่กŒๆŽ’ๆŸฅไธŽๆŸไบ›ๅๅ•ๆ”ถ้›†๏ผŒๅŽ็ปญๆƒ…ๅ†ตๆœช็Ÿฅใ€‚ - -ๆˆ‘่กฅๅ……ไธ€ไบ›็ป†่Š‚๏ผŒไปฅๅ…ๆŸไบ›ไบบ็ปง็ปญ้ข ๅ€’้ป‘็™ฝใ€‚ - -ๅ…ณไบŽ135B V2๏ผŒๅฐๆจกๅž‹ๅฎž้ชŒๅฎคๅœจ่ฟ…้€ŸๅœฐๅฎŒๆˆๅฅ—ๅฃณๅนถๆ‹ฟๅฎŒๆ‰€ๆœ‰ๅฅ—ๅฃณๅธฆๆฅ็š„ๅฅฝๅค„ๅŽ๏ผˆๆฏ”ๅฆ‚ไปปๅŠกไปค่กจๅฝฐๅ’ŒๅŠๆ—ถๆฟ€ๅŠฑ๏ผ‰๏ผŒๅ› ไธบไธๆƒณ็ปง็ปญๆ”ฏๆ’‘ไธ‹ๆธธๅบ”็”จๅ’Œๆจกๅž‹่ฟญไปฃ๏ผŒๅˆๆŠŠ่ฟ™ไธช็ƒซๆ‰‹ๅฑฑ่Š‹็”ฉ็ป™ไบ†ๅ››็บตใ€‚็กฎๅฎžๆŠ€้ซ˜ไธ€็ญน๏ผŒ็›ดๆŽฅๆŠŠๅ››็บต็š„ๅ…„ๅผŸไปฌๆ‹‰ไธ‹ๆฐดใ€‚ๅŒไบ‹ๆไพ›่ฟ‡ๅŽปไธ€ไธช่€ๆ—ง็š„ๆจกๅž‹๏ผŒๆœ€็ปˆๆ‹ฟๅ›žไบ†ไธ€ไธชๅฝ“ๆ—ถไธ€ไธช้ญ”ๆ”น็š„ๅ…ˆ่ฟ›็š„ๅƒ้—ฎใ€‚ๅšๅคงๆจกๅž‹็š„ไบบ๏ผŒ่‡ชๅทฑๅš็š„ๆจกๅž‹ๅฐฑๅƒ่‡ชๅทฑๅญฉๅญไธ€ๆ ท็†Ÿๆ‚‰๏ผŒไธ่ฆๆŠŠๅˆซไบบ้ƒฝๅฝ“ๅ‚ปๅญใ€‚ๅฐฑๅƒ่‡ชๅฎถๅ„ฟๅญๅ‡บ้—จไธ€่ถŸ๏ผŒๅ›žๆฅไธชๅˆซไบบๅฎถๅญฉๅญใ€‚ - -็›˜ๅคreport็š„็ฝฒๅๆ˜ฏไธ็ฌฆๅˆๅญฆๆœฏ่ง„่Œƒ็š„ใ€‚ไพ‹ๅฆ‚๏ผŒ135B V3ๆœ‰ไธๅฐ‘ๆœ‰ๆŠ€ๆœฏ่ดก็Œฎ็š„ไบบ๏ผŒๅ› ไธบไฝœ่€…ๅ้ขๆ•ฐ้‡้™ๅˆถ๏ผŒๅŠณๅŠจๆˆๆžœๆฒกๆœ‰ๅพ—ๅˆฐๅบ”ๆœ‰็š„ๅ›žๆŠฅ๏ผŒๅ›ข้˜Ÿๅ†…ๆ›พ็ปๆœ‰ไธๅฐ็š„ๆ„่งใ€‚่ฟ™ไธชๆจกๅž‹ๅฝ“ๆ—ถๆ˜ฏๅคงๅฎถๆ™บๆ…งๅ’Œๆฑ—ๆฐด็š„็ป“ๆ™ถ๏ผŒ็”š่‡ณๆ˜ฏๅ›ข้˜Ÿๅฝ“ๆ—ถ็š„็ฒพ็ฅžๆ”ฏๆŸฑ๏ผŒๆ”ฏๆ’‘็€ไธๅฐ‘ๅ…„ๅผŸไปฌ็ปง็ปญ็•™ๅœจ่ฏบไบšใ€‚ๆ‰€่ฐ“็š„ๅ้ข้™ๅˆถ๏ผŒไปฅๅŠๆŒ‚ๅไบ†ไธ€ไบ›ๆฏซๆ— ๆŠ€ๆœฏ่ดก็Œฎ็š„ไบบ๏ผˆๅฆ‚ไธ€ไบ›ๅฐๆจกๅž‹ๅฎž้ชŒๅฎค็š„ไบบ๏ผ‰๏ผŒ่ฎฉๅ…„ๅผŸไปฌไฝ•ๅ…ถๅฟƒๅฏ’ใ€‚ - ---- - -ๆš‚ๆ—ถๅนณๅฎ‰ใ€‚ๅฆๅค–๏ผŒๆ”ฏๆŒๆˆ‘ๅ‹‡ไบŽ่ฏดๅ‡บ็œŸ็›ธ็š„ๆˆ˜ๅ‹ไปฌ https://github.com/HW-whistleblower/True-Story-of-Pangu/issues/317 diff --git a/examples/document_search.py b/examples/document_search.py deleted file mode 100644 index fdb9167..0000000 --- a/examples/document_search.py +++ /dev/null @@ -1,158 +0,0 @@ -#!/usr/bin/env python3 -""" -Document search demo with recompute mode -""" - -import shutil -import time -from pathlib import Path - -# Import backend packages to trigger plugin 
registration -try: - import leann_backend_diskann # noqa: F401 - import leann_backend_hnsw # noqa: F401 - - print("INFO: Backend packages imported successfully.") -except ImportError as e: - print(f"WARNING: Could not import backend packages. Error: {e}") - -# Import upper-level API from leann-core -from leann.api import LeannBuilder, LeannChat, LeannSearcher - - -def load_sample_documents(): - """Create sample documents for demonstration""" - docs = [ - { - "title": "Intro to Python", - "content": "Python is a high-level, interpreted language known for simplicity.", - }, - { - "title": "ML Basics", - "content": "Machine learning builds systems that learn from data.", - }, - { - "title": "Data Structures", - "content": "Data structures like arrays, lists, and graphs organize data.", - }, - ] - return docs - - -def main(): - print("==========================================================") - print("=== Leann Document Search Demo (DiskANN + Recompute) ===") - print("==========================================================") - - INDEX_DIR = Path("./test_indices") - INDEX_PATH = str(INDEX_DIR / "documents.diskann") - BACKEND_TO_TEST = "diskann" - - if INDEX_DIR.exists(): - print(f"--- Cleaning up old index directory: {INDEX_DIR} ---") - shutil.rmtree(INDEX_DIR) - - # --- 1. Build index --- - print(f"\n[PHASE 1] Building index using '{BACKEND_TO_TEST}' backend...") - - builder = LeannBuilder(backend_name=BACKEND_TO_TEST, graph_degree=32, complexity=64) - - documents = load_sample_documents() - print(f"Loaded {len(documents)} sample documents.") - for doc in documents: - builder.add_text(doc["content"], metadata={"title": doc["title"]}) - - builder.build_index(INDEX_PATH) - print("\nIndex built!") - - # --- 2. Basic search demo --- - print(f"\n[PHASE 2] Basic search using '{BACKEND_TO_TEST}' backend...") - searcher = LeannSearcher(index_path=INDEX_PATH) - - query = "What is machine learning?" - print(f"\nQuery: '{query}'") - - print("\n--- Basic search mode (PQ computation) ---") - start_time = time.time() - results = searcher.search(query, top_k=2) - basic_time = time.time() - start_time - - print(f"⏱️ Basic search time: {basic_time:.3f} seconds") - print(">>> Basic search results <<<") - for i, res in enumerate(results, 1): - print( - f" {i}. ID: {res.id}, Score: {res.score:.4f}, Text: '{res.text}', Metadata: {res.metadata}" - ) - - # --- 3. 
Recompute search demo --- - print("\n[PHASE 3] Recompute search using embedding server...") - - print("\n--- Recompute search mode (get real embeddings via network) ---") - - # Configure recompute parameters - recompute_params = { - "recompute_beighbor_embeddings": True, # Enable network recomputation - "USE_DEFERRED_FETCH": False, # Don't use deferred fetch - "skip_search_reorder": True, # Skip search reordering - "dedup_node_dis": True, # Enable node distance deduplication - "prune_ratio": 0.1, # Pruning ratio 10% - "batch_recompute": False, # Don't use batch recomputation - "global_pruning": False, # Don't use global pruning - "zmq_port": 5555, # ZMQ port - "embedding_model": "sentence-transformers/all-mpnet-base-v2", - } - - print("Recompute parameter configuration:") - for key, value in recompute_params.items(): - print(f" {key}: {value}") - - print("\n🔄 Executing Recompute search...") - try: - start_time = time.time() - recompute_results = searcher.search(query, top_k=2, **recompute_params) - recompute_time = time.time() - start_time - - print(f"⏱️ Recompute search time: {recompute_time:.3f} seconds") - print(">>> Recompute search results <<<") - for i, res in enumerate(recompute_results, 1): - print( - f" {i}. ID: {res.id}, Score: {res.score:.4f}, Text: '{res.text}', Metadata: {res.metadata}" - ) - - # Compare results - print("\n--- Result comparison ---") - print(f"Basic search time: {basic_time:.3f} seconds") - print(f"Recompute time: {recompute_time:.3f} seconds") - - print("\nBasic search vs Recompute results:") - for i in range(min(len(results), len(recompute_results))): - basic_score = results[i].score - recompute_score = recompute_results[i].score - score_diff = abs(basic_score - recompute_score) - print( - f" Position {i + 1}: PQ={basic_score:.4f}, Recompute={recompute_score:.4f}, Difference={score_diff:.4f}" - ) - - if recompute_time > basic_time: - print("✅ Recompute mode working correctly (more accurate but slower)") - else: - print("ℹ️ Recompute time is unusually fast, network recomputation may not be enabled") - - except Exception as e: - print(f"❌ Recompute search failed: {e}") - print("This usually indicates an embedding server connection issue") - - # --- 4. 
Chat demo --- - print("\n[PHASE 4] Starting chat session...") - chat = LeannChat(index_path=INDEX_PATH) - chat_response = chat.ask(query) - print(f"You: {query}") - print(f"Leann: {chat_response}") - - print("\n==========================================================") - print("✅ Demo finished successfully!") - print("==========================================================") - - -if __name__ == "__main__": - main()
diff --git a/examples/google_history_reader_leann.py b/examples/google_history_reader_leann.py deleted file mode 100644 index 62781a4..0000000 --- a/examples/google_history_reader_leann.py +++ /dev/null @@ -1,362 +0,0 @@ -import argparse -import asyncio -import os - -try: - import dotenv - - dotenv.load_dotenv() -except ModuleNotFoundError: - # python-dotenv is not installed; skip loading environment variables - dotenv = None -from pathlib import Path - -from leann.api import LeannBuilder, LeannChat -from llama_index.core.node_parser import SentenceSplitter - -# dotenv.load_dotenv() # handled above if python-dotenv is available - -# Default Chrome profile path -DEFAULT_CHROME_PROFILE = os.path.expanduser("~/Library/Application Support/Google/Chrome/Default") - - -def create_leann_index_from_multiple_chrome_profiles( - profile_dirs: list[Path], - index_path: str = "chrome_history_index.leann", - max_count: int = -1, - embedding_model: str = "facebook/contriever", - embedding_mode: str = "sentence-transformers", -): - """ - Create LEANN index from multiple Chrome profile data sources. - - Args: - profile_dirs: List of Path objects pointing to Chrome profile directories - index_path: Path to save the LEANN index - max_count: Maximum number of history entries to process per profile - embedding_model: The embedding model to use - embedding_mode: The embedding backend mode - """ - print("Creating LEANN index from multiple Chrome profile data sources...") - - # Load documents using ChromeHistoryReader from history_data - from history_data.history import ChromeHistoryReader - - reader = ChromeHistoryReader() - - INDEX_DIR = Path(index_path).parent - - if not INDEX_DIR.exists(): - print("--- Index directory not found, building new index ---") - all_documents = [] - total_processed = 0 - - # Process each Chrome profile directory - for i, profile_dir in enumerate(profile_dirs): - print(f"\nProcessing Chrome profile {i + 1}/{len(profile_dirs)}: {profile_dir}") - - try: - documents = reader.load_data( - chrome_profile_path=str(profile_dir), max_count=max_count - ) - if documents: - print(f"Loaded {len(documents)} history documents from {profile_dir}") - all_documents.extend(documents) - total_processed += len(documents) - - # Check if we've reached the max count - if max_count > 0 and total_processed >= max_count: - print(f"Reached max count of {max_count} documents") - break - else: - print(f"No documents loaded from {profile_dir}") - except Exception as e: - print(f"Error processing {profile_dir}: {e}") - continue - - if not all_documents: - print("No documents loaded from any source. Exiting.") - # highlight info that you need to close all chrome browser before running this script and high light the instruction!! 
diff --git a/examples/google_history_reader_leann.py b/examples/google_history_reader_leann.py
deleted file mode 100644
index 62781a4..0000000
--- a/examples/google_history_reader_leann.py
+++ /dev/null
@@ -1,362 +0,0 @@
-import argparse
-import asyncio
-import os
-
-try:
-    import dotenv
-
-    dotenv.load_dotenv()
-except ModuleNotFoundError:
-    # python-dotenv is not installed; skip loading environment variables
-    dotenv = None
-from pathlib import Path
-
-from leann.api import LeannBuilder, LeannChat
-from llama_index.core.node_parser import SentenceSplitter
-
-# dotenv.load_dotenv()  # handled above if python-dotenv is available
-
-# Default Chrome profile path
-DEFAULT_CHROME_PROFILE = os.path.expanduser("~/Library/Application Support/Google/Chrome/Default")
-
-
-def create_leann_index_from_multiple_chrome_profiles(
-    profile_dirs: list[Path],
-    index_path: str = "chrome_history_index.leann",
-    max_count: int = -1,
-    embedding_model: str = "facebook/contriever",
-    embedding_mode: str = "sentence-transformers",
-):
-    """
-    Create LEANN index from multiple Chrome profile data sources.
-
-    Args:
-        profile_dirs: List of Path objects pointing to Chrome profile directories
-        index_path: Path to save the LEANN index
-        max_count: Maximum number of history entries to process per profile
-        embedding_model: The embedding model to use
-        embedding_mode: The embedding backend mode
-    """
-    print("Creating LEANN index from multiple Chrome profile data sources...")
-
-    # Load documents using ChromeHistoryReader from history_data
-    from history_data.history import ChromeHistoryReader
-
-    reader = ChromeHistoryReader()
-
-    INDEX_DIR = Path(index_path).parent
-
-    if not INDEX_DIR.exists():
-        print("--- Index directory not found, building new index ---")
-        all_documents = []
-        total_processed = 0
-
-        # Process each Chrome profile directory
-        for i, profile_dir in enumerate(profile_dirs):
-            print(f"\nProcessing Chrome profile {i + 1}/{len(profile_dirs)}: {profile_dir}")
-
-            try:
-                documents = reader.load_data(
-                    chrome_profile_path=str(profile_dir), max_count=max_count
-                )
-                if documents:
-                    print(f"Loaded {len(documents)} history documents from {profile_dir}")
-                    all_documents.extend(documents)
-                    total_processed += len(documents)
-
-                    # Check if we've reached the max count
-                    if max_count > 0 and total_processed >= max_count:
-                        print(f"Reached max count of {max_count} documents")
-                        break
-                else:
-                    print(f"No documents loaded from {profile_dir}")
-            except Exception as e:
-                print(f"Error processing {profile_dir}: {e}")
-                continue
-
-        if not all_documents:
-            print("No documents loaded from any source. Exiting.")
-            # Remind the user, highlighted in red, to quit all Chrome browsers before running this script
-            print(
-                "\033[91mYou need to close or quit all Chrome browsers before running this script\033[0m"
-            )
-            return None
-
-        print(
-            f"\nTotal loaded {len(all_documents)} history documents from {len(profile_dirs)} profiles"
-        )
-
-        # Create text splitter with 256 chunk size
-        text_splitter = SentenceSplitter(chunk_size=256, chunk_overlap=128)
-
-        # Convert Documents to text strings and chunk them
-        all_texts = []
-        for doc in all_documents:
-            # Split the document into chunks
-            nodes = text_splitter.get_nodes_from_documents([doc])
-            for node in nodes:
-                text = node.get_content()
-                # text = '[Title] ' + doc.metadata["title"] + '\n' + text
-                all_texts.append(text)
-
-        print(f"Created {len(all_texts)} text chunks from {len(all_documents)} documents")
-
-        # Create LEANN index directory
-        print("--- Index directory not found, building new index ---")
-        INDEX_DIR.mkdir(exist_ok=True)
-
-        print("--- Building new LEANN index ---")
-
-        print("\n[PHASE 1] Building Leann index...")
-
-        # Use HNSW backend for better macOS compatibility
-        # LeannBuilder will automatically detect normalized embeddings and set appropriate distance metric
-        builder = LeannBuilder(
-            backend_name="hnsw",
-            embedding_model=embedding_model,
-            embedding_mode=embedding_mode,
-            graph_degree=32,
-            complexity=64,
-            is_compact=True,
-            is_recompute=True,
-            num_threads=1,  # Force single-threaded mode
-        )
-
-        print(f"Adding {len(all_texts)} history chunks to index...")
-        for chunk_text in all_texts:
-            builder.add_text(chunk_text)
-
-        builder.build_index(index_path)
-        print(f"\nLEANN index built at {index_path}!")
-    else:
-        print(f"--- Using existing index at {INDEX_DIR} ---")
-
-    return index_path
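Both Chrome entry points funnel into the same `LeannBuilder` configuration; `main()` below auto-discovers profiles before calling the multi-profile builder. A minimal standalone sketch of that flow (the cap is a hypothetical value for a quick smoke test):

```python
from history_data.history import ChromeHistoryReader

# Discover every Chrome profile on this machine and build one merged index.
profile_dirs = ChromeHistoryReader.find_chrome_profiles()
if profile_dirs:
    create_leann_index_from_multiple_chrome_profiles(
        profile_dirs,
        index_path="./google_history_index/chrome_history.leann",
        max_count=500,  # hypothetical cap for a quick smoke test
    )
```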
Exiting.") - return None - - print(f"Loaded {len(documents)} history documents") - - # Create text splitter with 256 chunk size - text_splitter = SentenceSplitter(chunk_size=256, chunk_overlap=25) - - # Convert Documents to text strings and chunk them - all_texts = [] - for doc in documents: - # Split the document into chunks - nodes = text_splitter.get_nodes_from_documents([doc]) - for node in nodes: - all_texts.append(node.get_content()) - - print(f"Created {len(all_texts)} text chunks from {len(documents)} documents") - - # Create LEANN index directory - print("--- Index directory not found, building new index ---") - INDEX_DIR.mkdir(exist_ok=True) - - print("--- Building new LEANN index ---") - - print("\n[PHASE 1] Building Leann index...") - - # Use HNSW backend for better macOS compatibility - # LeannBuilder will automatically detect normalized embeddings and set appropriate distance metric - builder = LeannBuilder( - backend_name="hnsw", - embedding_model=embedding_model, - embedding_mode=embedding_mode, - graph_degree=32, - complexity=64, - is_compact=True, - is_recompute=True, - num_threads=1, # Force single-threaded mode - ) - - print(f"Adding {len(all_texts)} history chunks to index...") - for chunk_text in all_texts: - builder.add_text(chunk_text) - - builder.build_index(index_path) - print(f"\nLEANN index built at {index_path}!") - else: - print(f"--- Using existing index at {INDEX_DIR} ---") - - return index_path - - -async def query_leann_index(index_path: str, query: str): - """ - Query the LEANN index. - - Args: - index_path: Path to the LEANN index - query: The query string - """ - print("\n[PHASE 2] Starting Leann chat session...") - chat = LeannChat(index_path=index_path) - - print(f"You: {query}") - chat_response = chat.ask( - query, - top_k=10, - recompute_beighbor_embeddings=True, - complexity=32, - beam_width=1, - llm_config={ - "type": "openai", - "model": "gpt-4o", - "api_key": os.getenv("OPENAI_API_KEY"), - }, - llm_kwargs={"temperature": 0.0, "max_tokens": 1000}, - ) - - print(f"Leann chat response: \033[36m{chat_response}\033[0m") - - -async def main(): - # Parse command line arguments - parser = argparse.ArgumentParser( - description="LEANN Chrome History Reader - Create and query browser history index" - ) - parser.add_argument( - "--chrome-profile", - type=str, - default=DEFAULT_CHROME_PROFILE, - help=f"Path to Chrome profile directory (default: {DEFAULT_CHROME_PROFILE}), usually you dont need to change this", - ) - parser.add_argument( - "--index-dir", - type=str, - default="./google_history_index", - help="Directory to store the LEANN index (default: ./chrome_history_index_leann_test)", - ) - parser.add_argument( - "--max-entries", - type=int, - default=1000, - help="Maximum number of history entries to process (default: 1000)", - ) - parser.add_argument( - "--query", - type=str, - default=None, - help="Single query to run (default: runs example queries)", - ) - parser.add_argument( - "--auto-find-profiles", - action="store_true", - default=True, - help="Automatically find all Chrome profiles (default: True)", - ) - parser.add_argument( - "--embedding-model", - type=str, - default="facebook/contriever", - help="The embedding model to use (e.g., 'facebook/contriever', 'text-embedding-3-small')", - ) - parser.add_argument( - "--embedding-mode", - type=str, - default="sentence-transformers", - choices=["sentence-transformers", "openai", "mlx"], - help="The embedding backend mode", - ) - parser.add_argument( - "--use-existing-index", - action="store_true", - 
help="Use existing index without rebuilding", - ) - - args = parser.parse_args() - - INDEX_DIR = Path(args.index_dir) - INDEX_PATH = str(INDEX_DIR / "chrome_history.leann") - - print(f"Using Chrome profile: {args.chrome_profile}") - print(f"Index directory: {INDEX_DIR}") - print(f"Max entries: {args.max_entries}") - - if args.use_existing_index: - # Use existing index without rebuilding - if not Path(INDEX_PATH).exists(): - print(f"Error: Index file not found at {INDEX_PATH}") - return - print(f"Using existing index at {INDEX_PATH}") - index_path = INDEX_PATH - else: - # Find Chrome profile directories - from history_data.history import ChromeHistoryReader - - if args.auto_find_profiles: - profile_dirs = ChromeHistoryReader.find_chrome_profiles() - if not profile_dirs: - print("No Chrome profiles found automatically. Exiting.") - return - else: - # Use single specified profile - profile_path = Path(args.chrome_profile) - if not profile_path.exists(): - print(f"Chrome profile not found: {profile_path}") - return - profile_dirs = [profile_path] - - # Create or load the LEANN index from all sources - index_path = create_leann_index_from_multiple_chrome_profiles( - profile_dirs, INDEX_PATH, args.max_entries, args.embedding_model, args.embedding_mode - ) - - if index_path: - if args.query: - # Run single query - await query_leann_index(index_path, args.query) - else: - # Example queries - queries = [ - "What websites did I visit about machine learning?", - "Find my search history about programming", - ] - - for query in queries: - print("\n" + "=" * 60) - await query_leann_index(index_path, query) - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/examples/mail_reader_leann.py b/examples/mail_reader_leann.py deleted file mode 100644 index 6aa7536..0000000 --- a/examples/mail_reader_leann.py +++ /dev/null @@ -1,342 +0,0 @@ -import argparse -import asyncio -import os -import sys -from pathlib import Path - -import dotenv - -# Add the project root to Python path so we can import from examples -project_root = Path(__file__).parent.parent -sys.path.insert(0, str(project_root)) - -from leann.api import LeannBuilder, LeannChat -from llama_index.core.node_parser import SentenceSplitter - -dotenv.load_dotenv() - - -# Auto-detect user's mail path -def get_mail_path(): - """Get the mail path for the current user""" - home_dir = os.path.expanduser("~") - return os.path.join(home_dir, "Library", "Mail") - - -# Default mail path for macOS -DEFAULT_MAIL_PATH = "/Users/yichuan/Library/Mail/V10/0FCA0879-FD8C-4B7E-83BF-FDDA930791C5/[Gmail].mbox/All Mail.mbox/78BA5BE1-8819-4F9A-9613-EB63772F1DD0/Data" - - -def create_leann_index_from_multiple_sources( - messages_dirs: list[Path], - index_path: str = "mail_index.leann", - max_count: int = -1, - include_html: bool = False, - embedding_model: str = "facebook/contriever", -): - """ - Create LEANN index from multiple mail data sources. 
- - Args: - messages_dirs: List of Path objects pointing to Messages directories - index_path: Path to save the LEANN index - max_count: Maximum number of emails to process per directory - include_html: Whether to include HTML content in email processing - """ - print("Creating LEANN index from multiple mail data sources...") - - # Load documents using EmlxReader from LEANN_email_reader - from examples.email_data.LEANN_email_reader import EmlxReader - - reader = EmlxReader(include_html=include_html) - # from email_data.email import EmlxMboxReader - # from pathlib import Path - # reader = EmlxMboxReader() - INDEX_DIR = Path(index_path).parent - - if not INDEX_DIR.exists(): - print("--- Index directory not found, building new index ---") - all_documents = [] - total_processed = 0 - - # Process each Messages directory - for i, messages_dir in enumerate(messages_dirs): - print(f"\nProcessing Messages directory {i + 1}/{len(messages_dirs)}: {messages_dir}") - - try: - documents = reader.load_data(messages_dir) - if documents: - print(f"Loaded {len(documents)} email documents from {messages_dir}") - all_documents.extend(documents) - total_processed += len(documents) - - # Check if we've reached the max count - if max_count > 0 and total_processed >= max_count: - print(f"Reached max count of {max_count} documents") - break - else: - print(f"No documents loaded from {messages_dir}") - except Exception as e: - print(f"Error processing {messages_dir}: {e}") - continue - - if not all_documents: - print("No documents loaded from any source. Exiting.") - return None - - print( - f"\nTotal loaded {len(all_documents)} email documents from {len(messages_dirs)} directories and starting to split them into chunks" - ) - - # Create text splitter with 256 chunk size - text_splitter = SentenceSplitter(chunk_size=256, chunk_overlap=25) - - # Convert Documents to text strings and chunk them - all_texts = [] - for doc in all_documents: - # Split the document into chunks - nodes = text_splitter.get_nodes_from_documents([doc]) - for node in nodes: - text = node.get_content() - # text = '[subject] ' + doc.metadata["subject"] + '\n' + text - all_texts.append(text) - - print( - f"Finished splitting {len(all_documents)} documents into {len(all_texts)} text chunks" - ) - - # Create LEANN index directory - - print("--- Index directory not found, building new index ---") - INDEX_DIR.mkdir(exist_ok=True) - - print("--- Building new LEANN index ---") - - print("\n[PHASE 1] Building Leann index...") - - # Use HNSW backend for better macOS compatibility - builder = LeannBuilder( - backend_name="hnsw", - embedding_model=embedding_model, - graph_degree=32, - complexity=64, - is_compact=True, - is_recompute=True, - num_threads=1, # Force single-threaded mode - ) - - print(f"Adding {len(all_texts)} email chunks to index...") - for chunk_text in all_texts: - builder.add_text(chunk_text) - - builder.build_index(index_path) - print(f"\nLEANN index built at {index_path}!") - else: - print(f"--- Using existing index at {INDEX_DIR} ---") - - return index_path - - -def create_leann_index( - mail_path: str, - index_path: str = "mail_index.leann", - max_count: int = 1000, - include_html: bool = False, - embedding_model: str = "facebook/contriever", -): - """ - Create LEANN index from mail data. 
- - Args: - mail_path: Path to the mail directory - index_path: Path to save the LEANN index - max_count: Maximum number of emails to process - include_html: Whether to include HTML content in email processing - """ - print("Creating LEANN index from mail data...") - INDEX_DIR = Path(index_path).parent - - if not INDEX_DIR.exists(): - print("--- Index directory not found, building new index ---") - INDEX_DIR.mkdir(exist_ok=True) - - print("--- Building new LEANN index ---") - - print("\n[PHASE 1] Building Leann index...") - - # Load documents using EmlxReader from LEANN_email_reader - from examples.email_data.LEANN_email_reader import EmlxReader - - reader = EmlxReader(include_html=include_html) - # from email_data.email import EmlxMboxReader - # from pathlib import Path - # reader = EmlxMboxReader() - documents = reader.load_data(Path(mail_path)) - - if not documents: - print("No documents loaded. Exiting.") - return None - - print(f"Loaded {len(documents)} email documents") - - # Create text splitter with 256 chunk size - text_splitter = SentenceSplitter(chunk_size=256, chunk_overlap=128) - - # Convert Documents to text strings and chunk them - all_texts = [] - for doc in documents: - # Split the document into chunks - nodes = text_splitter.get_nodes_from_documents([doc]) - for node in nodes: - all_texts.append(node.get_content()) - - print(f"Created {len(all_texts)} text chunks from {len(documents)} documents") - - # Create LEANN index directory - - print("--- Index directory not found, building new index ---") - INDEX_DIR.mkdir(exist_ok=True) - - print("--- Building new LEANN index ---") - - print("\n[PHASE 1] Building Leann index...") - - # Use HNSW backend for better macOS compatibility - builder = LeannBuilder( - backend_name="hnsw", - embedding_model=embedding_model, - graph_degree=32, - complexity=64, - is_compact=True, - is_recompute=True, - num_threads=1, # Force single-threaded mode - ) - - print(f"Adding {len(all_texts)} email chunks to index...") - for chunk_text in all_texts: - builder.add_text(chunk_text) - - builder.build_index(index_path) - print(f"\nLEANN index built at {index_path}!") - else: - print(f"--- Using existing index at {INDEX_DIR} ---") - - return index_path - - -async def query_leann_index(index_path: str, query: str): - """ - Query the LEANN index. 
- - Args: - index_path: Path to the LEANN index - query: The query string - """ - print("\n[PHASE 2] Starting Leann chat session...") - chat = LeannChat(index_path=index_path, llm_config={"type": "openai", "model": "gpt-4o"}) - - print(f"You: {query}") - import time - - time.time() - chat_response = chat.ask( - query, - top_k=20, - recompute_beighbor_embeddings=True, - complexity=32, - beam_width=1, - ) - time.time() - # print(f"Time taken: {end_time - start_time} seconds") - # highlight the answer - print(f"Leann chat response: \033[36m{chat_response}\033[0m") - - -async def main(): - # Parse command line arguments - parser = argparse.ArgumentParser(description="LEANN Mail Reader - Create and query email index") - # Remove --mail-path argument and auto-detect all Messages directories - # Remove DEFAULT_MAIL_PATH - parser.add_argument( - "--index-dir", - type=str, - default="./mail_index", - help="Directory to store the LEANN index (default: ./mail_index_leann_raw_text_all_dicts)", - ) - parser.add_argument( - "--max-emails", - type=int, - default=1000, - help="Maximum number of emails to process (-1 means all)", - ) - parser.add_argument( - "--query", - type=str, - default="Give me some funny advertisement about apple or other companies", - help="Single query to run (default: runs example queries)", - ) - parser.add_argument( - "--include-html", - action="store_true", - default=False, - help="Include HTML content in email processing (default: False)", - ) - parser.add_argument( - "--embedding-model", - type=str, - default="facebook/contriever", - help="Embedding model to use (default: facebook/contriever)", - ) - - args = parser.parse_args() - - print(f"args: {args}") - - # Automatically find all Messages directories under the current user's Mail directory - from examples.email_data.LEANN_email_reader import find_all_messages_directories - - mail_path = get_mail_path() - print(f"Searching for email data in: {mail_path}") - messages_dirs = find_all_messages_directories(mail_path) - # messages_dirs = find_all_messages_directories(DEFAULT_MAIL_PATH) - # messages_dirs = [DEFAULT_MAIL_PATH] - # messages_dirs = messages_dirs[:1] - - print("len(messages_dirs): ", len(messages_dirs)) - - if not messages_dirs: - print("No Messages directories found. 
Exiting.") - return - - INDEX_DIR = Path(args.index_dir) - INDEX_PATH = str(INDEX_DIR / "mail_documents.leann") - print(f"Index directory: {INDEX_DIR}") - print(f"Found {len(messages_dirs)} Messages directories.") - - # Create or load the LEANN index from all sources - index_path = create_leann_index_from_multiple_sources( - messages_dirs, - INDEX_PATH, - args.max_emails, - args.include_html, - args.embedding_model, - ) - - if index_path: - if args.query: - # Run single query - await query_leann_index(index_path, args.query) - else: - # Example queries - queries = [ - "Hows Berkeley Graduate Student Instructor", - "how's the icloud related advertisement saying", - "Whats the number of class recommend to take per semester for incoming EECS students", - ] - for query in queries: - print("\n" + "=" * 60) - await query_leann_index(index_path, query) - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/examples/mail_reader_llamaindex.py b/examples/mail_reader_llamaindex.py deleted file mode 100644 index cfb6b82..0000000 --- a/examples/mail_reader_llamaindex.py +++ /dev/null @@ -1,135 +0,0 @@ -import argparse -import os -import sys -from pathlib import Path - -# Add the project root to Python path so we can import from examples -project_root = Path(__file__).parent.parent -sys.path.insert(0, str(project_root)) - -import torch -from llama_index.core import StorageContext, VectorStoreIndex -from llama_index.core.node_parser import SentenceSplitter - -# --- EMBEDDING MODEL --- -from llama_index.embeddings.huggingface import HuggingFaceEmbedding - -# --- END EMBEDDING MODEL --- -# Import EmlxReader from the new module -from examples.email_data.LEANN_email_reader import EmlxReader - - -def create_and_save_index( - mail_path: str, - save_dir: str = "mail_index_embedded", - max_count: int = 1000, - include_html: bool = False, -): - print("Creating index from mail data with embedded metadata...") - documents = EmlxReader(include_html=include_html).load_data(mail_path, max_count=max_count) - if not documents: - print("No documents loaded. 
Exiting.") - return None - text_splitter = SentenceSplitter(chunk_size=256, chunk_overlap=25) - # Use facebook/contriever as the embedder - embed_model = HuggingFaceEmbedding(model_name="facebook/contriever") - # set on device - - if torch.cuda.is_available(): - embed_model._model.to("cuda") - # set mps - elif torch.backends.mps.is_available(): - embed_model._model.to("mps") - else: - embed_model._model.to("cpu") - index = VectorStoreIndex.from_documents( - documents, transformations=[text_splitter], embed_model=embed_model - ) - os.makedirs(save_dir, exist_ok=True) - index.storage_context.persist(persist_dir=save_dir) - print(f"Index saved to {save_dir}") - return index - - -def load_index(save_dir: str = "mail_index_embedded"): - try: - storage_context = StorageContext.from_defaults(persist_dir=save_dir) - index = VectorStoreIndex.from_vector_store( - storage_context.vector_store, storage_context=storage_context - ) - print(f"Index loaded from {save_dir}") - return index - except Exception as e: - print(f"Error loading index: {e}") - return None - - -def query_index(index, query: str): - if index is None: - print("No index available for querying.") - return - query_engine = index.as_query_engine() - response = query_engine.query(query) - print(f"Query: {query}") - print(f"Response: {response}") - - -def main(): - # Parse command line arguments - parser = argparse.ArgumentParser( - description="LlamaIndex Mail Reader - Create and query email index" - ) - parser.add_argument( - "--mail-path", - type=str, - default="/Users/yichuan/Library/Mail/V10/0FCA0879-FD8C-4B7E-83BF-FDDA930791C5/[Gmail].mbox/All Mail.mbox/78BA5BE1-8819-4F9A-9613-EB63772F1DD0/Data/9/Messages", - help="Path to mail data directory", - ) - parser.add_argument( - "--save-dir", - type=str, - default="mail_index_embedded", - help="Directory to store the index (default: mail_index_embedded)", - ) - parser.add_argument( - "--max-emails", - type=int, - default=10000, - help="Maximum number of emails to process", - ) - parser.add_argument( - "--include-html", - action="store_true", - default=False, - help="Include HTML content in email processing (default: False)", - ) - - args = parser.parse_args() - - mail_path = args.mail_path - save_dir = args.save_dir - - if os.path.exists(save_dir) and os.path.exists(os.path.join(save_dir, "vector_store.json")): - print("Loading existing index...") - index = load_index(save_dir) - else: - print("Creating new index...") - index = create_and_save_index( - mail_path, - save_dir, - max_count=args.max_emails, - include_html=args.include_html, - ) - if index: - queries = [ - "Hows Berkeley Graduate Student Instructor", - "how's the icloud related advertisement saying", - "Whats the number of class recommend to take per semester for incoming EECS students", - ] - for query in queries: - print("\n" + "=" * 50) - query_index(index, query) - - -if __name__ == "__main__": - main() diff --git a/examples/main_cli_example.py b/examples/main_cli_example.py deleted file mode 100644 index 9b65725..0000000 --- a/examples/main_cli_example.py +++ /dev/null @@ -1,146 +0,0 @@ -import argparse -import asyncio -from pathlib import Path - -import dotenv -from leann.api import LeannBuilder, LeannChat -from llama_index.core import SimpleDirectoryReader -from llama_index.core.node_parser import SentenceSplitter - -dotenv.load_dotenv() - - -async def main(args): - INDEX_DIR = Path(args.index_dir) - INDEX_PATH = str(INDEX_DIR / "pdf_documents.leann") - - if not INDEX_DIR.exists(): - node_parser = SentenceSplitter( - 
chunk_size=256, chunk_overlap=128, separator=" ", paragraph_separator="\n\n" - ) - - print("Loading documents...") - documents = SimpleDirectoryReader( - args.data_dir, - recursive=True, - encoding="utf-8", - required_exts=[".pdf", ".txt", ".md"], - ).load_data(show_progress=True) - print("Documents loaded.") - all_texts = [] - for doc in documents: - nodes = node_parser.get_nodes_from_documents([doc]) - if nodes: - all_texts.extend(node.get_content() for node in nodes) - - print("--- Index directory not found, building new index ---") - - print("\n[PHASE 1] Building Leann index...") - - # LeannBuilder now automatically detects normalized embeddings and sets appropriate distance metric - print(f"Using {args.embedding_model} with {args.embedding_mode} mode") - - # Use HNSW backend for better macOS compatibility - builder = LeannBuilder( - backend_name="hnsw", - embedding_model=args.embedding_model, - embedding_mode=args.embedding_mode, - # distance_metric is automatically set based on embedding model - graph_degree=32, - complexity=64, - is_compact=True, - is_recompute=True, - num_threads=1, # Force single-threaded mode - ) - - print(f"Loaded {len(all_texts)} text chunks from documents.") - for chunk_text in all_texts: - builder.add_text(chunk_text) - - builder.build_index(INDEX_PATH) - print(f"\nLeann index built at {INDEX_PATH}!") - else: - print(f"--- Using existing index at {INDEX_DIR} ---") - - print("\n[PHASE 2] Starting Leann chat session...") - - # Build llm_config based on command line arguments - if args.llm == "simulated": - llm_config = {"type": "simulated"} - elif args.llm == "ollama": - llm_config = {"type": "ollama", "model": args.model, "host": args.host} - elif args.llm == "hf": - llm_config = {"type": "hf", "model": args.model} - elif args.llm == "openai": - llm_config = {"type": "openai", "model": args.model} - else: - raise ValueError(f"Unknown LLM type: {args.llm}") - - print(f"Using LLM: {args.llm} with model: {args.model if args.llm != 'simulated' else 'N/A'}") - - chat = LeannChat(index_path=INDEX_PATH, llm_config=llm_config) - # query = ( - # "ไป€ไนˆๆ˜ฏ็›˜ๅคๅคงๆจกๅž‹ไปฅๅŠ็›˜ๅคๅผ€ๅ‘่ฟ‡็จ‹ไธญ้‡ๅˆฐไบ†ไป€ไนˆ้˜ดๆš—้ข,ไปปๅŠกไปคไธ€่ˆฌๅœจไป€ไนˆๅŸŽๅธ‚้ขๅ‘" - # ) - query = args.query - - print(f"You: {query}") - chat_response = chat.ask(query, top_k=20, recompute_embeddings=True, complexity=32) - print(f"Leann chat response: \033[36m{chat_response}\033[0m") - - -if __name__ == "__main__": - parser = argparse.ArgumentParser(description="Run Leann Chat with various LLM backends.") - parser.add_argument( - "--llm", - type=str, - default="openai", - choices=["simulated", "ollama", "hf", "openai"], - help="The LLM backend to use.", - ) - parser.add_argument( - "--model", - type=str, - default="gpt-4o", - help="The model name to use (e.g., 'llama3:8b' for ollama, 'deepseek-ai/deepseek-llm-7b-chat' for hf, 'gpt-4o' for openai).", - ) - parser.add_argument( - "--embedding-model", - type=str, - default="facebook/contriever", - help="The embedding model to use (e.g., 'facebook/contriever', 'text-embedding-3-small').", - ) - parser.add_argument( - "--embedding-mode", - type=str, - default="sentence-transformers", - choices=["sentence-transformers", "openai", "mlx"], - help="The embedding backend mode.", - ) - parser.add_argument( - "--host", - type=str, - default="http://localhost:11434", - help="The host for the Ollama API.", - ) - parser.add_argument( - "--index-dir", - type=str, - default="./test_doc_files", - help="Directory where the Leann index will be stored.", - ) - 
parser.add_argument( - "--data-dir", - type=str, - default="examples/data", - help="Directory containing documents to index (PDF, TXT, MD files).", - ) - parser.add_argument( - "--query", - type=str, - default="Based on the paper, what are the main techniques LEANN explores to reduce the storage overhead and DLPM explore to achieve Fairness and Efiiciency trade-off?", - help="The query to ask the Leann chat system.", - ) - args = parser.parse_args() - - asyncio.run(main(args)) diff --git a/test/build_mlx_index.py b/examples/mlx_demo.py similarity index 100% rename from test/build_mlx_index.py rename to examples/mlx_demo.py diff --git a/examples/multi_vector_aggregator.py b/examples/multi_vector_aggregator.py deleted file mode 100644 index 3ec376a..0000000 --- a/examples/multi_vector_aggregator.py +++ /dev/null @@ -1,360 +0,0 @@ -#!/usr/bin/env python3 -""" -Multi-Vector Aggregator for Fat Embeddings -========================================== - -This module implements aggregation strategies for multi-vector embeddings, -similar to ColPali's approach where multiple patch vectors represent a single document. - -Key features: -- MaxSim aggregation (take maximum similarity across patches) -- Voting-based aggregation (count patch matches) -- Weighted aggregation (attention-score weighted) -- Spatial clustering of matching patches -- Document-level result consolidation -""" - -from collections import defaultdict -from dataclasses import dataclass -from typing import Any - -import numpy as np - - -@dataclass -class PatchResult: - """Represents a single patch search result.""" - - patch_id: int - image_name: str - image_path: str - coordinates: tuple[int, int, int, int] # (x1, y1, x2, y2) - score: float - attention_score: float - scale: float - metadata: dict[str, Any] - - -@dataclass -class AggregatedResult: - """Represents an aggregated document-level result.""" - - image_name: str - image_path: str - doc_score: float - patch_count: int - best_patch: PatchResult - all_patches: list[PatchResult] - aggregation_method: str - spatial_clusters: list[list[PatchResult]] | None = None - - -class MultiVectorAggregator: - """ - Aggregates multiple patch-level results into document-level results. - """ - - def __init__( - self, - aggregation_method: str = "maxsim", - spatial_clustering: bool = True, - cluster_distance_threshold: float = 100.0, - ): - """ - Initialize the aggregator. - - Args: - aggregation_method: "maxsim", "voting", "weighted", or "mean" - spatial_clustering: Whether to cluster spatially close patches - cluster_distance_threshold: Distance threshold for spatial clustering - """ - self.aggregation_method = aggregation_method - self.spatial_clustering = spatial_clustering - self.cluster_distance_threshold = cluster_distance_threshold - - def aggregate_results( - self, search_results: list[dict[str, Any]], top_k: int = 10 - ) -> list[AggregatedResult]: - """ - Aggregate patch-level search results into document-level results. 
- - Args: - search_results: List of search results from LeannSearcher - top_k: Number of top documents to return - - Returns: - List of aggregated document results - """ - # Group results by image - image_groups = defaultdict(list) - - for result in search_results: - metadata = result.metadata - if "image_name" in metadata and "patch_id" in metadata: - patch_result = PatchResult( - patch_id=metadata["patch_id"], - image_name=metadata["image_name"], - image_path=metadata["image_path"], - coordinates=tuple(metadata["coordinates"]), - score=result.score, - attention_score=metadata.get("attention_score", 0.0), - scale=metadata.get("scale", 1.0), - metadata=metadata, - ) - image_groups[metadata["image_name"]].append(patch_result) - - # Aggregate each image group - aggregated_results = [] - for image_name, patches in image_groups.items(): - if len(patches) == 0: - continue - - agg_result = self._aggregate_image_patches(image_name, patches) - aggregated_results.append(agg_result) - - # Sort by aggregated score and return top-k - aggregated_results.sort(key=lambda x: x.doc_score, reverse=True) - return aggregated_results[:top_k] - - def _aggregate_image_patches( - self, image_name: str, patches: list[PatchResult] - ) -> AggregatedResult: - """Aggregate patches for a single image.""" - - if self.aggregation_method == "maxsim": - doc_score = max(patch.score for patch in patches) - best_patch = max(patches, key=lambda p: p.score) - - elif self.aggregation_method == "voting": - # Count patches above threshold - threshold = np.percentile([p.score for p in patches], 75) - doc_score = sum(1 for patch in patches if patch.score >= threshold) - best_patch = max(patches, key=lambda p: p.score) - - elif self.aggregation_method == "weighted": - # Weight by attention scores - total_weighted_score = sum(p.score * p.attention_score for p in patches) - total_weights = sum(p.attention_score for p in patches) - doc_score = total_weighted_score / max(total_weights, 1e-8) - best_patch = max(patches, key=lambda p: p.score * p.attention_score) - - elif self.aggregation_method == "mean": - doc_score = np.mean([patch.score for patch in patches]) - best_patch = max(patches, key=lambda p: p.score) - - else: - raise ValueError(f"Unknown aggregation method: {self.aggregation_method}") - - # Spatial clustering if enabled - spatial_clusters = None - if self.spatial_clustering: - spatial_clusters = self._cluster_patches_spatially(patches) - - return AggregatedResult( - image_name=image_name, - image_path=patches[0].image_path, - doc_score=float(doc_score), - patch_count=len(patches), - best_patch=best_patch, - all_patches=sorted(patches, key=lambda p: p.score, reverse=True), - aggregation_method=self.aggregation_method, - spatial_clusters=spatial_clusters, - ) - - def _cluster_patches_spatially(self, patches: list[PatchResult]) -> list[list[PatchResult]]: - """Cluster patches that are spatially close to each other.""" - if len(patches) <= 1: - return [patches] - - clusters = [] - remaining_patches = patches.copy() - - while remaining_patches: - # Start new cluster with highest scoring remaining patch - seed_patch = max(remaining_patches, key=lambda p: p.score) - current_cluster = [seed_patch] - remaining_patches.remove(seed_patch) - - # Add nearby patches to cluster - added_to_cluster = True - while added_to_cluster: - added_to_cluster = False - for patch in remaining_patches.copy(): - if self._is_patch_nearby(patch, current_cluster): - current_cluster.append(patch) - remaining_patches.remove(patch) - added_to_cluster = True - - 
clusters.append(current_cluster) - - return sorted(clusters, key=lambda cluster: max(p.score for p in cluster), reverse=True) - - def _is_patch_nearby(self, patch: PatchResult, cluster: list[PatchResult]) -> bool: - """Check if a patch is spatially close to any patch in the cluster.""" - patch_center = self._get_patch_center(patch.coordinates) - - for cluster_patch in cluster: - cluster_center = self._get_patch_center(cluster_patch.coordinates) - distance = np.sqrt( - (patch_center[0] - cluster_center[0]) ** 2 - + (patch_center[1] - cluster_center[1]) ** 2 - ) - - if distance <= self.cluster_distance_threshold: - return True - - return False - - def _get_patch_center(self, coordinates: tuple[int, int, int, int]) -> tuple[float, float]: - """Get center point of a patch.""" - x1, y1, x2, y2 = coordinates - return ((x1 + x2) / 2, (y1 + y2) / 2) - - def print_aggregated_results( - self, results: list[AggregatedResult], max_patches_per_doc: int = 3 - ): - """Pretty print aggregated results.""" - print(f"\n๐Ÿ” Aggregated Results (method: {self.aggregation_method})") - print("=" * 80) - - for i, result in enumerate(results): - print(f"\n{i + 1}. {result.image_name}") - print(f" Doc Score: {result.doc_score:.4f} | Patches: {result.patch_count}") - print(f" Path: {result.image_path}") - - # Show best patch - best = result.best_patch - print( - f" ๐ŸŒŸ Best Patch: #{best.patch_id} at {best.coordinates} (score: {best.score:.4f})" - ) - - # Show top patches - print(" ๐Ÿ“ Top Patches:") - for j, patch in enumerate(result.all_patches[:max_patches_per_doc]): - print( - f" {j + 1}. Patch #{patch.patch_id}: {patch.score:.4f} at {patch.coordinates}" - ) - - # Show spatial clusters if available - if result.spatial_clusters and len(result.spatial_clusters) > 1: - print(f" ๐Ÿ—‚๏ธ Spatial Clusters: {len(result.spatial_clusters)}") - for j, cluster in enumerate(result.spatial_clusters[:2]): # Show top 2 clusters - cluster_score = max(p.score for p in cluster) - print( - f" Cluster {j + 1}: {len(cluster)} patches (best: {cluster_score:.4f})" - ) - - -def demo_aggregation(): - """Demonstrate the multi-vector aggregation functionality.""" - print("=== Multi-Vector Aggregation Demo ===") - - # Simulate some patch-level search results - # In real usage, these would come from LeannSearcher.search() - - class MockResult: - def __init__(self, score, metadata): - self.score = score - self.metadata = metadata - - # Simulate results for 2 images with multiple patches each - mock_results = [ - # Image 1: cats_and_kitchen.jpg - 4 patches - MockResult( - 0.85, - { - "image_name": "cats_and_kitchen.jpg", - "image_path": "/path/to/cats_and_kitchen.jpg", - "patch_id": 3, - "coordinates": [100, 50, 224, 174], # Kitchen area - "attention_score": 0.92, - "scale": 1.0, - }, - ), - MockResult( - 0.78, - { - "image_name": "cats_and_kitchen.jpg", - "image_path": "/path/to/cats_and_kitchen.jpg", - "patch_id": 7, - "coordinates": [200, 300, 324, 424], # Cat area - "attention_score": 0.88, - "scale": 1.0, - }, - ), - MockResult( - 0.72, - { - "image_name": "cats_and_kitchen.jpg", - "image_path": "/path/to/cats_and_kitchen.jpg", - "patch_id": 12, - "coordinates": [150, 100, 274, 224], # Appliances - "attention_score": 0.75, - "scale": 1.0, - }, - ), - MockResult( - 0.65, - { - "image_name": "cats_and_kitchen.jpg", - "image_path": "/path/to/cats_and_kitchen.jpg", - "patch_id": 15, - "coordinates": [50, 250, 174, 374], # Furniture - "attention_score": 0.70, - "scale": 1.0, - }, - ), - # Image 2: city_street.jpg - 3 patches - MockResult( - 
0.68, - { - "image_name": "city_street.jpg", - "image_path": "/path/to/city_street.jpg", - "patch_id": 2, - "coordinates": [300, 100, 424, 224], # Buildings - "attention_score": 0.80, - "scale": 1.0, - }, - ), - MockResult( - 0.62, - { - "image_name": "city_street.jpg", - "image_path": "/path/to/city_street.jpg", - "patch_id": 8, - "coordinates": [100, 350, 224, 474], # Street level - "attention_score": 0.75, - "scale": 1.0, - }, - ), - MockResult( - 0.55, - { - "image_name": "city_street.jpg", - "image_path": "/path/to/city_street.jpg", - "patch_id": 11, - "coordinates": [400, 200, 524, 324], # Sky area - "attention_score": 0.60, - "scale": 1.0, - }, - ), - ] - - # Test different aggregation methods - methods = ["maxsim", "voting", "weighted", "mean"] - - for method in methods: - print(f"\n{'=' * 20} {method.upper()} AGGREGATION {'=' * 20}") - - aggregator = MultiVectorAggregator( - aggregation_method=method, - spatial_clustering=True, - cluster_distance_threshold=100.0, - ) - - aggregated = aggregator.aggregate_results(mock_results, top_k=5) - aggregator.print_aggregated_results(aggregated) - - -if __name__ == "__main__": - demo_aggregation() diff --git a/examples/openai_hnsw_example.py b/examples/openai_hnsw_example.py deleted file mode 100644 index 9dcbbf8..0000000 --- a/examples/openai_hnsw_example.py +++ /dev/null @@ -1,113 +0,0 @@ -#!/usr/bin/env python3 -""" -OpenAI Embedding Example - -Complete example showing how to build and search with OpenAI embeddings using HNSW backend. -""" - -import os -from pathlib import Path - -import dotenv -from leann.api import LeannBuilder, LeannSearcher - -# Load environment variables -dotenv.load_dotenv() - - -def main(): - # Check if OpenAI API key is available - api_key = os.getenv("OPENAI_API_KEY") - if not api_key: - print("ERROR: OPENAI_API_KEY environment variable not set") - return False - - print(f"โœ… OpenAI API key found: {api_key[:10]}...") - - # Sample texts - sample_texts = [ - "Machine learning is a powerful technology that enables computers to learn from data.", - "Natural language processing helps computers understand and generate human language.", - "Deep learning uses neural networks with multiple layers to solve complex problems.", - "Computer vision allows machines to interpret and understand visual information.", - "Reinforcement learning trains agents to make decisions through trial and error.", - "Data science combines statistics, math, and programming to extract insights from data.", - "Artificial intelligence aims to create machines that can perform human-like tasks.", - "Python is a popular programming language used extensively in data science and AI.", - "Neural networks are inspired by the structure and function of the human brain.", - "Big data refers to extremely large datasets that require special tools to process.", - ] - - INDEX_DIR = Path("./simple_openai_test_index") - INDEX_PATH = str(INDEX_DIR / "simple_test.leann") - - print("\n=== Building Index with OpenAI Embeddings ===") - print(f"Index path: {INDEX_PATH}") - - try: - # Use proper configuration for OpenAI embeddings - builder = LeannBuilder( - backend_name="hnsw", - embedding_model="text-embedding-3-small", - embedding_mode="openai", - # HNSW settings for OpenAI embeddings - M=16, # Smaller graph degree - efConstruction=64, # Smaller construction complexity - is_compact=True, # Enable compact storage for recompute - is_recompute=True, # MUST enable for OpenAI embeddings - num_threads=1, - ) - - print(f"Adding {len(sample_texts)} texts to the index...") - 
for i, text in enumerate(sample_texts): - metadata = {"id": f"doc_{i}", "topic": "AI"} - builder.add_text(text, metadata) - - print("Building index...") - builder.build_index(INDEX_PATH) - print("โœ… Index built successfully!") - - except Exception as e: - print(f"โŒ Error building index: {e}") - import traceback - - traceback.print_exc() - return False - - print("\n=== Testing Search ===") - - try: - searcher = LeannSearcher(INDEX_PATH) - - test_queries = [ - "What is machine learning?", - "How do neural networks work?", - "Programming languages for data science", - ] - - for query in test_queries: - print(f"\n๐Ÿ” Query: '{query}'") - results = searcher.search(query, top_k=3) - - print(f" Found {len(results)} results:") - for i, result in enumerate(results): - print(f" {i + 1}. Score: {result.score:.4f}") - print(f" Text: {result.text[:80]}...") - - print("\nโœ… Search test completed successfully!") - return True - - except Exception as e: - print(f"โŒ Error during search: {e}") - import traceback - - traceback.print_exc() - return False - - -if __name__ == "__main__": - success = main() - if success: - print("\n๐ŸŽ‰ Simple OpenAI index test completed successfully!") - else: - print("\n๐Ÿ’ฅ Simple OpenAI index test failed!") diff --git a/examples/resue_index.py b/examples/resue_index.py deleted file mode 100644 index bec55ab..0000000 --- a/examples/resue_index.py +++ /dev/null @@ -1,23 +0,0 @@ -import asyncio -from pathlib import Path - -from leann.api import LeannChat - -INDEX_DIR = Path("./test_pdf_index_huawei") -INDEX_PATH = str(INDEX_DIR / "pdf_documents.leann") - - -async def main(): - print("\n[PHASE 2] Starting Leann chat session...") - chat = LeannChat(index_path=INDEX_PATH) - query = "What is the main idea of RL and give me 5 exapmle of classic RL algorithms?" - query = "Based on the paper, what are the main techniques LEANN explores to reduce the storage overhead and DLPM explore to achieve Fairness and Efiiciency trade-off?" - # query = "ไป€ไนˆๆ˜ฏ็›˜ๅคๅคงๆจกๅž‹ไปฅๅŠ็›˜ๅคๅผ€ๅ‘่ฟ‡็จ‹ไธญ้‡ๅˆฐไบ†ไป€ไนˆ้˜ดๆš—้ข,ไปปๅŠกไปคไธ€่ˆฌๅœจไป€ไนˆๅŸŽๅธ‚้ขๅ‘" - response = chat.ask( - query, top_k=20, recompute_beighbor_embeddings=True, complexity=32, beam_width=1 - ) - print(f"\n[PHASE 2] Response: {response}") - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/examples/wechat_history_reader_leann.py b/examples/wechat_history_reader_leann.py deleted file mode 100644 index ef68c6b..0000000 --- a/examples/wechat_history_reader_leann.py +++ /dev/null @@ -1,320 +0,0 @@ -import argparse -import asyncio -import os -from pathlib import Path - -import dotenv -from leann.api import LeannBuilder, LeannChat -from llama_index.core.node_parser import SentenceSplitter - -dotenv.load_dotenv() - -# Default WeChat export directory -DEFAULT_WECHAT_EXPORT_DIR = "./wechat_export_direct" - - -def create_leann_index_from_multiple_wechat_exports( - export_dirs: list[Path], - index_path: str = "wechat_history_index.leann", - max_count: int = -1, -): - """ - Create LEANN index from multiple WeChat export data sources. 
-
-    Args:
-        export_dirs: List of Path objects pointing to WeChat export directories
-        index_path: Path to save the LEANN index
-        max_count: Maximum number of chat entries to process per export
-    """
-    print("Creating LEANN index from multiple WeChat export data sources...")
-
-    # Load documents using WeChatHistoryReader from history_data
-    from history_data.wechat_history import WeChatHistoryReader
-
-    reader = WeChatHistoryReader()
-
-    INDEX_DIR = Path(index_path).parent
-
-    if not INDEX_DIR.exists():
-        print("--- Index directory not found, building new index ---")
-        all_documents = []
-        total_processed = 0
-
-        # Process each WeChat export directory
-        for i, export_dir in enumerate(export_dirs):
-            print(f"\nProcessing WeChat export {i + 1}/{len(export_dirs)}: {export_dir}")
-
-            try:
-                documents = reader.load_data(
-                    wechat_export_dir=str(export_dir),
-                    max_count=max_count,
-                    concatenate_messages=True,  # Concatenate messages so each document holds a conversation window
-                )
-                if documents:
-                    print(f"Loaded {len(documents)} chat documents from {export_dir}")
-                    all_documents.extend(documents)
-                    total_processed += len(documents)
-
-                    # Check if we've reached the max count
-                    if max_count > 0 and total_processed >= max_count:
-                        print(f"Reached max count of {max_count} documents")
-                        break
-                else:
-                    print(f"No documents loaded from {export_dir}")
-            except Exception as e:
-                print(f"Error processing {export_dir}: {e}")
-                continue
-
-        if not all_documents:
-            print("No documents loaded from any source. Exiting.")
-            return None
-
-        print(
-            f"\nTotal loaded {len(all_documents)} chat documents from {len(export_dirs)} exports and starting to split them into chunks"
-        )
-
-        # Create text splitter with 192-token chunks
-        text_splitter = SentenceSplitter(chunk_size=192, chunk_overlap=64)
-
-        # Convert Documents to text strings and chunk them
-        all_texts = []
-        for doc in all_documents:
-            # Split the document into chunks
-            nodes = text_splitter.get_nodes_from_documents([doc])
-            for node in nodes:
-                text = (
-                    "[Contact] means the message is from: "
-                    + doc.metadata["contact_name"]
-                    + "\n"
-                    + node.get_content()
-                )
-                all_texts.append(text)
-
-        print(
-            f"Finished splitting {len(all_documents)} documents into {len(all_texts)} text chunks"
-        )
-
-        # Create LEANN index directory
-        print("--- Index directory not found, building new index ---")
-        INDEX_DIR.mkdir(exist_ok=True)
-
-        print("--- Building new LEANN index ---")
-
-        print("\n[PHASE 1] Building Leann index...")
-
-        # Use HNSW backend for better macOS compatibility
-        builder = LeannBuilder(
-            backend_name="hnsw",
-            embedding_model="Qwen/Qwen3-Embedding-0.6B",
-            graph_degree=32,
-            complexity=64,
-            is_compact=True,
-            is_recompute=True,
-            num_threads=1,  # Force single-threaded mode
-        )
-
-        print(f"Adding {len(all_texts)} chat chunks to index...")
-        for chunk_text in all_texts:
-            builder.add_text(chunk_text)
-
-        builder.build_index(index_path)
-        print(f"\nLEANN index built at {index_path}!")
-    else:
-        print(f"--- Using existing index at {INDEX_DIR} ---")
-
-    return index_path
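The builder above ends by writing the index to `index_path`; retrieval against that file does not have to go through `LeannChat`. A minimal sketch of inspecting a freshly built index with `LeannSearcher` (the import mirrors the other examples in this patch; the query string and `top_k` are illustrative):

```python
from leann.api import LeannSearcher


def preview_index(index_path: str, query: str, top_k: int = 3) -> None:
    """Run a plain vector search against a built index and print the hits."""
    searcher = LeannSearcher(index_path)
    results = searcher.search(query, top_k=top_k)
    for rank, res in enumerate(results, 1):
        # Chunks were prefixed with "[Contact] means the message is from: ..."
        # at build time, so the contact name is visible in the raw text.
        print(f"{rank}. score={res.score:.4f} text={res.text[:80]!r}")


# preview_index("wechat_history_index.leann", "weekend travel plans")
```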
-def create_leann_index(
-    export_dir: str | None = None,
-    index_path: str = "wechat_history_index.leann",
-    max_count: int = 1000,
-):
-    """
-    Create LEANN index from WeChat chat history data.
-
-    Args:
-        export_dir: Path to the WeChat export directory (optional, uses default if None)
-        index_path: Path to save the LEANN index
-        max_count: Maximum number of chat entries to process
-    """
-    print("Creating LEANN index from WeChat chat history data...")
-    INDEX_DIR = Path(index_path).parent
-
-    if not INDEX_DIR.exists():
-        print("--- Index directory not found, building new index ---")
-        INDEX_DIR.mkdir(exist_ok=True)
-
-        print("--- Building new LEANN index ---")
-
-        print("\n[PHASE 1] Building Leann index...")
-
-        # Load documents using WeChatHistoryReader from history_data
-        from history_data.wechat_history import WeChatHistoryReader
-
-        reader = WeChatHistoryReader()
-
-        documents = reader.load_data(
-            wechat_export_dir=export_dir,
-            max_count=max_count,
-            concatenate_messages=False,  # Disable concatenation - one message per document
-        )
-
-        if not documents:
-            print("No documents loaded. Exiting.")
-            return None
-
-        print(f"Loaded {len(documents)} chat documents")
-
-        # Create text splitter with 256 chunk size
-        text_splitter = SentenceSplitter(chunk_size=256, chunk_overlap=25)
-
-        # Convert Documents to text strings and chunk them
-        all_texts = []
-        for doc in documents:
-            # Split the document into chunks
-            nodes = text_splitter.get_nodes_from_documents([doc])
-            for node in nodes:
-                all_texts.append(node.get_content())
-
-        print(f"Created {len(all_texts)} text chunks from {len(documents)} documents")
-
-        # Create LEANN index directory
-        print("--- Index directory not found, building new index ---")
-        INDEX_DIR.mkdir(exist_ok=True)
-
-        print("--- Building new LEANN index ---")
-
-        print("\n[PHASE 1] Building Leann index...")
-
-        # Use HNSW backend for better macOS compatibility
-        builder = LeannBuilder(
-            backend_name="hnsw",
-            embedding_model="mlx-community/Qwen3-Embedding-0.6B-4bit-DWQ",  # MLX-optimized model
-            graph_degree=32,
-            complexity=64,
-            is_compact=True,
-            is_recompute=True,
-            num_threads=1,  # Force single-threaded mode
-        )
-
-        print(f"Adding {len(all_texts)} chat chunks to index...")
-        for chunk_text in all_texts:
-            builder.add_text(chunk_text)
-
-        builder.build_index(index_path)
-        print(f"\nLEANN index built at {index_path}!")
-    else:
-        print(f"--- Using existing index at {INDEX_DIR} ---")
-
-    return index_path
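The two builders above use different splitter settings: 192/64 for concatenated conversations, 256/25 for per-message documents. A small sketch for sizing that choice on sample text before committing to a full build (the sample list is hypothetical; `Document` and `SentenceSplitter` are the same llama_index classes used throughout these examples):

```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter


def count_chunks(texts: list[str], chunk_size: int, chunk_overlap: int) -> int:
    """Return how many chunks a splitter configuration produces for the texts."""
    splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = [Document(text=t) for t in texts]
    return sum(len(splitter.get_nodes_from_documents([d])) for d in docs)


# sample = ["..."]  # a few representative exported chats
# count_chunks(sample, 192, 64)  # conversation-level setting
# count_chunks(sample, 256, 25)  # per-message setting
```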
-async def query_leann_index(index_path: str, query: str):
-    """
-    Query the LEANN index.
-
-    Args:
-        index_path: Path to the LEANN index
-        query: The query string
-    """
-    print("\n[PHASE 2] Starting Leann chat session...")
-    chat = LeannChat(index_path=index_path)
-
-    print(f"You: {query}")
-    chat_response = chat.ask(
-        query,
-        top_k=20,
-        recompute_beighbor_embeddings=True,
-        complexity=16,
-        beam_width=1,
-        llm_config={
-            "type": "openai",
-            "model": "gpt-4o",
-            "api_key": os.getenv("OPENAI_API_KEY"),
-        },
-        llm_kwargs={"temperature": 0.0, "max_tokens": 1000},
-    )
-    print(f"Leann chat response: \033[36m{chat_response}\033[0m")
-
-
-async def main():
-    """Main function with integrated WeChat export functionality."""
-
-    # Parse command line arguments
-    parser = argparse.ArgumentParser(
-        description="LEANN WeChat History Reader - Create and query WeChat chat history index"
-    )
-    parser.add_argument(
-        "--export-dir",
-        type=str,
-        default=DEFAULT_WECHAT_EXPORT_DIR,
-        help=f"Directory to store WeChat exports (default: {DEFAULT_WECHAT_EXPORT_DIR})",
-    )
-    parser.add_argument(
-        "--index-dir",
-        type=str,
-        default="./wechat_history_magic_test_11Debug_new",
-        help="Directory to store the LEANN index (default: ./wechat_history_magic_test_11Debug_new)",
-    )
-    parser.add_argument(
-        "--max-entries",
-        type=int,
-        default=50,
-        help="Maximum number of chat entries to process (default: 50)",
-    )
-    parser.add_argument(
-        "--query",
-        type=str,
-        default=None,
-        help="Single query to run (default: runs example queries)",
-    )
-    parser.add_argument(
-        "--force-export",
-        action="store_true",
-        default=False,
-        help="Force re-export of WeChat data even if exports exist",
-    )
-
-    args = parser.parse_args()
-
-    INDEX_DIR = Path(args.index_dir)
-    INDEX_PATH = str(INDEX_DIR / "wechat_history.leann")
-
-    print(f"Using WeChat export directory: {args.export_dir}")
-    print(f"Index directory: {INDEX_DIR}")
-    print(f"Max entries: {args.max_entries}")
-
-    # Initialize WeChat reader with export capabilities
-    from history_data.wechat_history import WeChatHistoryReader
-
-    reader = WeChatHistoryReader()
-
-    # Find existing exports or create new ones using the centralized method
-    export_dirs = reader.find_or_export_wechat_data(args.export_dir)
-    if not export_dirs:
-        print("Failed to find or export WeChat data.
Exiting.") - return - - # Create or load the LEANN index from all sources - index_path = create_leann_index_from_multiple_wechat_exports( - export_dirs, INDEX_PATH, max_count=args.max_entries - ) - - if index_path: - if args.query: - # Run single query - await query_leann_index(index_path, args.query) - else: - # Example queries - queries = [ - "ๆˆ‘ๆƒณไนฐ้ญ”ๆœฏๅธˆ็บฆ็ฟฐ้€Š็š„็ƒ่กฃ,็ป™ๆˆ‘ไธ€ไบ›ๅฏนๅบ”่Šๅคฉ่ฎฐๅฝ•?", - ] - - for query in queries: - print("\n" + "=" * 60) - await query_leann_index(index_path, query) - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/packages/leann-core/src/leann/api.py b/packages/leann-core/src/leann/api.py index 9efefde..39710df 100644 --- a/packages/leann-core/src/leann/api.py +++ b/packages/leann-core/src/leann/api.py @@ -459,7 +459,14 @@ class LeannSearcher: self.meta_path_str = f"{index_path}.meta.json" if not Path(self.meta_path_str).exists(): - raise FileNotFoundError(f"Leann metadata file not found at {self.meta_path_str}") + parent_dir = Path(index_path).parent + print( + f"Leann metadata file not found at {self.meta_path_str}, and you may need to rm -rf {parent_dir}" + ) + # highlight in red the filenotfound error + raise FileNotFoundError( + f"Leann metadata file not found at {self.meta_path_str}, \033[91m you may need to rm -rf {parent_dir}\033[0m" + ) with open(self.meta_path_str, encoding="utf-8") as f: self.meta_data = json.load(f) backend_name = self.meta_data["backend_name"] @@ -493,6 +500,16 @@ class LeannSearcher: logger.info(f" Top_k: {top_k}") logger.info(f" Additional kwargs: {kwargs}") + # Smart top_k detection and adjustment + total_docs = len(self.passage_manager.global_offset_map) + original_top_k = top_k + if top_k > total_docs: + top_k = total_docs + logger.warning( + f" โš ๏ธ Requested top_k ({original_top_k}) exceeds total documents ({total_docs})" + ) + logger.warning(f" โœ… Auto-adjusted top_k to {top_k} to match available documents") + zmq_port = None start_time = time.time() diff --git a/pyproject.toml b/pyproject.toml index 906593e..d3b42e3 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -134,6 +134,14 @@ dev = [ "ruff>=0.12.4", ] +[tool.lychee] +accept = ["200", "403", "429", "503"] +timeout = 20 +max_retries = 2 +exclude = ["localhost", "127.0.0.1", "example.com"] +exclude_path = [".git/", ".venv/", "__pycache__/", "third_party/"] +scheme = ["https", "http"] + [tool.pytest.ini_options] testpaths = ["tests"] python_files = ["test_*.py"] diff --git a/test/mail_reader_llamaindex.py b/test/mail_reader_llamaindex.py deleted file mode 100644 index 33fa1e9..0000000 --- a/test/mail_reader_llamaindex.py +++ /dev/null @@ -1,161 +0,0 @@ -import email -import os -from typing import Any - -from llama_index.core import Document, VectorStoreIndex -from llama_index.core.readers.base import BaseReader - - -class EmlxReader(BaseReader): - """ - Apple Mail .emlx file reader. - - Reads individual .emlx files from Apple Mail's storage format. - """ - - def __init__(self) -> None: - """Initialize.""" - pass - - def load_data(self, input_dir: str, **load_kwargs: Any) -> list[Document]: - """ - Load data from the input directory containing .emlx files. - - Args: - input_dir: Directory containing .emlx files - **load_kwargs: - max_count (int): Maximum amount of messages to read. 
- """ - docs: list[Document] = [] - max_count = load_kwargs.get("max_count", 1000) - count = 0 - - # Walk through the directory recursively - for dirpath, dirnames, filenames in os.walk(input_dir): - # Skip hidden directories - dirnames[:] = [d for d in dirnames if not d.startswith(".")] - - for filename in filenames: - if count >= max_count: - break - - if filename.endswith(".emlx"): - filepath = os.path.join(dirpath, filename) - try: - # Read the .emlx file - with open(filepath, encoding="utf-8", errors="ignore") as f: - content = f.read() - - # .emlx files have a length prefix followed by the email content - # The first line contains the length, followed by the email - lines = content.split("\n", 1) - if len(lines) >= 2: - email_content = lines[1] - - # Parse the email using Python's email module - try: - msg = email.message_from_string(email_content) - - # Extract email metadata - subject = msg.get("Subject", "No Subject") - from_addr = msg.get("From", "Unknown") - to_addr = msg.get("To", "Unknown") - date = msg.get("Date", "Unknown") - - # Extract email body - body = "" - if msg.is_multipart(): - for part in msg.walk(): - if ( - part.get_content_type() == "text/plain" - or part.get_content_type() == "text/html" - ): - body += part.get_payload(decode=True).decode( - "utf-8", errors="ignore" - ) - # break - else: - body = msg.get_payload(decode=True).decode( - "utf-8", errors="ignore" - ) - - # Create document content - doc_content = f""" -From: {from_addr} -To: {to_addr} -Subject: {subject} -Date: {date} - -{body} -""" - - # Create metadata - metadata = { - "file_path": filepath, - "subject": subject, - "from": from_addr, - "to": to_addr, - "date": date, - "filename": filename, - } - if count == 0: - print("--------------------------------") - print("dir path", dirpath) - print(metadata) - print(doc_content) - print("--------------------------------") - body = [] - if msg.is_multipart(): - for part in msg.walk(): - print( - "-------------------------------- get content type -------------------------------" - ) - print(part.get_content_type()) - print(part) - # body.append(part.get_payload(decode=True).decode('utf-8', errors='ignore')) - print( - "-------------------------------- get content type -------------------------------" - ) - else: - body = msg.get_payload(decode=True).decode( - "utf-8", errors="ignore" - ) - print(body) - - print(body) - print("--------------------------------") - doc = Document(text=doc_content, metadata=metadata) - docs.append(doc) - count += 1 - - except Exception as e: - print(f"!!!!!!! Error parsing email from {filepath}: {e} !!!!!!!!") - continue - - except Exception as e: - print(f"!!!!!!! Error reading file !!!!!!!! 
{filepath}: {e}") - continue - - print(f"Loaded {len(docs)} email documents") - return docs - - -# Use the custom EmlxReader instead of MboxReader -documents = EmlxReader().load_data( - "/Users/yichuan/Library/Mail/V10/0FCA0879-FD8C-4B7E-83BF-FDDA930791C5/[Gmail].mbox/All Mail.mbox/78BA5BE1-8819-4F9A-9613-EB63772F1DD0/Data/9/Messages", - max_count=1000, -) # Returns list of documents - -# Configure the index with larger chunk size to handle long metadata -from llama_index.core.node_parser import SentenceSplitter - -# Create a custom text splitter with larger chunk size -text_splitter = SentenceSplitter(chunk_size=2048, chunk_overlap=200) - -index = VectorStoreIndex.from_documents( - documents, transformations=[text_splitter] -) # Initialize index with documents - -query_engine = index.as_query_engine() -res = query_engine.query("Hows Berkeley Graduate Student Instructor") -print(res) diff --git a/test/mail_reader_save_load.py b/test/mail_reader_save_load.py deleted file mode 100644 index e7fb39f..0000000 --- a/test/mail_reader_save_load.py +++ /dev/null @@ -1,219 +0,0 @@ -import email -import os -from typing import Any - -from llama_index.core import Document, StorageContext, VectorStoreIndex -from llama_index.core.node_parser import SentenceSplitter -from llama_index.core.readers.base import BaseReader - - -class EmlxReader(BaseReader): - """ - Apple Mail .emlx file reader. - - Reads individual .emlx files from Apple Mail's storage format. - """ - - def __init__(self) -> None: - """Initialize.""" - pass - - def load_data(self, input_dir: str, **load_kwargs: Any) -> list[Document]: - """ - Load data from the input directory containing .emlx files. - - Args: - input_dir: Directory containing .emlx files - **load_kwargs: - max_count (int): Maximum amount of messages to read. 
- """ - docs: list[Document] = [] - max_count = load_kwargs.get("max_count", 1000) - count = 0 - - # Walk through the directory recursively - for dirpath, dirnames, filenames in os.walk(input_dir): - # Skip hidden directories - dirnames[:] = [d for d in dirnames if not d.startswith(".")] - - for filename in filenames: - if count >= max_count: - break - - if filename.endswith(".emlx"): - filepath = os.path.join(dirpath, filename) - try: - # Read the .emlx file - with open(filepath, encoding="utf-8", errors="ignore") as f: - content = f.read() - - # .emlx files have a length prefix followed by the email content - # The first line contains the length, followed by the email - lines = content.split("\n", 1) - if len(lines) >= 2: - email_content = lines[1] - - # Parse the email using Python's email module - try: - msg = email.message_from_string(email_content) - - # Extract email metadata - subject = msg.get("Subject", "No Subject") - from_addr = msg.get("From", "Unknown") - to_addr = msg.get("To", "Unknown") - date = msg.get("Date", "Unknown") - - # Extract email body - body = "" - if msg.is_multipart(): - for part in msg.walk(): - if part.get_content_type() == "text/plain": - body = part.get_payload(decode=True).decode( - "utf-8", errors="ignore" - ) - break - else: - body = msg.get_payload(decode=True).decode( - "utf-8", errors="ignore" - ) - - # Create document content - doc_content = f""" -From: {from_addr} -To: {to_addr} -Subject: {subject} -Date: {date} - -{body} -""" - - # Create metadata - metadata = { - "file_path": filepath, - "subject": subject, - "from": from_addr, - "to": to_addr, - "date": date, - "filename": filename, - } - - doc = Document(text=doc_content, metadata=metadata) - docs.append(doc) - count += 1 - - except Exception as e: - print(f"Error parsing email from {filepath}: {e}") - continue - - except Exception as e: - print(f"Error reading file {filepath}: {e}") - continue - - print(f"Loaded {len(docs)} email documents") - return docs - - -def create_and_save_index(mail_path: str, save_dir: str = "mail_index", max_count: int = 1000): - """ - Create the index from mail data and save it to disk. - - Args: - mail_path: Path to the mail directory - save_dir: Directory to save the index - max_count: Maximum number of emails to process - """ - print("Creating index from mail data...") - - # Load documents - documents = EmlxReader().load_data(mail_path, max_count=max_count) - - if not documents: - print("No documents loaded. Exiting.") - return None - - # Create text splitter - text_splitter = SentenceSplitter(chunk_size=256, chunk_overlap=0) - - # Create index - index = VectorStoreIndex.from_documents(documents, transformations=[text_splitter]) - - # Save the index - os.makedirs(save_dir, exist_ok=True) - index.storage_context.persist(persist_dir=save_dir) - print(f"Index saved to {save_dir}") - - return index - - -def load_index(save_dir: str = "mail_index"): - """ - Load the saved index from disk. - - Args: - save_dir: Directory where the index is saved - - Returns: - Loaded index or None if loading fails - """ - try: - # Load storage context - storage_context = StorageContext.from_defaults(persist_dir=save_dir) - - # Load index - index = VectorStoreIndex.from_vector_store( - storage_context.vector_store, storage_context=storage_context - ) - - print(f"Index loaded from {save_dir}") - return index - - except Exception as e: - print(f"Error loading index: {e}") - return None - - -def query_index(index, query: str): - """ - Query the loaded index. 
- - Args: - index: The loaded index - query: The query string - """ - if index is None: - print("No index available for querying.") - return - - query_engine = index.as_query_engine() - response = query_engine.query(query) - print(f"Query: {query}") - print(f"Response: {response}") - - -def main(): - mail_path = "/Users/yichuan/Library/Mail/V10/0FCA0879-FD8C-4B7E-83BF-FDDA930791C5/[Gmail].mbox/All Mail.mbox/78BA5BE1-8819-4F9A-9613-EB63772F1DD0/Data/9/Messages" - save_dir = "mail_index" - - # Check if index already exists - if os.path.exists(save_dir) and os.path.exists(os.path.join(save_dir, "vector_store.json")): - print("Loading existing index...") - index = load_index(save_dir) - else: - print("Creating new index...") - index = create_and_save_index(mail_path, save_dir, max_count=1000) - - if index: - # Example queries - queries = [ - "Hows Berkeley Graduate Student Instructor", - "What emails mention GSR appointments?", - "Find emails about deadlines", - ] - - for query in queries: - print("\n" + "=" * 50) - query_index(index, query) - - -if __name__ == "__main__": - main() diff --git a/test/mail_reader_small_chunks.py b/test/mail_reader_small_chunks.py deleted file mode 100644 index 50bd452..0000000 --- a/test/mail_reader_small_chunks.py +++ /dev/null @@ -1,219 +0,0 @@ -import email -import os -from typing import Any - -from llama_index.core import Document, StorageContext, VectorStoreIndex -from llama_index.core.node_parser import SentenceSplitter -from llama_index.core.readers.base import BaseReader - - -class EmlxReader(BaseReader): - """ - Apple Mail .emlx file reader with reduced metadata. - - Reads individual .emlx files from Apple Mail's storage format. - """ - - def __init__(self) -> None: - """Initialize.""" - pass - - def load_data(self, input_dir: str, **load_kwargs: Any) -> list[Document]: - """ - Load data from the input directory containing .emlx files. - - Args: - input_dir: Directory containing .emlx files - **load_kwargs: - max_count (int): Maximum amount of messages to read. 
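One caveat in the deleted loader above: VectorStoreIndex.from_vector_store rebuilds the index from the vector store alone, and the default simple vector store does not persist node text, so the reloaded index may come back without the original documents. LlamaIndex's documented reload path is load_index_from_storage, which restores the docstore as well. A short sketch under that assumption (standalone, not part of the patch):

    from llama_index.core import StorageContext, load_index_from_storage

    # Reload everything persisted by index.storage_context.persist(persist_dir=...),
    # including the docstore, not just the vector store.
    storage_context = StorageContext.from_defaults(persist_dir="mail_index")
    index = load_index_from_storage(storage_context)
    query_engine = index.as_query_engine()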
- """ - docs: list[Document] = [] - max_count = load_kwargs.get("max_count", 1000) - count = 0 - - # Walk through the directory recursively - for dirpath, dirnames, filenames in os.walk(input_dir): - # Skip hidden directories - dirnames[:] = [d for d in dirnames if not d.startswith(".")] - - for filename in filenames: - if count >= max_count: - break - - if filename.endswith(".emlx"): - filepath = os.path.join(dirpath, filename) - try: - # Read the .emlx file - with open(filepath, encoding="utf-8", errors="ignore") as f: - content = f.read() - - # .emlx files have a length prefix followed by the email content - # The first line contains the length, followed by the email - lines = content.split("\n", 1) - if len(lines) >= 2: - email_content = lines[1] - - # Parse the email using Python's email module - try: - msg = email.message_from_string(email_content) - - # Extract email metadata - subject = msg.get("Subject", "No Subject") - from_addr = msg.get("From", "Unknown") - to_addr = msg.get("To", "Unknown") - date = msg.get("Date", "Unknown") - - # Extract email body - body = "" - if msg.is_multipart(): - for part in msg.walk(): - if part.get_content_type() == "text/plain": - body = part.get_payload(decode=True).decode( - "utf-8", errors="ignore" - ) - break - else: - body = msg.get_payload(decode=True).decode( - "utf-8", errors="ignore" - ) - - # Create document content with metadata embedded in text - doc_content = f""" -From: {from_addr} -To: {to_addr} -Subject: {subject} -Date: {date} - -{body} -""" - - # Create minimal metadata (only essential info) - metadata = { - "subject": subject[:50], # Truncate subject - "from": from_addr[:30], # Truncate from - "date": date[:20], # Truncate date - "filename": filename, # Keep filename - } - - doc = Document(text=doc_content, metadata=metadata) - docs.append(doc) - count += 1 - - except Exception as e: - print(f"Error parsing email from {filepath}: {e}") - continue - - except Exception as e: - print(f"Error reading file {filepath}: {e}") - continue - - print(f"Loaded {len(docs)} email documents") - return docs - - -def create_and_save_index( - mail_path: str, save_dir: str = "mail_index_small", max_count: int = 1000 -): - """ - Create the index from mail data and save it to disk. - - Args: - mail_path: Path to the mail directory - save_dir: Directory to save the index - max_count: Maximum number of emails to process - """ - print("Creating index from mail data with small chunks...") - - # Load documents - documents = EmlxReader().load_data(mail_path, max_count=max_count) - - if not documents: - print("No documents loaded. Exiting.") - return None - - # Create text splitter with small chunk size - text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50) - - # Create index - index = VectorStoreIndex.from_documents(documents, transformations=[text_splitter]) - - # Save the index - os.makedirs(save_dir, exist_ok=True) - index.storage_context.persist(persist_dir=save_dir) - print(f"Index saved to {save_dir}") - - return index - - -def load_index(save_dir: str = "mail_index_small"): - """ - Load the saved index from disk. 
- - Args: - save_dir: Directory where the index is saved - - Returns: - Loaded index or None if loading fails - """ - try: - # Load storage context - storage_context = StorageContext.from_defaults(persist_dir=save_dir) - - # Load index - index = VectorStoreIndex.from_vector_store( - storage_context.vector_store, storage_context=storage_context - ) - - print(f"Index loaded from {save_dir}") - return index - - except Exception as e: - print(f"Error loading index: {e}") - return None - - -def query_index(index, query: str): - """ - Query the loaded index. - - Args: - index: The loaded index - query: The query string - """ - if index is None: - print("No index available for querying.") - return - - query_engine = index.as_query_engine() - response = query_engine.query(query) - print(f"Query: {query}") - print(f"Response: {response}") - - -def main(): - mail_path = "/Users/yichuan/Library/Mail/V10/0FCA0879-FD8C-4B7E-83BF-FDDA930791C5/[Gmail].mbox/All Mail.mbox/78BA5BE1-8819-4F9A-9613-EB63772F1DD0/Data/9/Messages" - save_dir = "mail_index_small" - - # Check if index already exists - if os.path.exists(save_dir) and os.path.exists(os.path.join(save_dir, "vector_store.json")): - print("Loading existing index...") - index = load_index(save_dir) - else: - print("Creating new index...") - index = create_and_save_index(mail_path, save_dir, max_count=1000) - - if index: - # Example queries - queries = [ - "Hows Berkeley Graduate Student Instructor", - "What emails mention GSR appointments?", - "Find emails about deadlines", - ] - - for query in queries: - print("\n" + "=" * 50) - query_index(index, query) - - -if __name__ == "__main__": - main() diff --git a/test/mail_reader_test.py b/test/mail_reader_test.py deleted file mode 100644 index f94070a..0000000 --- a/test/mail_reader_test.py +++ /dev/null @@ -1,154 +0,0 @@ -import email -import os -from typing import Any - -from llama_index.core import Document, VectorStoreIndex -from llama_index.core.readers.base import BaseReader - - -class EmlxReader(BaseReader): - """ - Apple Mail .emlx file reader. - - Reads individual .emlx files from Apple Mail's storage format. - """ - - def __init__(self) -> None: - """Initialize.""" - pass - - def load_data(self, input_dir: str, **load_kwargs: Any) -> list[Document]: - """ - Load data from the input directory containing .emlx files. - - Args: - input_dir: Directory containing .emlx files - **load_kwargs: - max_count (int): Maximum amount of messages to read. 
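The three deleted readers differ mostly in their splitter settings: 2048/200 in the llamaindex variant, 256/0 in save_load, and 512/50 here. A quick, self-contained way to see what those settings do to node counts (the sample text is synthetic, purely for illustration):

    from llama_index.core.node_parser import SentenceSplitter

    text = "This is a sentence about email indexing. " * 2000  # synthetic body
    for chunk_size, chunk_overlap in [(2048, 200), (512, 50), (256, 0)]:
        splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        # Smaller chunks and larger overlap both increase the chunk count.
        print(chunk_size, chunk_overlap, len(splitter.split_text(text)))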
- """ - docs: list[Document] = [] - max_count = load_kwargs.get("max_count", 1000) - count = 0 - - # Check if directory exists and is accessible - if not os.path.exists(input_dir): - print(f"Error: Directory '{input_dir}' does not exist") - return docs - - if not os.access(input_dir, os.R_OK): - print(f"Error: Directory '{input_dir}' is not accessible (permission denied)") - print("This is likely due to macOS security restrictions on Mail app data") - return docs - - print(f"Scanning directory: {input_dir}") - - # Walk through the directory recursively - for dirpath, dirnames, filenames in os.walk(input_dir): - # Skip hidden directories - dirnames[:] = [d for d in dirnames if not d.startswith(".")] - - for filename in filenames: - if count >= max_count: - break - - if filename.endswith(".emlx"): - filepath = os.path.join(dirpath, filename) - print(f"Found .emlx file: {filepath}") - try: - # Read the .emlx file - with open(filepath, encoding="utf-8", errors="ignore") as f: - content = f.read() - - # .emlx files have a length prefix followed by the email content - # The first line contains the length, followed by the email - lines = content.split("\n", 1) - if len(lines) >= 2: - email_content = lines[1] - - # Parse the email using Python's email module - try: - msg = email.message_from_string(email_content) - - # Extract email metadata - subject = msg.get("Subject", "No Subject") - from_addr = msg.get("From", "Unknown") - to_addr = msg.get("To", "Unknown") - date = msg.get("Date", "Unknown") - - # Extract email body - body = "" - if msg.is_multipart(): - for part in msg.walk(): - if part.get_content_type() == "text/plain": - body = part.get_payload(decode=True).decode( - "utf-8", errors="ignore" - ) - break - else: - body = msg.get_payload(decode=True).decode( - "utf-8", errors="ignore" - ) - - # Create document content - doc_content = f""" -From: {from_addr} -To: {to_addr} -Subject: {subject} -Date: {date} - -{body} -""" - - # Create metadata - metadata = { - "file_path": filepath, - "subject": subject, - "from": from_addr, - "to": to_addr, - "date": date, - "filename": filename, - } - - doc = Document(text=doc_content, metadata=metadata) - docs.append(doc) - count += 1 - - except Exception as e: - print(f"Error parsing email from {filepath}: {e}") - continue - - except Exception as e: - print(f"Error reading file {filepath}: {e}") - continue - - print(f"Loaded {len(docs)} email documents") - return docs - - -def main(): - # Use the current directory where the sample.emlx file is located - current_dir = os.path.dirname(os.path.abspath(__file__)) - - print("Testing EmlxReader with sample .emlx file...") - print(f"Scanning directory: {current_dir}") - - # Use the custom EmlxReader - documents = EmlxReader().load_data(current_dir, max_count=1000) - - if not documents: - print("No documents loaded. 
Make sure sample.emlx exists in the examples directory.") - return - - print(f"\nSuccessfully loaded {len(documents)} document(s)") - - # Initialize index with documents - index = VectorStoreIndex.from_documents(documents) - query_engine = index.as_query_engine() - - print("\nTesting query: 'Hows Berkeley Graduate Student Instructor'") - res = query_engine.query("Hows Berkeley Graduate Student Instructor") - print(f"Response: {res}") - - -if __name__ == "__main__": - main() diff --git a/test/query_saved_index.py b/test/query_saved_index.py deleted file mode 100644 index dfd3295..0000000 --- a/test/query_saved_index.py +++ /dev/null @@ -1,105 +0,0 @@ -import os - -from llama_index.core import StorageContext, VectorStoreIndex - - -def load_index(save_dir: str = "mail_index"): - """ - Load the saved index from disk. - - Args: - save_dir: Directory where the index is saved - - Returns: - Loaded index or None if loading fails - """ - try: - # Load storage context - storage_context = StorageContext.from_defaults(persist_dir=save_dir) - - # Load index - index = VectorStoreIndex.from_vector_store( - storage_context.vector_store, storage_context=storage_context - ) - - print(f"Index loaded from {save_dir}") - return index - - except Exception as e: - print(f"Error loading index: {e}") - return None - - -def query_index(index, query: str): - """ - Query the loaded index. - - Args: - index: The loaded index - query: The query string - """ - if index is None: - print("No index available for querying.") - return - - query_engine = index.as_query_engine() - response = query_engine.query(query) - print(f"\nQuery: {query}") - print(f"Response: {response}") - - -def main(): - save_dir = "mail_index" - - # Check if index exists - if not os.path.exists(save_dir) or not os.path.exists( - os.path.join(save_dir, "vector_store.json") - ): - print(f"Index not found in {save_dir}") - print("Please run mail_reader_save_load.py first to create the index.") - return - - # Load the index - index = load_index(save_dir) - - if not index: - print("Failed to load index.") - return - - print("\n" + "=" * 60) - print("Email Query Interface") - print("=" * 60) - print("Type 'quit' to exit") - print("Type 'help' for example queries") - print("=" * 60) - - # Interactive query loop - while True: - try: - query = input("\nEnter your query: ").strip() - - if query.lower() == "quit": - print("Goodbye!") - break - elif query.lower() == "help": - print("\nExample queries:") - print("- Hows Berkeley Graduate Student Instructor") - print("- What emails mention GSR appointments?") - print("- Find emails about deadlines") - print("- Search for emails from specific sender") - print("- Find emails about meetings") - continue - elif not query: - continue - - query_index(index, query) - - except KeyboardInterrupt: - print("\nGoodbye!") - break - except Exception as e: - print(f"Error processing query: {e}") - - -if __name__ == "__main__": - main() diff --git a/test/sanity_checks/debug_zmq_issue.py b/test/sanity_checks/debug_zmq_issue.py deleted file mode 100644 index d1bd156..0000000 --- a/test/sanity_checks/debug_zmq_issue.py +++ /dev/null @@ -1,117 +0,0 @@ -#!/usr/bin/env python3 -""" -Debug script to test ZMQ communication with the exact same setup as main_cli_example.py -""" - -import sys -import time - -import zmq - -sys.path.append("packages/leann-backend-diskann") -from leann_backend_diskann import embedding_pb2 - - -def test_zmq_with_same_model(): - print("=== Testing ZMQ with same model as main_cli_example.py ===") - - # Test the 
diff --git a/test/sanity_checks/debug_zmq_issue.py b/test/sanity_checks/debug_zmq_issue.py
deleted file mode 100644
index d1bd156..0000000
--- a/test/sanity_checks/debug_zmq_issue.py
+++ /dev/null
@@ -1,117 +0,0 @@
-#!/usr/bin/env python3
-"""
-Debug script to test ZMQ communication with the exact same setup as main_cli_example.py
-"""
-
-import sys
-import time
-
-import zmq
-
-sys.path.append("packages/leann-backend-diskann")
-from leann_backend_diskann import embedding_pb2
-
-
-def test_zmq_with_same_model():
-    print("=== Testing ZMQ with same model as main_cli_example.py ===")
-
-    # Test the exact same model that main_cli_example.py uses
-    model_name = "sentence-transformers/all-mpnet-base-v2"
-
-    # Start server with the same model
-    import subprocess
-
-    server_cmd = [
-        sys.executable,
-        "-m",
-        "packages.leann-backend-diskann.leann_backend_diskann.embedding_server",
-        "--zmq-port",
-        "5556",  # Use different port to avoid conflicts
-        "--model-name",
-        model_name,
-    ]
-
-    print(f"Starting server with command: {' '.join(server_cmd)}")
-    server_process = subprocess.Popen(
-        server_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
-    )
-
-    # Wait for server to start
-    print("Waiting for server to start...")
-    time.sleep(10)
-
-    # Check if server is running
-    if server_process.poll() is not None:
-        stdout, stderr = server_process.communicate()
-        print(f"Server failed to start. stdout: {stdout}")
-        print(f"Server failed to start. stderr: {stderr}")
-        return False
-
-    print(f"Server started with PID: {server_process.pid}")
-
-    try:
-        # Test client
-        context = zmq.Context()
-        socket = context.socket(zmq.REQ)
-        socket.connect("tcp://127.0.0.1:5556")
-        socket.setsockopt(zmq.RCVTIMEO, 30000)  # 30 second timeout like C++
-        socket.setsockopt(zmq.SNDTIMEO, 30000)
-
-        # Create request with same format as C++
-        request = embedding_pb2.NodeEmbeddingRequest()
-        request.node_ids.extend([0, 1, 2, 3, 4])  # Test with some node IDs
-
-        print(f"Sending request with {len(request.node_ids)} node IDs...")
-        start_time = time.time()
-
-        # Send request
-        socket.send(request.SerializeToString())
-
-        # Receive response
-        response_data = socket.recv()
-        end_time = time.time()
-
-        print(f"Received response in {end_time - start_time:.3f} seconds")
-        print(f"Response size: {len(response_data)} bytes")
-
-        # Parse response
-        response = embedding_pb2.NodeEmbeddingResponse()
-        response.ParseFromString(response_data)
-
-        print(f"Response dimensions: {list(response.dimensions)}")
-        print(f"Embeddings data size: {len(response.embeddings_data)} bytes")
-        print(f"Missing IDs: {list(response.missing_ids)}")
-
-        # Calculate expected size
-        if len(response.dimensions) == 2:
-            batch_size = response.dimensions[0]
-            embedding_dim = response.dimensions[1]
-            expected_bytes = batch_size * embedding_dim * 4  # 4 bytes per float
-            print(f"Expected bytes: {expected_bytes}, Actual: {len(response.embeddings_data)}")
-
-            if len(response.embeddings_data) == expected_bytes:
-                print("โœ… Response format is correct!")
-                return True
-            else:
-                print("โŒ Response format mismatch!")
-                return False
-        else:
-            print("โŒ Invalid response dimensions!")
-            return False
-
-    except Exception as e:
-        print(f"โŒ Error during ZMQ test: {e}")
-        return False
-    finally:
-        # Clean up
-        server_process.terminate()
-        server_process.wait()
-        print("Server terminated")
-
-
-if __name__ == "__main__":
-    success = test_zmq_with_same_model()
-    if success:
-        print("\nโœ… ZMQ communication test passed!")
-    else:
-        print("\nโŒ ZMQ communication test failed!")
diff --git a/tests/README.md b/tests/README.md
index 06274fd..22822bd 100644
--- a/tests/README.md
+++ b/tests/README.md
@@ -18,8 +18,8 @@ Basic functionality tests that verify:
 - Basic index building and searching works for both HNSW and DiskANN backends
 - Uses parametrized tests to test both backends
 
-### `test_main_cli.py`
-Tests the main CLI example functionality:
+### `test_document_rag.py`
+Tests the document RAG example functionality:
 - Tests with facebook/contriever embeddings
 - Tests with OpenAI embeddings (if API key is available)
 - Tests error handling with invalid parameters
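A note on the deleted debug script above: it sets RCVTIMEO/SNDTIMEO on a plain REQ socket, but once a REQ socket times out mid round trip it is stuck in the wrong send/recv state and must be discarded. The usual pattern is to poll for the reply and rebuild the socket on timeout; a standalone sketch (endpoint and payload are placeholders, not part of the patch):

    import zmq

    def request_once(endpoint: str, payload: bytes, timeout_ms: int = 30000) -> bytes | None:
        """One REQ/REP round trip; returns None on timeout instead of wedging the socket."""
        ctx = zmq.Context.instance()
        sock = ctx.socket(zmq.REQ)
        sock.setsockopt(zmq.LINGER, 0)  # don't block process exit on unsent data
        sock.connect(endpoint)
        try:
            sock.send(payload)
            if sock.poll(timeout_ms, zmq.POLLIN):
                return sock.recv()
            return None  # timed out; the caller retries with a fresh socket
        finally:
            sock.close()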
diff --git a/tests/test_ci_minimal.py b/tests/test_ci_minimal.py
index 4207802..b884cbe 100644
--- a/tests/test_ci_minimal.py
+++ b/tests/test_ci_minimal.py
@@ -20,7 +20,7 @@ def test_package_imports():
 def test_cli_help():
     """Test that CLI example shows help."""
     result = subprocess.run(
-        [sys.executable, "examples/main_cli_example.py", "--help"], capture_output=True, text=True
+        [sys.executable, "apps/document_rag.py", "--help"], capture_output=True, text=True
     )
 
     assert result.returncode == 0
diff --git a/tests/test_main_cli.py b/tests/test_document_rag.py
similarity index 82%
rename from tests/test_main_cli.py
rename to tests/test_document_rag.py
index 4eb0e9f..97c5700 100644
--- a/tests/test_main_cli.py
+++ b/tests/test_document_rag.py
@@ -1,5 +1,5 @@
 """
-Test main_cli_example functionality using pytest.
+Test document_rag functionality using pytest.
 """
 
 import os
@@ -14,20 +14,20 @@ import pytest
 @pytest.fixture
 def test_data_dir():
     """Return the path to test data directory."""
-    return Path("examples/data")
+    return Path("data")
 
 
 @pytest.mark.skipif(
     os.environ.get("CI") == "true", reason="Skip model tests in CI to avoid MPS memory issues"
 )
-def test_main_cli_simulated(test_data_dir):
-    """Test main_cli with simulated LLM."""
+def test_document_rag_simulated(test_data_dir):
+    """Test document_rag with simulated LLM."""
     with tempfile.TemporaryDirectory() as temp_dir:
         # Use a subdirectory that doesn't exist yet to force index creation
         index_dir = Path(temp_dir) / "test_index"
         cmd = [
             sys.executable,
-            "examples/main_cli_example.py",
+            "apps/document_rag.py",
             "--llm",
             "simulated",
             "--embedding-model",
@@ -53,19 +53,19 @@ def test_main_cli_simulated(test_data_dir):
 
     # Verify output
     output = result.stdout + result.stderr
-    assert "Leann index built at" in output or "Using existing index" in output
+    assert "Index saved to" in output or "Using existing index" in output
     assert "This is a simulated answer" in output
 
 
 @pytest.mark.skipif(not os.environ.get("OPENAI_API_KEY"), reason="OpenAI API key not available")
-def test_main_cli_openai(test_data_dir):
-    """Test main_cli with OpenAI embeddings."""
+def test_document_rag_openai(test_data_dir):
+    """Test document_rag with OpenAI embeddings."""
     with tempfile.TemporaryDirectory() as temp_dir:
         # Use a subdirectory that doesn't exist yet to force index creation
         index_dir = Path(temp_dir) / "test_index_openai"
         cmd = [
             sys.executable,
-            "examples/main_cli_example.py",
+            "apps/document_rag.py",
             "--llm",
             "simulated",  # Use simulated LLM to avoid GPT-4 costs
             "--embedding-model",
@@ -99,12 +99,12 @@
 )
 
 
-def test_main_cli_error_handling(test_data_dir):
-    """Test main_cli with invalid parameters."""
+def test_document_rag_error_handling(test_data_dir):
+    """Test document_rag with invalid parameters."""
     with tempfile.TemporaryDirectory() as temp_dir:
         cmd = [
             sys.executable,
-            "examples/main_cli_example.py",
+            "apps/document_rag.py",
             "--llm",
             "invalid_llm_type",
             "--index-dir",
@@ -117,4 +117,4 @@ def test_document_rag_error_handling(test_data_dir):
 
     # Should fail with invalid LLM type
     assert result.returncode != 0
-    assert "Unknown LLM type" in result.stderr or "invalid_llm_type" in result.stderr
+    assert "invalid choice" in result.stderr or "invalid_llm_type" in result.stderr
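The final assertion change tracks argparse behavior: when a value fails a `choices` constraint, argparse exits and prints "invalid choice: ..." on stderr rather than a custom "Unknown LLM type" message. A minimal reproduction (the choices list here is illustrative, not document_rag.py's actual one):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--llm", choices=["openai", "simulated"])
    try:
        parser.parse_args(["--llm", "invalid_llm_type"])
    except SystemExit:
        # stderr contains:
        #   argument --llm: invalid choice: 'invalid_llm_type' (choose from 'openai', 'simulated')
        pass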