Merge branch 'main' into arch-eval

Andy Lee · committed by GitHub · 2025-08-20 12:25:47 -07:00
13 changed files with 1496 additions and 35 deletions

@@ -176,6 +176,9 @@ response = chat.ask("How much storage does LEANN save?", top_k=1)
LEANN supports RAG on various data sources including documents (`.pdf`, `.txt`, `.md`), Apple Mail, Google Search History, WeChat, and more.
**AST-Aware Code Chunking** - LEANN also features intelligent code chunking that preserves semantic boundaries (functions, classes, methods) for Python, Java, C#, and TypeScript files, providing improved code understanding compared to traditional text-based approaches.
📖 Read the [AST Chunking Guide →](docs/ast_chunking_guide.md) to learn more.
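As a rough illustration of the idea (not LEANN's implementation, which also covers Java, C#, and TypeScript), an AST-aware chunker walks the parse tree and emits one chunk per top-level function or class instead of cutting at arbitrary character offsets. Python's built-in `ast` module is enough to sketch it:

```python
import ast

def ast_chunks(source: str) -> list:
    """Split Python source into one chunk per top-level function or class.

    Illustrative sketch only: it shows why AST boundaries keep each
    definition intact; it does not mirror LEANN's actual chunker.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact original text of the node
            chunks.append(ast.get_source_segment(source, node))
    return chunks
```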
### Generation Model Setup
LEANN supports multiple LLM providers for text generation (OpenAI API, HuggingFace, Ollama).
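As a quick sketch of what provider selection looks like (the `LeannChat` constructor and `llm_config` keys below are assumptions for illustration; see this section for the exact options), switching backends should only change the configuration passed to the chat session:

```python
from leann import LeannChat  # assumed import path; check the docs

# Hypothetical configuration: the provider/model keys are illustrative,
# not the authoritative option names.
chat = LeannChat(
    index_path="./my-docs.leann",
    llm_config={"type": "ollama", "model": "llama3.2:1b"},
)
response = chat.ask("How much storage does LEANN save?", top_k=1)
print(response)
```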
@@ -294,6 +297,12 @@ python -m apps.document_rag --data-dir "~/Documents/Papers" --chunk-size 1024
# Filter only markdown and Python files with smaller chunks
python -m apps.document_rag --data-dir "./docs" --chunk-size 256 --file-types .md .py
# Enable AST-aware chunking for code files
python -m apps.document_rag --enable-code-chunking --data-dir "./my_project"
# Or use the specialized code RAG for better code understanding
python -m apps.code_rag --repo-dir "./my_codebase" --query "How does authentication work?"
```
</details>
@@ -472,6 +481,7 @@ Once the index is built, you can ask questions like:
**Key features:**
- 🔍 **Semantic code search** across your entire project, backed by a fully local, lightweight index (see the sketch after this list)
- 🧠 **AST-aware chunking** preserves code structure (functions, classes)
- 📚 **Context-aware assistance** for debugging and development
- 🚀 **Zero-config setup** with automatic language detection
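To make the semantic code search concrete, here is a hypothetical sketch of querying a built code index from Python (the `LeannSearcher` name and parameters are assumptions for illustration, not taken from this section):

```python
from leann import LeannSearcher  # assumed import path; check the docs

# Hypothetical usage: the searcher API shown here is illustrative.
searcher = LeannSearcher(index_path="./my_codebase.leann")
hits = searcher.search("where is request authentication handled?", top_k=5)
for hit in hits:
    print(hit)  # each hit carries a matched code chunk and its score
```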
@@ -534,7 +544,8 @@ leann remove my-docs
**Key CLI features:**
- Auto-detects document formats (PDF, TXT, MD, DOCX, PPTX + code files)
- Smart text chunking with overlap
- **🧠 AST-aware chunking** for Python, Java, C#, TypeScript files
- Smart text chunking with overlap for all other content
- Multiple LLM providers (Ollama, OpenAI, HuggingFace)
- Organized index storage in `.leann/indexes/` (project-local)
- Support for advanced search parameters