Add AST-aware code chunking for better code understanding (#58)
* feat(core): Add AST-aware code chunking with astchunk integration This PR introduces intelligent code chunking that preserves semantic boundaries (functions, classes, methods) for better code understanding in RAG applications. Key Features: - AST-aware chunking for Python, Java, C#, TypeScript files - Graceful fallback to traditional chunking for unsupported languages - New specialized code RAG application for repositories - Enhanced CLI with --use-ast-chunking flag - Comprehensive test suite with integration tests Technical Implementation: - New chunking_utils.py module with enhanced chunking logic - Extended base RAG framework with AST chunking arguments - Updated document RAG with --enable-code-chunking flag - CLI integration with proper error handling and fallback Benefits: - Better semantic understanding of code structure - Improved search quality for code-related queries - Maintains backward compatibility with existing workflows - Supports mixed content (code + documentation) seamlessly Dependencies: - Added astchunk and tree-sitter parsers to pyproject.toml - All dependencies are optional - fallback works without them Testing: - Comprehensive test suite in test_astchunk_integration.py - Integration tests with document RAG - Error handling and edge case coverage Documentation: - Updated README.md with AST chunking highlights - Added ASTCHUNK_INTEGRATION.md with complete guide - Updated features.md with new capabilities * Refactored chunk utils * Remove useless import * Update README.md * Update apps/chunking/utils.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update apps/code_rag.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fix issue * apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fixes after pr review * Fix tests not passing * Fix linter error for documentation files * Update .gitignore with unwanted files --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Andy Lee <andylizf@outlook.com>
This commit is contained in:
13
README.md
13
README.md
@@ -176,6 +176,9 @@ response = chat.ask("How much storage does LEANN save?", top_k=1)
|
||||
|
||||
LEANN supports RAG on various data sources including documents (`.pdf`, `.txt`, `.md`), Apple Mail, Google Search History, WeChat, and more.
|
||||
|
||||
**AST-Aware Code Chunking** - LEANN also features intelligent code chunking that preserves semantic boundaries (functions, classes, methods) for Python, Java, C#, and TypeScript files, providing improved code understanding compared to traditional text-based approaches.
|
||||
📖 Read the [AST Chunking Guide →](docs/ast_chunking_guide.md) to learn more.
|
||||
|
||||
### Generation Model Setup
|
||||
|
||||
LEANN supports multiple LLM providers for text generation (OpenAI API, HuggingFace, Ollama).
|
||||
@@ -294,6 +297,12 @@ python -m apps.document_rag --data-dir "~/Documents/Papers" --chunk-size 1024
|
||||
|
||||
# Filter only markdown and Python files with smaller chunks
|
||||
python -m apps.document_rag --data-dir "./docs" --chunk-size 256 --file-types .md .py
|
||||
|
||||
# Enable AST-aware chunking for code files
|
||||
python -m apps.document_rag --enable-code-chunking --data-dir "./my_project"
|
||||
|
||||
# Or use the specialized code RAG for better code understanding
|
||||
python -m apps.code_rag --repo-dir "./my_codebase" --query "How does authentication work?"
|
||||
```
|
||||
|
||||
</details>
|
||||
@@ -472,6 +481,7 @@ Once the index is built, you can ask questions like:
|
||||
|
||||
**Key features:**
|
||||
- 🔍 **Semantic code search** across your entire project, fully local index and lightweight
|
||||
- 🧠 **AST-aware chunking** preserves code structure (functions, classes)
|
||||
- 📚 **Context-aware assistance** for debugging and development
|
||||
- 🚀 **Zero-config setup** with automatic language detection
|
||||
|
||||
@@ -534,7 +544,8 @@ leann remove my-docs
|
||||
|
||||
**Key CLI features:**
|
||||
- Auto-detects document formats (PDF, TXT, MD, DOCX, PPTX + code files)
|
||||
- Smart text chunking with overlap
|
||||
- **🧠 AST-aware chunking** for Python, Java, C#, TypeScript files
|
||||
- Smart text chunking with overlap for all other content
|
||||
- Multiple LLM providers (Ollama, OpenAI, HuggingFace)
|
||||
- Organized index storage in `.leann/indexes/` (project-local)
|
||||
- Support for advanced search parameters
|
||||
|
||||
Reference in New Issue
Block a user