* feat(core): Add AST-aware code chunking with astchunk integration This PR introduces intelligent code chunking that preserves semantic boundaries (functions, classes, methods) for better code understanding in RAG applications. Key Features: - AST-aware chunking for Python, Java, C#, TypeScript files - Graceful fallback to traditional chunking for unsupported languages - New specialized code RAG application for repositories - Enhanced CLI with --use-ast-chunking flag - Comprehensive test suite with integration tests Technical Implementation: - New chunking_utils.py module with enhanced chunking logic - Extended base RAG framework with AST chunking arguments - Updated document RAG with --enable-code-chunking flag - CLI integration with proper error handling and fallback Benefits: - Better semantic understanding of code structure - Improved search quality for code-related queries - Maintains backward compatibility with existing workflows - Supports mixed content (code + documentation) seamlessly Dependencies: - Added astchunk and tree-sitter parsers to pyproject.toml - All dependencies are optional - fallback works without them Testing: - Comprehensive test suite in test_astchunk_integration.py - Integration tests with document RAG - Error handling and edge case coverage Documentation: - Updated README.md with AST chunking highlights - Added ASTCHUNK_INTEGRATION.md with complete guide - Updated features.md with new capabilities * Refactored chunk utils * Remove useless import * Update README.md * Update apps/chunking/utils.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update apps/code_rag.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fix issue * apply suggestion from @Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fixes after pr review * Fix tests not passing * Fix linter error for documentation files * Update .gitignore with unwanted files --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Andy Lee <andylizf@outlook.com>
129 lines
3.2 KiB
Markdown
129 lines
3.2 KiB
Markdown
# AST-Aware Code chunking guide
|
|
|
|
## Overview
|
|
|
|
This guide covers best practices for using AST-aware code chunking in LEANN. AST chunking provides better semantic understanding of code structure compared to traditional text-based chunking.
|
|
|
|
## Quick Start
|
|
|
|
### Basic Usage
|
|
|
|
```bash
|
|
# Enable AST chunking for mixed content (code + docs)
|
|
python -m apps.document_rag --enable-code-chunking --data-dir ./my_project
|
|
|
|
# Specialized code repository indexing
|
|
python -m apps.code_rag --repo-dir ./my_codebase
|
|
|
|
# Global CLI with AST support
|
|
leann build my-code-index --docs ./src --use-ast-chunking
|
|
```
|
|
|
|
### Installation
|
|
|
|
```bash
|
|
# Install LEANN with AST chunking support
|
|
uv pip install -e "."
|
|
```
|
|
|
|
## Best Practices
|
|
|
|
### When to Use AST Chunking
|
|
|
|
✅ **Recommended for:**
|
|
- Code repositories with multiple languages
|
|
- Mixed documentation and code content
|
|
- Complex codebases with deep function/class hierarchies
|
|
- When working with Claude Code for code assistance
|
|
|
|
❌ **Not recommended for:**
|
|
- Pure text documents
|
|
- Very large files (>1MB)
|
|
- Languages not supported by tree-sitter
|
|
|
|
### Optimal Configuration
|
|
|
|
```bash
|
|
# Recommended settings for most codebases
|
|
python -m apps.code_rag \
|
|
--repo-dir ./src \
|
|
--ast-chunk-size 768 \
|
|
--ast-chunk-overlap 96 \
|
|
--exclude-dirs .git __pycache__ node_modules build dist
|
|
```
|
|
|
|
### Supported Languages
|
|
|
|
| Extension | Language | Status |
|
|
|-----------|----------|--------|
|
|
| `.py` | Python | ✅ Full support |
|
|
| `.java` | Java | ✅ Full support |
|
|
| `.cs` | C# | ✅ Full support |
|
|
| `.ts`, `.tsx` | TypeScript | ✅ Full support |
|
|
| `.js`, `.jsx` | JavaScript | ✅ Via TypeScript parser |
|
|
|
|
## Integration Examples
|
|
|
|
### Document RAG with Code Support
|
|
|
|
```python
|
|
# Enable code chunking in document RAG
|
|
python -m apps.document_rag \
|
|
--enable-code-chunking \
|
|
--data-dir ./project \
|
|
--query "How does authentication work in the codebase?"
|
|
```
|
|
|
|
### Claude Code Integration
|
|
|
|
When using with Claude Code MCP server, AST chunking provides better context for:
|
|
- Code completion and suggestions
|
|
- Bug analysis and debugging
|
|
- Architecture understanding
|
|
- Refactoring assistance
|
|
|
|
## Troubleshooting
|
|
|
|
### Common Issues
|
|
|
|
1. **Fallback to Traditional Chunking**
|
|
- Normal behavior for unsupported languages
|
|
- Check logs for specific language support
|
|
|
|
2. **Performance with Large Files**
|
|
- Adjust `--max-file-size` parameter
|
|
- Use `--exclude-dirs` to skip unnecessary directories
|
|
|
|
3. **Quality Issues**
|
|
- Try different `--ast-chunk-size` values (512, 768, 1024)
|
|
- Adjust overlap for better context preservation
|
|
|
|
### Debug Mode
|
|
|
|
```bash
|
|
export LEANN_LOG_LEVEL=DEBUG
|
|
python -m apps.code_rag --repo-dir ./my_code
|
|
```
|
|
|
|
## Migration from Traditional Chunking
|
|
|
|
Existing workflows continue to work without changes. To enable AST chunking:
|
|
|
|
```bash
|
|
# Before
|
|
python -m apps.document_rag --chunk-size 256
|
|
|
|
# After (maintains traditional chunking for non-code files)
|
|
python -m apps.document_rag --enable-code-chunking --chunk-size 256 --ast-chunk-size 768
|
|
```
|
|
|
|
## References
|
|
|
|
- [astchunk GitHub Repository](https://github.com/yilinjz/astchunk)
|
|
- [LEANN MCP Integration](../packages/leann-mcp/README.md)
|
|
- [Research Paper](https://arxiv.org/html/2506.15655v1)
|
|
|
|
---
|
|
|
|
**Note**: AST chunking maintains full backward compatibility while enhancing code understanding capabilities.
|