LEANN/docs/ast_chunking_guide.md
Gabriel Dehan 13bb561aad Add AST-aware code chunking for better code understanding (#58)
* feat(core): Add AST-aware code chunking with astchunk integration

This PR introduces intelligent code chunking that preserves semantic boundaries
(functions, classes, methods) for better code understanding in RAG applications.

Key Features:
- AST-aware chunking for Python, Java, C#, TypeScript files
- Graceful fallback to traditional chunking for unsupported languages
- New specialized code RAG application for repositories
- Enhanced CLI with --use-ast-chunking flag
- Comprehensive test suite with integration tests

Technical Implementation:
- New chunking_utils.py module with enhanced chunking logic
- Extended base RAG framework with AST chunking arguments
- Updated document RAG with --enable-code-chunking flag
- CLI integration with proper error handling and fallback

Benefits:
- Better semantic understanding of code structure
- Improved search quality for code-related queries
- Maintains backward compatibility with existing workflows
- Supports mixed content (code + documentation) seamlessly

Dependencies:
- Added astchunk and tree-sitter parsers to pyproject.toml
- All dependencies are optional - fallback works without them

Testing:
- Comprehensive test suite in test_astchunk_integration.py
- Integration tests with document RAG
- Error handling and edge case coverage

Documentation:
- Updated README.md with AST chunking highlights
- Added ASTCHUNK_INTEGRATION.md with complete guide
- Updated features.md with new capabilities

* Refactored chunk utils

* Remove useless import

* Update README.md

* Update apps/chunking/utils.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update apps/code_rag.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix issue

* apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fixes after pr review

* Fix tests not passing

* Fix linter error for documentation files

* Update .gitignore with unwanted files

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andy Lee <andylizf@outlook.com>
2025-08-19 23:35:31 -07:00

AST-Aware Code Chunking Guide

Overview

This guide covers best practices for using AST-aware code chunking in LEANN. Unlike traditional text-based chunking, which cuts text at size limits, AST chunking splits code along semantic boundaries (functions, classes, methods), so each retrieved chunk is a complete unit of code.
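To make the idea of splitting on semantic boundaries concrete, here is a minimal sketch that uses Python's standard `ast` module to cut a source string at top-level statements. It is not the astchunk implementation that LEANN integrates; the helper name `split_top_level` and the sample source are illustrative assumptions.

```python
import ast

SOURCE = '''\
import os

def load(path):
    with open(path) as f:
        return f.read()

class Index:
    def add(self, doc):
        self.docs.append(doc)
'''

def split_top_level(source: str) -> list:
    """Return one chunk per top-level statement, cut on AST node boundaries."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # lineno is 1-based and end_lineno is inclusive (Python 3.8+).
        chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks

chunks = split_top_level(SOURCE)

# For comparison: fixed-width slicing ignores syntax and can cut a function in half.
fixed_width = [SOURCE[i : i + 60] for i in range(0, len(SOURCE), 60)]
```

For this input, `chunks` holds three entries: the import, the whole `load` function, and the whole `Index` class. A real AST chunker additionally merges small nodes and splits oversized ones to respect a size budget.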

Quick Start

Basic Usage

# Enable AST chunking for mixed content (code + docs)
python -m apps.document_rag --enable-code-chunking --data-dir ./my_project

# Specialized code repository indexing
python -m apps.code_rag --repo-dir ./my_codebase

# Global CLI with AST support
leann build my-code-index --docs ./src --use-ast-chunking

Installation

# Install LEANN with AST chunking support
uv pip install -e "."

Best Practices

When to Use AST Chunking

Recommended for:

  • Code repositories with multiple languages
  • Mixed documentation and code content
  • Complex codebases with deep function/class hierarchies
  • Code assistance with Claude Code

Not recommended for:

  • Pure text documents
  • Very large files (>1 MB)
  • Languages outside the supported list below (these fall back to traditional chunking)

Optimal Configuration

# Recommended settings for most codebases
python -m apps.code_rag \
    --repo-dir ./src \
    --ast-chunk-size 768 \
    --ast-chunk-overlap 96 \
    --exclude-dirs .git __pycache__ node_modules build dist
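The interaction between a chunk size and an overlap can be sketched as a sliding window. This guide does not state which unit `--ast-chunk-size` and `--ast-chunk-overlap` are measured in, so the sketch below assumes lines purely for illustration; `window_chunks` is a hypothetical helper, not a LEANN function.

```python
def window_chunks(lines, chunk_size, overlap):
    """Group items into windows of chunk_size that share `overlap` items with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append(lines[start : start + chunk_size])
        if start + chunk_size >= len(lines):
            break  # the last window already reaches the end of the input
    return chunks
```

With a size of 768 and an overlap of 96, each window advances by 672 units, so a larger overlap preserves more shared context at the cost of producing more chunks.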

Supported Languages

| Extension | Language | Status |
|-----------|----------|--------|
| .py | Python | Full support |
| .java | Java | Full support |
| .cs | C# | Full support |
| .ts, .tsx | TypeScript | Full support |
| .js, .jsx | JavaScript | Via TypeScript parser |
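The table above amounts to a lookup from file extension to parser language. A minimal sketch of that lookup follows; the dictionary name, the language identifiers (for example `"csharp"`), and the `pick_strategy` helper are assumptions for illustration and may not match the names used inside LEANN.

```python
from pathlib import Path
from typing import Optional, Tuple

# Mirrors the supported-languages table; identifiers are illustrative.
AST_LANGUAGES = {
    ".py": "python",
    ".java": "java",
    ".cs": "csharp",
    ".ts": "typescript",
    ".tsx": "typescript",
    ".js": "typescript",   # JavaScript goes through the TypeScript parser
    ".jsx": "typescript",
}

def pick_strategy(path: str) -> Tuple[str, Optional[str]]:
    """Return ("ast", language) for supported extensions, else ("traditional", None)."""
    language = AST_LANGUAGES.get(Path(path).suffix.lower())
    if language is None:
        return ("traditional", None)
    return ("ast", language)
```

Any extension missing from the mapping, such as `.md` or `.go`, is routed to traditional chunking, which is the fallback behavior this guide describes.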

Integration Examples

Document RAG with Code Support

# Enable code chunking in document RAG
python -m apps.document_rag \
    --enable-code-chunking \
    --data-dir ./project \
    --query "How does authentication work in the codebase?"

Claude Code Integration

When used with the Claude Code MCP server, AST chunking provides better context for:

  • Code completion and suggestions
  • Bug analysis and debugging
  • Architecture understanding
  • Refactoring assistance

Troubleshooting

Common Issues

  1. Fallback to Traditional Chunking
    • Normal behavior for unsupported languages
    • Check the logs to confirm whether a file's language is supported
  2. Performance with Large Files
    • Adjust the --max-file-size parameter
    • Use --exclude-dirs to skip unnecessary directories
  3. Quality Issues
    • Try different --ast-chunk-size values (512, 768, 1024)
    • Adjust --ast-chunk-overlap for better context preservation
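The two filters named under "Performance with Large Files" can be sketched as a directory walk that prunes excluded directories and skips oversized files. This is a hedged illustration of the general technique, not LEANN's file loader; the function name and the 1,000,000-byte default are assumptions.

```python
import os

def iter_source_files(root, exclude_dirs, max_file_size=1_000_000):
    """Yield file paths under root, skipping excluded directories and oversized files."""
    excluded = set(exclude_dirs)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = [d for d in dirnames if d not in excluded]
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Size is checked in bytes before the file is ever read.
            if os.path.getsize(path) <= max_file_size:
                yield path
```

Pruning `dirnames` in place matters for performance: directories such as `node_modules` are never traversed at all, rather than being walked and filtered afterwards.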

Debug Mode

export LEANN_LOG_LEVEL=DEBUG
python -m apps.code_rag --repo-dir ./my_code
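How LEANN reads `LEANN_LOG_LEVEL` internally is not shown in this guide. As a general pattern, an application can map such an environment variable onto Python's standard `logging` levels as sketched below; `configure_logging` and the `WARNING` default are assumptions.

```python
import logging
import os

def configure_logging(env_var: str = "LEANN_LOG_LEVEL", default: str = "WARNING") -> int:
    """Set the root logger level from an environment variable and return the level used."""
    name = os.environ.get(env_var, default).upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        level = logging.WARNING  # unknown names fall back to a safe default
    logging.basicConfig(level=level)
    # basicConfig is a no-op if handlers already exist, so set the level explicitly too.
    logging.getLogger().setLevel(level)
    return level
```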

Migration from Traditional Chunking

Existing workflows continue to work without changes. To enable AST chunking:

# Before
python -m apps.document_rag --chunk-size 256

# After (maintains traditional chunking for non-code files)
python -m apps.document_rag --enable-code-chunking --chunk-size 256 --ast-chunk-size 768
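The "After" command keeps two sizes because code and non-code files take different paths. The sketch below shows one way such a dispatch with graceful fallback could look. It is an assumption-laden illustration: `chunk_document`, the `ast_chunker` callable, and the fixed-width `traditional_chunks` stand-in are hypothetical and do not reproduce LEANN's actual splitter.

```python
import logging
from typing import Callable, List, Optional

logger = logging.getLogger("chunking_sketch")

def traditional_chunks(text: str, chunk_size: int) -> List[str]:
    """Fixed-width character chunks; stands in for the traditional splitter."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

def chunk_document(
    text: str,
    is_code: bool,
    ast_chunker: Optional[Callable[[str, int], List[str]]] = None,
    chunk_size: int = 256,
    ast_chunk_size: int = 768,
) -> List[str]:
    """Use the AST chunker for code when one is available; otherwise fall back."""
    if is_code and ast_chunker is not None:
        try:
            return ast_chunker(text, ast_chunk_size)
        except Exception as exc:  # e.g. parser missing or unparsable source
            logger.warning("AST chunking failed (%s); using traditional chunking", exc)
    return traditional_chunks(text, chunk_size)
```

Because every failure path ends in `traditional_chunks`, a workflow that never passes an AST chunker behaves exactly as it did before, which is the backward-compatibility property this section describes.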

Note: AST chunking maintains full backward compatibility while enhancing code understanding capabilities.