* feat(core): Add AST-aware code chunking with astchunk integration

This PR introduces intelligent code chunking that preserves semantic boundaries (functions, classes, methods) for better code understanding in RAG applications.

Key Features:
- AST-aware chunking for Python, Java, C#, TypeScript files
- Graceful fallback to traditional chunking for unsupported languages
- New specialized code RAG application for repositories
- Enhanced CLI with --use-ast-chunking flag
- Comprehensive test suite with integration tests

Technical Implementation:
- New chunking_utils.py module with enhanced chunking logic
- Extended base RAG framework with AST chunking arguments
- Updated document RAG with --enable-code-chunking flag
- CLI integration with proper error handling and fallback

Benefits:
- Better semantic understanding of code structure
- Improved search quality for code-related queries
- Maintains backward compatibility with existing workflows
- Supports mixed content (code + documentation) seamlessly

Dependencies:
- Added astchunk and tree-sitter parsers to pyproject.toml
- All dependencies are optional - fallback works without them

Testing:
- Comprehensive test suite in test_astchunk_integration.py
- Integration tests with document RAG
- Error handling and edge case coverage

Documentation:
- Updated README.md with AST chunking highlights
- Added ASTCHUNK_INTEGRATION.md with complete guide
- Updated features.md with new capabilities

* Refactored chunk utils
* Remove useless import
* Update README.md
* Update apps/chunking/utils.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update apps/code_rag.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fix issue
* apply suggestion from @Copilot
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fixes after pr review
* Fix tests not passing
* Fix linter error for documentation files
* Update .gitignore with unwanted files

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andy Lee <andylizf@outlook.com>
✨ Detailed Features
🔥 Core Features
- 🔄 Real-time Embeddings - Eliminate heavy embedding storage by computing embeddings on demand, using optimized ZMQ servers, a search paradigm built on overlapping and batching, and an optimized embedding engine
- 🧠 AST-Aware Code Chunking - Intelligent code chunking that preserves semantic boundaries (functions, classes, methods) for Python, Java, C#, and TypeScript files
- 📈 Scalable Architecture - Handles millions of documents on consumer hardware; the larger your dataset, the more LEANN can save
- 🎯 Graph Pruning - Advanced techniques that shrink the storage overhead of vector search to a small footprint
- 🏗️ Pluggable Backends - HNSW/FAISS (default), with optional DiskANN for large-scale deployments
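
To illustrate the idea behind AST-aware chunking with fallback, here is a minimal sketch that uses only Python's standard `ast` module. It is not the astchunk integration itself: the function name `chunk_python_source` and the `max_chars` parameter are hypothetical, it handles Python sources only, and it splits at top-level statements rather than applying any size budget.

```python
import ast


def chunk_python_source(source: str, max_chars: int = 200) -> list[str]:
    """Split Python source into one chunk per top-level statement.

    Hypothetical helper for illustration; not part of LEANN or astchunk.
    If the source does not parse, fall back to fixed-size character windows.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Fallback: traditional fixed-size chunking, no semantic boundaries.
        return [source[i:i + max_chars] for i in range(0, len(source), max_chars)]

    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # lineno / end_lineno are 1-based and inclusive (Python 3.8+).
        # Start at the first decorator, if any, so it stays with its function.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        chunks.append("\n".join(lines[start - 1:node.end_lineno]))
    # Note: comments and blank lines between top-level statements are dropped.
    return chunks
```

Each function or class ends up whole in a single chunk, which is the property that matters for retrieval; a production chunker would additionally merge small nodes and split oversized ones.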
🛠️ Technical Highlights
- 🔄 Recompute Mode - Targets the highest-accuracy scenarios while eliminating vector storage overhead
- ⚡ Zero-copy Operations - Minimize IPC overhead by transferring distances instead of embeddings
- 🚀 High-throughput Embedding Pipeline - Optimized batched processing for maximum efficiency
- 🎯 Two-level Search - Novel coarse-to-fine search overlap for accelerated query processing (optional)
- 💾 Memory-mapped Indices - Fast startup with raw text mapping to reduce memory overhead
- 🚀 MLX Support - Ultra-fast recompute and build with quantized embedding models, accelerating both index building and search (minimal example)
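
The recompute and distance-transfer ideas above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not LEANN's implementation: `toy_embed` is a hash-based stand-in for a real embedding model, `distances_for` and `search` are hypothetical names, and the search is brute force rather than graph-based.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Deterministic stand-in for an embedding model (assumption)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def distances_for(query_vec, candidate_ids, texts):
    """Server side: recompute candidate embeddings on demand.

    Only one scalar per candidate is returned, so the full vectors
    never cross the process boundary and are never stored.
    """
    out = {}
    for cid in candidate_ids:
        vec = toy_embed(texts[cid])  # recomputed each time, not cached
        out[cid] = 1.0 - sum(q * v for q, v in zip(query_vec, vec))
    return out


def search(query: str, texts: list[str], top_k: int = 2) -> list[int]:
    """Client side: rank candidates using the returned distances only."""
    query_vec = toy_embed(query)
    dists = distances_for(query_vec, range(len(texts)), texts)
    return sorted(dists, key=dists.get)[:top_k]
```

The index here is just the raw texts; the trade-off is extra compute at query time in exchange for not persisting any vectors.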
🎨 Developer Experience
- Simple Python API - Get started in minutes
- Extensible backend system - Easy to add new algorithms
- Comprehensive examples - From basic usage to production deployment