Gabriel Dehan 13bb561aad Add AST-aware code chunking for better code understanding (#58)

* feat(core): Add AST-aware code chunking with astchunk integration

This PR introduces intelligent code chunking that preserves semantic boundaries
(functions, classes, methods) for better code understanding in RAG applications.

Key Features:
- AST-aware chunking for Python, Java, C#, TypeScript files
- Graceful fallback to traditional chunking for unsupported languages
- New specialized code RAG application for repositories
- Enhanced CLI with --use-ast-chunking flag
- Comprehensive test suite with integration tests

Technical Implementation:
- New chunking_utils.py module with enhanced chunking logic
- Extended base RAG framework with AST chunking arguments
- Updated document RAG with --enable-code-chunking flag
- CLI integration with proper error handling and fallback

Benefits:
- Better semantic understanding of code structure
- Improved search quality for code-related queries
- Maintains backward compatibility with existing workflows
- Supports mixed content (code + documentation) seamlessly

Dependencies:
- Added astchunk and tree-sitter parsers to pyproject.toml
- All dependencies are optional - fallback works without them

Testing:
- Comprehensive test suite in test_astchunk_integration.py
- Integration tests with document RAG
- Error handling and edge case coverage

Documentation:
- Updated README.md with AST chunking highlights
- Added ASTCHUNK_INTEGRATION.md with complete guide
- Updated features.md with new capabilities

* Refactored chunk utils

* Remove useless import

* Update README.md

* Update apps/chunking/utils.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update apps/code_rag.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix issue

* apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fixes after pr review

* Fix tests not passing

* Fix linter error for documentation files

* Update .gitignore with unwanted files

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andy Lee <andylizf@outlook.com>

2025-08-19 23:35:31 -07:00

| File | Last commit | Date |
| --- | --- | --- |
| README.md | feat(core,diskann): robust embedding server (no-hang) + DiskANN fast mode (graph partition) (#29) | 2025-08-14 01:02:24 -07:00 |
| test_astchunk_integration.py | Add AST-aware code chunking for better code understanding (#58) | 2025-08-19 23:35:31 -07:00 |
| test_basic.py | feat(core,diskann): robust embedding server (no-hang) + DiskANN fast mode (graph partition) (#29) | 2025-08-14 01:02:24 -07:00 |
| test_ci_minimal.py | refactor: Unify examples interface with BaseRAGExample (#12) | 2025-08-03 23:06:24 -07:00 |
| test_diskann_partition.py | feat(core,diskann): robust embedding server (no-hang) + DiskANN fast mode (graph partition) (#29) | 2025-08-14 01:02:24 -07:00 |
| test_document_rag.py | Add AST-aware code chunking for better code understanding (#58) | 2025-08-19 23:35:31 -07:00 |
| test_readme_examples.py | feat(core,diskann): robust embedding server (no-hang) + DiskANN fast mode (graph partition) (#29) | 2025-08-14 01:02:24 -07:00 |

README.md

LEANN Tests

This directory contains automated tests for the LEANN project using pytest.

Test Files

`test_readme_examples.py`

Tests the examples shown in README.md:

- The basic example code that users see first (parametrized for both HNSW and DiskANN backends)
- Import statements work correctly
- Different backend options (HNSW, DiskANN)
- Different LLM configuration options (parametrized for both backends)
- All main README examples are tested with both HNSW and DiskANN backends using pytest parametrization

`test_basic.py`

Basic functionality tests that verify:

- All packages can be imported correctly
- C++ extensions (FAISS, DiskANN) load properly
- Basic index building and searching works for both HNSW and DiskANN backends
- Uses parametrized tests to cover both backends
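As a rough illustration of how one test function can cover both backends, here is a minimal sketch using `pytest.mark.parametrize`. The helper `build_and_search` is a hypothetical stand-in, not LEANN's API; only the test name `test_backend_basic` and the `hnsw`/`diskann` IDs are taken from the commands in this README.

```python
import pytest

BACKENDS = ["hnsw", "diskann"]


def build_and_search(backend_name: str, docs: list[str], query: str) -> list[str]:
    """Hypothetical stand-in for building an index and searching it.

    A real test would call the LEANN builder and searcher here; this
    placeholder does a plain substring match so the sketch stays
    self-contained.
    """
    if backend_name not in BACKENDS:
        raise ValueError(f"unknown backend: {backend_name}")
    return [doc for doc in docs if query.lower() in doc.lower()]


@pytest.mark.parametrize("backend_name", BACKENDS)
def test_backend_basic(backend_name):
    # pytest generates one test per parameter, with IDs such as
    # test_backend_basic[hnsw] and test_backend_basic[diskann].
    results = build_and_search(backend_name, ["LEANN is a vector index"], "vector")
    assert len(results) == 1
```

Because each parameter value becomes part of the test ID, a single backend can be selected by that ID or with `-k`.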

`test_document_rag.py`

Tests the document RAG example functionality:

- Tests with facebook/contriever embeddings
- Tests with OpenAI embeddings (if API key is available)
- Tests error handling with invalid parameters
- Verifies that normalized embeddings are detected and cosine distance is used
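The idea behind the last check can be sketched as follows. This is not LEANN's detection logic: the function names, the tolerance, and the metric labels `"cosine"` and `"mips"` are assumptions made for illustration. It only shows what "normalized" means here (every vector has L2 norm close to 1) and how a metric could be chosen from that.

```python
import math


def is_normalized(embeddings, tol=1e-3):
    """Return True if there is at least one vector and every vector has L2 norm within tol of 1."""
    return bool(embeddings) and all(
        abs(math.sqrt(sum(x * x for x in vec)) - 1.0) <= tol for vec in embeddings
    )


def choose_distance_metric(embeddings):
    # Assumption for this sketch: unit-length vectors select "cosine",
    # anything else falls back to an inner-product metric labeled "mips".
    return "cosine" if is_normalized(embeddings) else "mips"
```

For example, `choose_distance_metric([[0.6, 0.8]])` selects the cosine branch, since that vector has unit length.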

`test_diskann_partition.py`

Tests DiskANN graph partitioning functionality:

- Tests DiskANN index building without partitioning (baseline)
- Tests automatic graph partitioning with `is_recompute=True`
- Verifies that partition files are created and large files are cleaned up to save storage
- Tests search functionality with partitioned indices
- Validates medoid and max_base_norm file generation and usage
- Includes performance comparison between DiskANN (with partition) and HNSW
- Note: These tests are skipped in CI because of their hardware requirements and computation time

Running Tests

Install test dependencies:

# Using extras
uv pip install -e ".[test]"

Run all tests:

pytest tests/

# Or with coverage
pytest tests/ --cov=leann --cov-report=html

# Run in parallel (faster)
pytest tests/ -n auto

Run specific tests:

# Only basic tests
pytest tests/test_basic.py

# Only tests that don't require OpenAI
pytest tests/ -m "not openai"

# Skip slow tests
pytest tests/ -m "not slow"

# Run DiskANN partition tests (requires local machine, not CI)
pytest tests/test_diskann_partition.py

Run with specific backend:

# Test only HNSW backend (quote the test ID so the shell does not treat the brackets as a glob)
pytest "tests/test_basic.py::test_backend_basic[hnsw]"
pytest "tests/test_readme_examples.py::test_readme_basic_example[hnsw]"

# Test only DiskANN backend
pytest "tests/test_basic.py::test_backend_basic[diskann]"
pytest "tests/test_readme_examples.py::test_readme_basic_example[diskann]"

# All DiskANN tests (parametrized + specialized partition tests)
pytest tests/ -k diskann

CI/CD Integration

Tests are automatically run in GitHub Actions:

- After building wheel packages
- On multiple Python versions (3.9 - 3.13)
- On both Ubuntu and macOS
- Using pytest with appropriate markers and flags

pytest.ini Configuration

The pytest.ini file configures:

- Test discovery paths
- Default timeout (600 seconds)
- Environment variables (HF_HUB_DISABLE_SYMLINKS, TOKENIZERS_PARALLELISM)
- Custom markers for slow and OpenAI tests
- Verbose output with short tracebacks
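For readers more familiar with `conftest.py` than with ini files, part of this configuration could be expressed in Python roughly as below. This is a hedged sketch, not the project's actual setup: the marker descriptions and the `"false"` value are assumptions, and the timeout is left out because it normally depends on a separate plugin.

```python
import os


def pytest_configure(config):
    """Register custom markers and set an environment default.

    Illustrative only: the project's real settings live in pytest.ini,
    and the strings below are placeholder descriptions.
    """
    # Registering markers avoids unknown-marker warnings and documents
    # what `-m "not slow"` and `-m "not openai"` select against.
    config.addinivalue_line("markers", "slow: marks tests as slow")
    config.addinivalue_line("markers", "openai: marks tests that need an OpenAI API key")
    # Assumed value; setdefault keeps any value already set by the caller.
    os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
```

pytest calls `pytest_configure` once at startup, so the markers are known before test collection begins.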

Known Issues

- OpenAI tests are automatically skipped if no API key is provided
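One common way to get this behavior is a `skipif` condition on an environment variable, combined with the `openai` marker used by `-m "not openai"`. The sketch below assumes the key is read from `OPENAI_API_KEY`; the helper and test names are hypothetical and the test body is a placeholder.

```python
import os

import pytest


def has_openai_key(env=None) -> bool:
    """Return True when a non-empty OPENAI_API_KEY is present (assumed variable name)."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


# Evaluated once at import time; tests carrying it are reported as skipped.
requires_openai = pytest.mark.skipif(
    not has_openai_key(), reason="OPENAI_API_KEY is not set"
)


@pytest.mark.openai  # lets `-m "not openai"` deselect this test
@requires_openai
def test_openai_embeddings_placeholder():
    # Placeholder body: a real test would request embeddings here.
    assert has_openai_key()
```

With both decorators, the test is skipped when the key is missing and can still be excluded explicitly by marker.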