Files
LEANN/.gitignore
Gabriel Dehan 13bb561aad Add AST-aware code chunking for better code understanding (#58)
* feat(core): Add AST-aware code chunking with astchunk integration

This PR introduces intelligent code chunking that preserves semantic boundaries
(functions, classes, methods) for better code understanding in RAG applications.

Key Features:
- AST-aware chunking for Python, Java, C#, TypeScript files
- Graceful fallback to traditional chunking for unsupported languages
- New specialized code RAG application for repositories
- Enhanced CLI with --use-ast-chunking flag
- Comprehensive test suite with integration tests

Technical Implementation:
- New chunking_utils.py module with enhanced chunking logic
- Extended base RAG framework with AST chunking arguments
- Updated document RAG with --enable-code-chunking flag
- CLI integration with proper error handling and fallback

Benefits:
- Better semantic understanding of code structure
- Improved search quality for code-related queries
- Maintains backward compatibility with existing workflows
- Supports mixed content (code + documentation) seamlessly

Dependencies:
- Added astchunk and tree-sitter parsers to pyproject.toml
- All dependencies are optional - fallback works without them

Testing:
- Comprehensive test suite in test_astchunk_integration.py
- Integration tests with document RAG
- Error handling and edge case coverage

Documentation:
- Updated README.md with AST chunking highlights
- Added ASTCHUNK_INTEGRATION.md with complete guide
- Updated features.md with new capabilities

* Refactored chunk utils

* Remove useless import

* Update README.md

* Update apps/chunking/utils.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update apps/code_rag.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix issue

* apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fixes after pr review

* Fix tests not passing

* Fix linter error for documentation files

* Update .gitignore with unwanted files

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andy Lee <andylizf@outlook.com>
2025-08-19 23:35:31 -07:00

101 lines
1.3 KiB
Plaintext
Executable File

raw_data/
scaling_out/
scaling_out_old/
sanity_check/
demo/indices/
# .vscode/
*.log
*pycache*
outputs/
*.pkl
*.pdf
*.idx
*.map
.history/
lm_eval.egg-info/
demo/experiment_results/**/*.json
*.jsonl
*.eml
*.emlx
*.json
!.vscode/*.json
*.sh
*.txt
!CMakeLists.txt
latency_breakdown*.json
experiment_results/eval_results/diskann/*.json
aws/
.venv/
.cursor/rules/
*.egg-info/
skip_reorder_comparison/
analysis_results/
build/
.cache/
nprobe_logs/
micro/results
micro/contriever-INT8
data/*
!data/2501.14312v1 (1).pdf
!data/2506.08276v1.pdf
!data/PrideandPrejudice.txt
!data/huawei_pangu.md
!data/ground_truth/
!data/indices/
!data/queries/
!data/.gitattributes
*.qdstrm
benchmark_results/
results/
frac_*.png
final_in_*.png
embedding_comparison_results/
*.ind
*.gz
*.fvecs
*.ivecs
*.index
*.bin
*.old
read_graph
analyze_diskann_graph
degree_distribution.png
micro/degree_distribution.png
policy_results_*
results_*/
experiment_results/
.DS_Store
# The above are inherited from old Power RAG repo
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info
# Virtual environments
.venv
.env
test_indices*/
test_*.py
!tests/**
packages/leann-backend-diskann/third_party/DiskANN/_deps/
*.meta.json
*.passages.json
batchtest.py
tests/__pytest_cache__/
tests/__pycache__/
CLAUDE.md
CLAUDE.local.md
.claude/*.local.*
.claude/local/*