* feat(core): Add AST-aware code chunking with astchunk integration

  This PR introduces intelligent code chunking that preserves semantic boundaries (functions, classes, methods) for better code understanding in RAG applications.

  Key Features:
  - AST-aware chunking for Python, Java, C#, TypeScript files
  - Graceful fallback to traditional chunking for unsupported languages
  - New specialized code RAG application for repositories
  - Enhanced CLI with --use-ast-chunking flag
  - Comprehensive test suite with integration tests

  Technical Implementation:
  - New chunking_utils.py module with enhanced chunking logic
  - Extended base RAG framework with AST chunking arguments
  - Updated document RAG with --enable-code-chunking flag
  - CLI integration with proper error handling and fallback

  Benefits:
  - Better semantic understanding of code structure
  - Improved search quality for code-related queries
  - Maintains backward compatibility with existing workflows
  - Supports mixed content (code + documentation) seamlessly

  Dependencies:
  - Added astchunk and tree-sitter parsers to pyproject.toml
  - All dependencies are optional - fallback works without them

  Testing:
  - Comprehensive test suite in test_astchunk_integration.py
  - Integration tests with document RAG
  - Error handling and edge case coverage

  Documentation:
  - Updated README.md with AST chunking highlights
  - Added ASTCHUNK_INTEGRATION.md with complete guide
  - Updated features.md with new capabilities

* Refactored chunk utils
* Remove useless import
* Update README.md
* Update apps/chunking/utils.py

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update apps/code_rag.py

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix issue
* apply suggestion from @Copilot

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fixes after pr review
* Fix tests not passing
* Fix linter error for documentation files
* Update .gitignore with unwanted files

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andy Lee <andylizf@outlook.com>
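The commit message describes chunking that follows semantic boundaries and falls back to traditional chunking when AST parsing is unavailable. The sketch below illustrates that idea for Python source only, using the standard-library `ast` module rather than astchunk or tree-sitter. The function name `chunk_python_source` and the `max_chars` parameter are hypothetical and are not taken from this PR.

```python
import ast


def chunk_python_source(source: str, max_chars: int = 800) -> list[str]:
    """Split Python source into one chunk per top-level statement.

    Hypothetical helper for illustration; not the PR's implementation.
    If the source cannot be parsed, fall back to fixed-size character chunks.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Fallback path: plain fixed-size chunking, no semantic boundaries.
        return [source[i:i + max_chars] for i in range(0, len(source), max_chars)]

    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # Start at the first decorator, if any, so it stays with its definition.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        # lineno/end_lineno are 1-based and inclusive.
        chunks.append("\n".join(lines[start - 1:node.end_lineno]))
    return chunks
```

In this simplified version, comments and blank lines between top-level statements are dropped, and `max_chars` only applies on the fallback path. A production chunker would likely also split oversized classes or functions.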
101 lines
1.3 KiB
Plaintext
Executable File
raw_data/
scaling_out/
scaling_out_old/
sanity_check/
demo/indices/
# .vscode/
*.log
*pycache*
outputs/
*.pkl
*.pdf
*.idx
*.map
.history/
lm_eval.egg-info/
demo/experiment_results/**/*.json
*.jsonl
*.eml
*.emlx
*.json
!.vscode/*.json
*.sh
*.txt
!CMakeLists.txt
latency_breakdown*.json
experiment_results/eval_results/diskann/*.json
aws/
.venv/
.cursor/rules/
*.egg-info/
skip_reorder_comparison/
analysis_results/
build/
.cache/
nprobe_logs/
micro/results
micro/contriever-INT8
data/*
!data/2501.14312v1 (1).pdf
!data/2506.08276v1.pdf
!data/PrideandPrejudice.txt
!data/huawei_pangu.md
!data/ground_truth/
!data/indices/
!data/queries/
!data/.gitattributes
*.qdstrm
benchmark_results/
results/
frac_*.png
final_in_*.png
embedding_comparison_results/
*.ind
*.gz
*.fvecs
*.ivecs
*.index
*.bin
*.old

read_graph
analyze_diskann_graph
degree_distribution.png
micro/degree_distribution.png

policy_results_*
results_*/
experiment_results/
.DS_Store

# The above are inherited from old Power RAG repo

# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv
.env

test_indices*/
test_*.py
!tests/**
packages/leann-backend-diskann/third_party/DiskANN/_deps/

*.meta.json
*.passages.json

batchtest.py
tests/__pytest_cache__/
tests/__pycache__/

CLAUDE.md
CLAUDE.local.md
.claude/*.local.*
.claude/local/*