Commit Graph

3 Commits

Author SHA1 Message Date
ww26
c3aceed1e0 metadata reveal for ast-chunking; smart detection of seq length in ollama; auto adjust chunk length for ast to prevent silent truncation (#157)
* feat: enhance token limits with dynamic discovery + AST metadata

Improves upon upstream PR #154 with two major enhancements:

1. **Hybrid Token Limit Discovery**
   - Dynamic: Query Ollama /api/show for context limits
   - Fallback: Registry for LM Studio/OpenAI
   - Zero maintenance for Ollama users
   - Respects custom num_ctx settings

2. **AST Metadata Preservation**
   - create_ast_chunks() returns dict format with metadata
   - Preserves file_path, file_name, timestamps
   - Includes astchunk metadata (line numbers, node counts)
   - Fixes content extraction bug (checks "content" key)
   - Enables --show-metadata flag

3. **Better Token Limits**
   - nomic-embed-text: 2048 tokens (vs 512)
   - nomic-embed-text-v1.5: 2048 tokens
   - Added OpenAI models: 8192 tokens

4. **Comprehensive Tests**
   - 11 tests for token truncation
   - 545 new lines in test_astchunk_integration.py
   - All metadata preservation tests passing

* fix: merge EMBEDDING_MODEL_LIMITS and remove redundant validation

- Merged upstream's model list with our corrected token limits
- Kept our corrected nomic-embed-text: 2048 (not 512)
- Removed post-chunking validation (redundant with embedding-time truncation)
- All tests passing except 2 pre-existing integration test failures

* style: apply ruff formatting and restore PR #154 version handling

- Remove duplicate truncate_to_token_limit and get_model_token_limit functions
- Restore version handling logic (model:latest -> model) from PR #154
- Restore partial matching fallback for model name variations
- Apply ruff formatting to all modified files
- All 11 token truncation tests passing

* style: sort imports alphabetically (pre-commit auto-fix)

* fix: show AST token limit warning only once per session

- Add module-level flag to track if warning shown
- Prevents spam when processing multiple files
- Add clarifying note that auto-truncation happens at embedding time
- Addresses issue where warning appeared for every code file

* enhance: add detailed logging for token truncation

- Track and report truncation statistics (count, tokens removed, max length)
- Show first 3 individual truncations with exact token counts
- Provide comprehensive summary when truncation occurs
- Use WARNING level for data loss visibility
- Silent (DEBUG level only) when no truncation needed

Replaces misleading "truncated where necessary" message that appeared
even when nothing was truncated.
2025-11-08 17:37:31 -08:00
Andy Lee
e93c0dec6f [Fix] Enable AST chunking when installed (package chunking utils) (#101)
* fix(core): package chunking utils for AST chunking; re-export in apps; CLI imports packaged utils

* style

* chore: fix ruff warnings (RUF059, F401)

* style
2025-09-17 18:44:00 -07:00
Gabriel Dehan
13bb561aad Add AST-aware code chunking for better code understanding (#58)
* feat(core): Add AST-aware code chunking with astchunk integration

This PR introduces intelligent code chunking that preserves semantic boundaries
(functions, classes, methods) for better code understanding in RAG applications.

Key Features:
- AST-aware chunking for Python, Java, C#, TypeScript files
- Graceful fallback to traditional chunking for unsupported languages
- New specialized code RAG application for repositories
- Enhanced CLI with --use-ast-chunking flag
- Comprehensive test suite with integration tests

Technical Implementation:
- New chunking_utils.py module with enhanced chunking logic
- Extended base RAG framework with AST chunking arguments
- Updated document RAG with --enable-code-chunking flag
- CLI integration with proper error handling and fallback

Benefits:
- Better semantic understanding of code structure
- Improved search quality for code-related queries
- Maintains backward compatibility with existing workflows
- Supports mixed content (code + documentation) seamlessly

Dependencies:
- Added astchunk and tree-sitter parsers to pyproject.toml
- All dependencies are optional - fallback works without them

Testing:
- Comprehensive test suite in test_astchunk_integration.py
- Integration tests with document RAG
- Error handling and edge case coverage

Documentation:
- Updated README.md with AST chunking highlights
- Added ASTCHUNK_INTEGRATION.md with complete guide
- Updated features.md with new capabilities

* Refactored chunk utils

* Remove useless import

* Update README.md

* Update apps/chunking/utils.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update apps/code_rag.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fix issue

* apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Fixes after pr review

* Fix tests not passing

* Fix linter error for documentation files

* Update .gitignore with unwanted files

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andy Lee <andylizf@outlook.com>
2025-08-19 23:35:31 -07:00