* feat: enhance token limits with dynamic discovery + AST metadata
Improves upon upstream PR #154 with the following enhancements:
1. **Hybrid Token Limit Discovery**
- Dynamic: Query Ollama /api/show for context limits
- Fallback: Registry for LM Studio/OpenAI
- Zero maintenance for Ollama users
- Respects custom num_ctx settings
2. **AST Metadata Preservation**
- create_ast_chunks() returns dict format with metadata
- Preserves file_path, file_name, timestamps
- Includes astchunk metadata (line numbers, node counts)
- Fixes content extraction bug (checks "content" key)
- Enables --show-metadata flag
3. **Better Token Limits**
- nomic-embed-text: 2048 tokens (vs 512)
- nomic-embed-text-v1.5: 2048 tokens
- Added OpenAI models: 8192 tokens
4. **Comprehensive Tests**
- 11 tests for token truncation
- 545 new lines in test_astchunk_integration.py
- All metadata preservation tests passing
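For illustration, the hybrid lookup described in item 1 might look roughly like the sketch below. It is not the code from this change: the helper names, the `FALLBACK_LIMITS` table, and the assumed shape of the `/api/show` response (a `parameters` string that may contain `num_ctx`, and a `model_info` dict with an `<architecture>.context_length` entry) are assumptions. The HTTP call itself is left out so the sketch runs offline.

```python
import re

# Hypothetical fallback registry; the two values come from this commit message.
FALLBACK_LIMITS = {
    "nomic-embed-text": 2048,
    "nomic-embed-text-v1.5": 2048,
}


def limit_from_show_payload(payload):
    """Extract a context limit from an /api/show-style response dict.

    The response layout is an assumption, not a verified contract.
    """
    # Assumption: a custom num_ctx in the model parameters takes priority.
    params = payload.get("parameters") or ""
    match = re.search(r"^\s*num_ctx\s+(\d+)", params, re.MULTILINE)
    if match:
        return int(match.group(1))
    # Otherwise look for an "<architecture>.context_length" entry.
    for key, value in (payload.get("model_info") or {}).items():
        if key.endswith(".context_length") and isinstance(value, int):
            return value
    return None


def resolve_token_limit(model, show_payload=None):
    """Prefer the dynamically discovered limit, then the static registry."""
    if show_payload is not None:
        limit = limit_from_show_payload(show_payload)
        if limit is not None:
            return limit
    # Returns None for models that are in neither source.
    return FALLBACK_LIMITS.get(model)
```

Passing `show_payload=None` models the LM Studio/OpenAI case, where only the registry is consulted.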
* fix: merge EMBEDDING_MODEL_LIMITS and remove redundant validation
- Merged upstream's model list with our corrected token limits
- Kept our corrected nomic-embed-text: 2048 (not 512)
- Removed post-chunking validation (redundant with embedding-time truncation)
- All tests passing except 2 pre-existing integration test failures
* style: apply ruff formatting and restore PR #154 version handling
- Remove duplicate truncate_to_token_limit and get_model_token_limit functions
- Restore version handling logic (model:latest -> model) from PR #154
- Restore partial matching fallback for model name variations
- Apply ruff formatting to all modified files
- All 11 token truncation tests passing
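A minimal sketch of the version handling and partial-matching fallback, assuming a plain dict registry. `lookup_token_limit` and its signature are hypothetical and stand in for the project's own lookup function. The "longest known name wins" rule is an assumption about how ambiguous partial matches should be resolved.

```python
# Registry values taken from this commit message.
EMBEDDING_MODEL_LIMITS = {
    "nomic-embed-text": 2048,
    "nomic-embed-text-v1.5": 2048,
}


def lookup_token_limit(model_name, limits=EMBEDDING_MODEL_LIMITS, default=None):
    """Resolve a token limit: exact name, then base name, then partial match."""
    # 1. Exact match.
    if model_name in limits:
        return limits[model_name]
    # 2. Strip a version tag such as ":latest" (model:latest -> model).
    base = model_name.split(":", 1)[0]
    if base in limits:
        return limits[base]
    # 3. Partial match: a known name contained in the requested name.
    #    Longer known names are tried first so the most specific entry wins.
    for known in sorted(limits, key=len, reverse=True):
        if known in base:
            return limits[known]
    return default
```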
* style: sort imports alphabetically (pre-commit auto-fix)
* fix: show AST token limit warning only once per session
- Add module-level flag to track if warning shown
- Prevents spam when processing multiple files
- Add clarifying note that auto-truncation happens at embedding time
- Addresses issue where warning appeared for every code file
* enhance: add detailed logging for token truncation
- Track and report truncation statistics (count, tokens removed, max length)
- Show first 3 individual truncations with exact token counts
- Provide comprehensive summary when truncation occurs
- Use WARNING level for data loss visibility
- Silent (DEBUG level only) when no truncation needed
Replaces misleading "truncated where necessary" message that appeared
even when nothing was truncated.
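A rough sketch of the logging behaviour described above. Whitespace splitting stands in for the real tokenizer, and `truncate_with_stats` is a hypothetical name, so the counts are illustrative rather than real token counts.

```python
import logging

logger = logging.getLogger(__name__)


def truncate_with_stats(texts, limit, max_detailed=3):
    """Truncate each text to `limit` tokens and report what was lost."""
    truncated_count = 0
    tokens_removed = 0
    max_original_length = 0
    result = []
    for index, text in enumerate(texts):
        tokens = text.split()  # stand-in tokenizer (assumption)
        if len(tokens) <= limit:
            result.append(text)
            continue
        truncated_count += 1
        tokens_removed += len(tokens) - limit
        max_original_length = max(max_original_length, len(tokens))
        # Only the first few truncations are reported individually.
        if truncated_count <= max_detailed:
            logger.warning(
                "Text %d truncated: %d -> %d tokens", index, len(tokens), limit
            )
        result.append(" ".join(tokens[:limit]))
    stats = {
        "truncated_count": truncated_count,
        "tokens_removed": tokens_removed,
        "max_original_length": max_original_length,
    }
    if truncated_count:
        # WARNING level, because truncation means data loss.
        logger.warning(
            "Truncated %d of %d texts (%d tokens removed, longest was %d tokens)",
            truncated_count, len(texts), tokens_removed, max_original_length,
        )
    else:
        # Silent at default log levels when nothing was truncated.
        logger.debug("No truncation needed for %d texts", len(texts))
    return result, stats
```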
* feat(core): Add AST-aware code chunking with astchunk integration
This PR introduces intelligent code chunking that preserves semantic boundaries
(functions, classes, methods) for better code understanding in RAG applications.
Key Features:
- AST-aware chunking for Python, Java, C#, TypeScript files
- Graceful fallback to traditional chunking for unsupported languages
- New specialized code RAG application for repositories
- Enhanced CLI with --use-ast-chunking flag
- Comprehensive test suite with integration tests
Technical Implementation:
- New chunking_utils.py module with enhanced chunking logic
- Extended base RAG framework with AST chunking arguments
- Updated document RAG with --enable-code-chunking flag
- CLI integration with proper error handling and fallback
Benefits:
- Better semantic understanding of code structure
- Improved search quality for code-related queries
- Maintains backward compatibility with existing workflows
- Supports mixed content (code + documentation) seamlessly
Dependencies:
- Added astchunk and tree-sitter parsers to pyproject.toml
- All dependencies are optional - fallback works without them
Testing:
- Comprehensive test suite in test_astchunk_integration.py
- Integration tests with document RAG
- Error handling and edge case coverage
Documentation:
- Updated README.md with AST chunking highlights
- Added ASTCHUNK_INTEGRATION.md with complete guide
- Updated features.md with new capabilities
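To illustrate the idea of AST-aware chunking with graceful fallback, here is a small sketch that uses Python's standard `ast` module in place of astchunk and tree-sitter. It is not the astchunk API, and it handles Python source only. The function names, the dict layout, and the 40-line fallback window are assumptions.

```python
import ast


def _line_chunks(source, size):
    """Traditional fallback: fixed-size windows of lines."""
    lines = source.splitlines()
    return [
        {
            "content": "\n".join(lines[i:i + size]),
            "metadata": {"strategy": "lines", "start_line": i + 1},
        }
        for i in range(0, len(lines), size)
    ]


def chunk_python_source(source, fallback_lines=40):
    """Split source at top-level function/class boundaries when it parses."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return _line_chunks(source, fallback_lines)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # Only top-level definitions become chunks in this sketch; other
        # module-level statements and decorator lines are not included.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "content": "\n".join(lines[node.lineno - 1:node.end_lineno]),
                "metadata": {
                    "strategy": "ast",
                    "name": node.name,
                    "start_line": node.lineno,
                    "end_line": node.end_lineno,
                },
            })
    # Files without any definitions also use the line-based fallback.
    return chunks or _line_chunks(source, fallback_lines)
```

Each chunk is a dict with a `content` key and a `metadata` dict, so callers can keep line numbers alongside the text.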
* Refactored chunk utils
* Remove useless import
* Update README.md
* Update apps/chunking/utils.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update apps/code_rag.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fix issue
* apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fixes after PR review
* Fix tests not passing
* Fix linter error for documentation files
* Update .gitignore with unwanted files
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andy Lee <andylizf@outlook.com>