* feat: enhance token limits with dynamic discovery + AST metadata

  Improves upon upstream PR #154 with two major enhancements:

  1. **Hybrid Token Limit Discovery**
     - Dynamic: Query Ollama /api/show for context limits
     - Fallback: Registry for LM Studio/OpenAI
     - Zero maintenance for Ollama users
     - Respects custom num_ctx settings

  2. **AST Metadata Preservation**
     - create_ast_chunks() returns dict format with metadata
     - Preserves file_path, file_name, timestamps
     - Includes astchunk metadata (line numbers, node counts)
     - Fixes content extraction bug (checks "content" key)
     - Enables --show-metadata flag

  3. **Better Token Limits**
     - nomic-embed-text: 2048 tokens (vs 512)
     - nomic-embed-text-v1.5: 2048 tokens
     - Added OpenAI models: 8192 tokens

  4. **Comprehensive Tests**
     - 11 tests for token truncation
     - 545 new lines in test_astchunk_integration.py
     - All metadata preservation tests passing

* fix: merge EMBEDDING_MODEL_LIMITS and remove redundant validation

  - Merged upstream's model list with our corrected token limits
  - Kept our corrected nomic-embed-text: 2048 (not 512)
  - Removed post-chunking validation (redundant with embedding-time truncation)
  - All tests passing except 2 pre-existing integration test failures

* style: apply ruff formatting and restore PR #154 version handling

  - Remove duplicate truncate_to_token_limit and get_model_token_limit functions
  - Restore version handling logic (model:latest -> model) from PR #154
  - Restore partial matching fallback for model name variations
  - Apply ruff formatting to all modified files
  - All 11 token truncation tests passing

* style: sort imports alphabetically (pre-commit auto-fix)

* fix: show AST token limit warning only once per session

  - Add module-level flag to track if warning shown
  - Prevents spam when processing multiple files
  - Add clarifying note that auto-truncation happens at embedding time
  - Addresses issue where warning appeared for every code file

* enhance: add detailed logging for token truncation

  - Track and report truncation statistics (count, tokens removed, max length)
  - Show first 3 individual truncations with exact token counts
  - Provide comprehensive summary when truncation occurs
  - Use WARNING level for data loss visibility
  - Silent (DEBUG level only) when no truncation needed

  Replaces misleading "truncated where necessary" message that appeared
  even when nothing was truncated.
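The hybrid discovery described in the first commit can be sketched as follows. This is a hypothetical illustration, not the repository's actual implementation: the registry contents, function name, and fallback behavior are assumptions based on the commit notes; only Ollama's `/api/show` endpoint is a real API, and the shape of its `model_info` response varies by model architecture.

```python
import json
import urllib.request

# Static fallback registry for backends without a discovery endpoint
# (LM Studio / OpenAI). Values taken from the commit notes above.
EMBEDDING_MODEL_LIMITS = {
    "nomic-embed-text": 2048,
    "nomic-embed-text-v1.5": 2048,
    "text-embedding-3-small": 8192,
}


def get_model_token_limit(model: str, ollama_url=None, default: int = 512) -> int:
    """Try dynamic discovery via Ollama's /api/show, then the static registry."""
    if ollama_url:
        try:
            req = urllib.request.Request(
                f"{ollama_url}/api/show",
                data=json.dumps({"name": model}).encode(),
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req, timeout=5) as resp:
                info = json.load(resp).get("model_info", {})
            # Keys look like "<arch>.context_length", e.g. "bert.context_length".
            for key, value in info.items():
                if key.endswith("context_length"):
                    return int(value)
        except Exception:
            pass  # network or parse failure: fall through to the registry
    base = model.split(":")[0]  # version handling: "model:latest" -> "model"
    return EMBEDDING_MODEL_LIMITS.get(base, default)
```

Because discovery reads whatever context length the running Ollama server reports, a user's custom `num_ctx` setting is respected without any registry maintenance.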
48 lines · 1.4 KiB · Python
"""Unified chunking utilities facade.
|
|
|
|
This module re-exports the packaged utilities from `leann.chunking_utils` so
|
|
that both repo apps (importing `chunking`) and installed wheels share one
|
|
single implementation. When running from the repo without installation, it
|
|
adds the `packages/leann-core/src` directory to `sys.path` as a fallback.
|
|
"""
|
|
|
|
import sys
|
|
from pathlib import Path
|
|
|
|
try:
|
|
from leann.chunking_utils import (
|
|
CODE_EXTENSIONS,
|
|
_traditional_chunks_as_dicts,
|
|
create_ast_chunks,
|
|
create_text_chunks,
|
|
create_traditional_chunks,
|
|
detect_code_files,
|
|
get_language_from_extension,
|
|
)
|
|
except Exception: # pragma: no cover - best-effort fallback for dev environment
|
|
repo_root = Path(__file__).resolve().parents[2]
|
|
leann_src = repo_root / "packages" / "leann-core" / "src"
|
|
if leann_src.exists():
|
|
sys.path.insert(0, str(leann_src))
|
|
from leann.chunking_utils import (
|
|
CODE_EXTENSIONS,
|
|
_traditional_chunks_as_dicts,
|
|
create_ast_chunks,
|
|
create_text_chunks,
|
|
create_traditional_chunks,
|
|
detect_code_files,
|
|
get_language_from_extension,
|
|
)
|
|
else:
|
|
raise
|
|
|
|
__all__ = [
|
|
"CODE_EXTENSIONS",
|
|
"_traditional_chunks_as_dicts",
|
|
"create_ast_chunks",
|
|
"create_text_chunks",
|
|
"create_traditional_chunks",
|
|
"detect_code_files",
|
|
"get_language_from_extension",
|
|
]
|
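The detailed truncation logging from the last commit can be sketched as below. This is a minimal illustration under stated assumptions: the function name is hypothetical, and whitespace splitting stands in for a real tokenizer. It shows the logging shape the commit describes: WARNING for the first three individual truncations and the final summary, DEBUG-only silence when nothing was truncated.

```python
import logging

logger = logging.getLogger(__name__)


def truncate_chunks(chunks, limit):
    """Truncate over-limit chunks, logging statistics only when data is lost."""
    truncated = 0
    tokens_removed = 0
    max_len = 0
    out = []
    for text in chunks:
        tokens = text.split()  # illustrative stand-in for a real tokenizer
        max_len = max(max_len, len(tokens))
        if len(tokens) > limit:
            if truncated < 3:  # show the first 3 individual truncations
                logger.warning("Truncating chunk: %d -> %d tokens", len(tokens), limit)
            truncated += 1
            tokens_removed += len(tokens) - limit
            text = " ".join(tokens[:limit])
        out.append(text)
    if truncated:
        # Comprehensive summary, at WARNING level for data-loss visibility.
        logger.warning(
            "Truncated %d/%d chunks (%d tokens removed, longest chunk %d tokens)",
            truncated, len(chunks), tokens_removed, max_len,
        )
    else:
        # Silent at normal levels when nothing was truncated.
        logger.debug("No truncation needed (longest chunk %d tokens)", max_len)
    return out
```

Keeping the no-truncation path at DEBUG is what removes the misleading "truncated where necessary" message: the WARNING only ever appears when tokens were actually dropped.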