Files

ww26 c3aceed1e0 metadata reveal for ast-chunking; smart detection of seq length in ollama; auto adjust chunk length for ast to prevent silent truncation (#157)

* feat: enhance token limits with dynamic discovery + AST metadata

Improves upon upstream PR #154 with two major enhancements:

1. **Hybrid Token Limit Discovery**
   - Dynamic: Query Ollama /api/show for context limits
   - Fallback: Registry for LM Studio/OpenAI
   - Zero maintenance for Ollama users
   - Respects custom num_ctx settings

2. **AST Metadata Preservation**
   - create_ast_chunks() returns dict format with metadata
   - Preserves file_path, file_name, timestamps
   - Includes astchunk metadata (line numbers, node counts)
   - Fixes content extraction bug (checks "content" key)
   - Enables --show-metadata flag

3. **Better Token Limits**
   - nomic-embed-text: 2048 tokens (vs 512)
   - nomic-embed-text-v1.5: 2048 tokens
   - Added OpenAI models: 8192 tokens

4. **Comprehensive Tests**
   - 11 tests for token truncation
   - 545 new lines in test_astchunk_integration.py
   - All metadata preservation tests passing
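
The hybrid lookup from item 1 can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `limit_from_show_response` and `resolve_token_limit` are hypothetical helpers, the response fields read here (a `parameters` text block containing `num_ctx`, and a `model_info` entry ending in `.context_length`) are assumptions about what `/api/show` returns, and taking the smaller value when both are present is also an assumption. The 512 default is a placeholder.

```python
import re
from typing import Optional


def limit_from_show_response(show: dict) -> Optional[int]:
    """Pull a context limit out of an /api/show-style response dict.

    Field names are assumptions: a "parameters" text block that may contain
    a "num_ctx <n>" line, and "model_info" keys ending in ".context_length".
    """
    candidates = []
    match = re.search(r"^\s*num_ctx\s+(\d+)", show.get("parameters") or "", re.MULTILINE)
    if match:
        candidates.append(int(match.group(1)))
    for key, value in (show.get("model_info") or {}).items():
        if key.endswith(".context_length") and isinstance(value, int):
            candidates.append(value)
    # Assumption: the effective limit is the smallest value reported.
    return min(candidates) if candidates else None


def resolve_token_limit(model, show, registry, default=512):
    """Dynamic discovery first, static registry second, default last."""
    discovered = limit_from_show_response(show) if show else None
    if discovered is not None:
        return discovered
    return registry.get(model, default)


# Registry values taken from the commit message above.
registry = {"nomic-embed-text": 2048, "nomic-embed-text-v1.5": 2048}
show = {"parameters": "num_ctx 1024", "model_info": {"nomic-bert.context_length": 2048}}
print(resolve_token_limit("nomic-embed-text", show, registry))  # custom num_ctx is respected
print(resolve_token_limit("nomic-embed-text", None, registry))  # registry fallback
```

Keeping the parsing separate from the HTTP call means the lookup order can be checked without a running Ollama server.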

* fix: merge EMBEDDING_MODEL_LIMITS and remove redundant validation

- Merged upstream's model list with our corrected token limits
- Kept our corrected nomic-embed-text: 2048 (not 512)
- Removed post-chunking validation (redundant with embedding-time truncation)
- All tests passing except 2 pre-existing integration test failures

* style: apply ruff formatting and restore PR #154 version handling

- Remove duplicate truncate_to_token_limit and get_model_token_limit functions
- Restore version handling logic (model:latest -> model) from PR #154
- Restore partial matching fallback for model name variations
- Apply ruff formatting to all modified files
- All 11 token truncation tests passing
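
A rough sketch of the version handling and partial matching described above, under stated assumptions: `lookup_token_limit` is a hypothetical stand-in (not the real `get_model_token_limit`), the OpenAI model name is an assumed example, and the 512 default is a placeholder. Only the 2048 and 8192 limits come from this commit message.

```python
# Limits quoted in the commit message; the OpenAI model name is an assumed example.
EMBEDDING_MODEL_LIMITS = {
    "nomic-embed-text": 2048,
    "nomic-embed-text-v1.5": 2048,
    "text-embedding-3-small": 8192,
}


def lookup_token_limit(model_name: str, default: int = 512) -> int:
    """Hypothetical lookup: exact name, then name without tag, then partial match."""
    if model_name in EMBEDDING_MODEL_LIMITS:
        return EMBEDDING_MODEL_LIMITS[model_name]
    # Version handling: "model:latest" -> "model".
    base_name = model_name.split(":", 1)[0]
    if base_name in EMBEDDING_MODEL_LIMITS:
        return EMBEDDING_MODEL_LIMITS[base_name]
    # Partial matching: try longer known names first so that
    # "nomic-embed-text-v1.5" is preferred over "nomic-embed-text".
    for known in sorted(EMBEDDING_MODEL_LIMITS, key=len, reverse=True):
        if known in base_name:
            return EMBEDDING_MODEL_LIMITS[known]
    return default


print(lookup_token_limit("nomic-embed-text:latest"))
print(lookup_token_limit("unknown-model:latest"))
```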

* style: sort imports alphabetically (pre-commit auto-fix)

* fix: show AST token limit warning only once per session

- Add module-level flag to track if warning shown
- Prevents spam when processing multiple files
- Add clarifying note that auto-truncation happens at embedding time
- Addresses issue where warning appeared for every code file
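
The module-level flag could look roughly like this. The function name, logger name, and message text are hypothetical; only the pattern (warn once, then stay quiet for the rest of the session) is taken from the notes above.

```python
import logging

logger = logging.getLogger("ast_chunking_sketch")

# Module-level flag: flips to True after the first warning in this process.
_token_limit_warning_shown = False


def warn_about_token_limit_once(chunk_tokens: int, model_limit: int) -> bool:
    """Log the warning at most once; return True only when it was emitted."""
    global _token_limit_warning_shown
    if chunk_tokens <= model_limit or _token_limit_warning_shown:
        return False
    logger.warning(
        "AST chunk size (%d tokens) exceeds the model limit (%d); "
        "chunks are truncated automatically at embedding time.",
        chunk_tokens,
        model_limit,
    )
    _token_limit_warning_shown = True
    return True


# Processing several files triggers the warning only for the first one.
emitted = [warn_about_token_limit_once(4096, 2048) for _ in range(3)]
```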

* enhance: add detailed logging for token truncation

- Track and report truncation statistics (count, tokens removed, max length)
- Show first 3 individual truncations with exact token counts
- Provide comprehensive summary when truncation occurs
- Use WARNING level for data loss visibility
- Silent (DEBUG level only) when no truncation needed

Replaces misleading "truncated where necessary" message that appeared
even when nothing was truncated.
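
As an illustration of the logging behaviour described in this entry, here is a minimal sketch. It assumes inputs are already tokenized into lists; `truncate_with_stats` and the message wording are hypothetical, not the project's code.

```python
import logging

logger = logging.getLogger("truncation_sketch")


def truncate_with_stats(token_lists, limit, max_reports=3):
    """Truncate each token list to `limit` and log a summary of what was cut."""
    truncated, count, tokens_removed, max_length = [], 0, 0, 0
    for index, tokens in enumerate(token_lists):
        original = len(tokens)
        if original > limit:
            count += 1
            tokens_removed += original - limit
            max_length = max(max_length, original)
            # Report only the first few truncations individually.
            if count <= max_reports:
                logger.warning("Text %d truncated: %d -> %d tokens", index, original, limit)
        truncated.append(tokens[:limit])
    if count:
        # WARNING level, because truncation means data loss.
        logger.warning(
            "Truncated %d of %d texts (%d tokens removed, longest was %d tokens)",
            count, len(token_lists), tokens_removed, max_length,
        )
    else:
        # Nothing was cut, so stay quiet unless DEBUG logging is enabled.
        logger.debug("No truncation needed for %d texts", len(token_lists))
    stats = {"count": count, "tokens_removed": tokens_removed, "max_length": max_length}
    return truncated, stats


batches = [list(range(5)), list(range(12)), list(range(9))]
truncated, stats = truncate_with_stats(batches, limit=8)
```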

2025-11-08 17:37:31 -08:00

README.md

Experiments (#68)

2025-09-24 11:19:04 -07:00

test_astchunk_integration.py

metadata reveal for ast-chunking; smart detection of seq length in ollama; auto adjust chunk length for ast to prevent silent truncation (#157)

2025-11-08 17:37:31 -08:00

test_basic.py

feat(core,diskann): robust embedding server (no-hang) + DiskANN fast mode (graph partition) (#29)

2025-08-14 01:02:24 -07:00

test_ci_minimal.py

refactor: Unify examples interface with BaseRAGExample (#12)

2025-08-03 23:06:24 -07:00

test_cli_ask.py

Allow 'leann ask' to accept a positional question (#116)

2025-09-23 21:18:57 -07:00

test_diskann_partition.py

feat(core,diskann): robust embedding server (no-hang) + DiskANN fast mode (graph partition) (#29)

2025-08-14 01:02:24 -07:00

test_document_rag.py

Add AST-aware code chunking for better code understanding (#58)

2025-08-19 23:35:31 -07:00

test_embedding_server_manager.py

Fix restart embedding server when passages change (#117)

2025-09-23 22:28:36 -07:00

test_mcp_integration.py

feat: Add MCP integration support for Slack and Twitter (#134)

2025-10-07 02:18:32 -07:00

test_mcp_standalone.py

feat: Add MCP integration support for Slack and Twitter (#134)

2025-10-07 02:18:32 -07:00

test_metadata_filtering.py

Metadata filtering feature (#75)

2025-08-20 19:57:56 -07:00

test_readme_examples.py

feat(core,diskann): robust embedding server (no-hang) + DiskANN fast mode (graph partition) (#29)

2025-08-14 01:02:24 -07:00

test_token_truncation.py

metadata reveal for ast-chunking; smart detection of seq length in ollama; auto adjust chunk length for ast to prevent silent truncation (#157)

2025-11-08 17:37:31 -08:00

README.md

LEANN Tests

This directory contains automated tests for the LEANN project using pytest.

Test Files

`test_readme_examples.py`

Tests the examples shown in README.md:

- The basic example code that users see first (parametrized for both the HNSW and DiskANN backends)
- That the import statements work correctly
- The different backend options (HNSW, DiskANN)
- The different LLM configuration options (parametrized for both backends)

All main README examples are tested with both the HNSW and DiskANN backends using pytest parametrization.

`test_basic.py`

Basic functionality tests that verify:

- All packages can be imported correctly
- C++ extensions (FAISS, DiskANN) load properly
- Basic index building and searching work for both the HNSW and DiskANN backends
- Both backends are covered by the same parametrized tests
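
For readers unfamiliar with pytest parametrization, the pattern looks roughly like the sketch below. `build_and_search` is a hypothetical placeholder, not LEANN's API, and the body of the real `test_backend_basic` is not shown in this README; the sketch only illustrates how one test function yields one test per backend.

```python
import pytest


# Hypothetical stand-in for building and searching an index with one backend.
def build_and_search(backend_name: str) -> str:
    return f"searched with {backend_name}"


@pytest.mark.parametrize("backend_name", ["hnsw", "diskann"])
def test_backend_basic(backend_name):
    # pytest generates one test per value: test_backend_basic[hnsw]
    # and test_backend_basic[diskann].
    assert build_and_search(backend_name) == f"searched with {backend_name}"
```

The bracketed suffix in each generated test ID is what makes it possible to select a single backend from the command line.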

`test_document_rag.py`

Tests the document RAG example functionality:

- Tests with facebook/contriever embeddings
- Tests with OpenAI embeddings (if API key is available)
- Tests error handling with invalid parameters
- Verifies that normalized embeddings are detected and cosine distance is used
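
The last check can be pictured with a small sketch. It assumes that "normalized" means every vector has an L2 norm of about 1; the helper names, the tolerance, and the `"mips"` fallback label are illustrative assumptions rather than LEANN's actual detection logic.

```python
import math


def looks_normalized(vectors, tolerance=1e-3):
    """Return True if every vector has an L2 norm within `tolerance` of 1."""
    if not vectors:
        return False
    return all(
        abs(math.sqrt(sum(x * x for x in vector)) - 1.0) <= tolerance
        for vector in vectors
    )


def choose_distance_metric(vectors, fallback="mips"):
    """Pick cosine distance for unit-length embeddings, else the fallback."""
    return "cosine" if looks_normalized(vectors) else fallback


unit_vectors = [[0.6, 0.8], [1.0, 0.0]]
raw_vectors = [[3.0, 4.0], [1.0, 0.0]]
print(choose_distance_metric(unit_vectors))
print(choose_distance_metric(raw_vectors))
```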

`test_diskann_partition.py`

Tests DiskANN graph partitioning functionality:

- Tests DiskANN index building without partitioning (baseline)
- Tests automatic graph partitioning with `is_recompute=True`
- Verifies that partition files are created and that large files are cleaned up to save storage
- Tests search functionality with partitioned indices
- Validates medoid and max_base_norm file generation and usage
- Includes a performance comparison between DiskANN (with partition) and HNSW

Note: These tests are skipped in CI due to hardware requirements and computation time.

Running Tests

Install test dependencies:

# Using uv dependency groups (tools only)
uv sync --only-group test

Run all tests:

pytest tests/

# Or with coverage (needs the pytest-cov plugin)
pytest tests/ --cov=leann --cov-report=html

# Run in parallel (faster; needs the pytest-xdist plugin)
pytest tests/ -n auto

Run specific tests:

# Only basic tests
pytest tests/test_basic.py

# Only tests that don't require OpenAI
pytest tests/ -m "not openai"

# Skip slow tests
pytest tests/ -m "not slow"

# Run DiskANN partition tests (requires local machine, not CI)
pytest tests/test_diskann_partition.py

Run with specific backend:

# Test only HNSW backend (IDs are quoted so shells such as zsh do not glob the brackets)
pytest "tests/test_basic.py::test_backend_basic[hnsw]"
pytest "tests/test_readme_examples.py::test_readme_basic_example[hnsw]"

# Test only DiskANN backend
pytest "tests/test_basic.py::test_backend_basic[diskann]"
pytest "tests/test_readme_examples.py::test_readme_basic_example[diskann]"

# All DiskANN tests (parametrized + specialized partition tests)
pytest tests/ -k diskann

CI/CD Integration

Tests are automatically run in GitHub Actions:

- After building wheel packages
- On multiple Python versions (3.9 - 3.13)
- On both Ubuntu and macOS
- Using pytest with appropriate markers and flags

pytest.ini Configuration

The pytest.ini file configures:

- Test discovery paths
- Default timeout (600 seconds)
- Environment variables (`HF_HUB_DISABLE_SYMLINKS`, `TOKENIZERS_PARALLELISM`)
- Custom markers for slow and OpenAI tests
- Verbose output with short tracebacks
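
To make the list concrete, the sketch below embeds an illustrative `pytest.ini` in Python and parses it with `configparser`, purely to show the shape of such a file. Apart from the 600-second timeout, every value is a placeholder (including the environment variable values and the marker descriptions), and the real file may differ. It also assumes the `timeout` key is handled by the pytest-timeout plugin and the `env` key by pytest-env.

```python
import configparser

# Illustrative pytest.ini contents. Only the 600-second timeout comes from
# the list above; every other value is a placeholder.
EXAMPLE_PYTEST_INI = """
[pytest]
testpaths = tests
timeout = 600
addopts = -v --tb=short
env =
    HF_HUB_DISABLE_SYMLINKS=1
    TOKENIZERS_PARALLELISM=false
markers =
    slow: marks tests as slow
    openai: marks tests that need an OpenAI API key
"""

parser = configparser.ConfigParser()
parser.read_string(EXAMPLE_PYTEST_INI)
section = parser["pytest"]

# Multi-line values come back as one newline-separated string.
marker_names = [
    line.split(":")[0] for line in section["markers"].split("\n") if line.strip()
]
print(section.getint("timeout"), marker_names)
```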

Known Issues

OpenAI tests are automatically skipped if no API key is provided
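
One common way to get this behaviour in pytest is a `skipif` marker, sketched below. This is an assumption about the mechanism, not a copy of the project's tests: the `OPENAI_API_KEY` variable name, the `requires_openai` helper, and the placeholder test are all hypothetical.

```python
import os

import pytest

# Assumption: the key is read from the OPENAI_API_KEY environment variable.
requires_openai = pytest.mark.skipif(
    not os.environ.get("OPENAI_API_KEY"),
    reason="OPENAI_API_KEY is not set",
)


@pytest.mark.openai  # lets `pytest -m "not openai"` deselect the test
@requires_openai
def test_openai_embeddings_placeholder():
    # Placeholder body; a real test would exercise the OpenAI embedding path.
    assert os.environ.get("OPENAI_API_KEY")
```

With this pattern a missing key shows up as a skipped test in the report rather than a failure.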