Compare commits

...

10 Commits

Author SHA1 Message Date
Andy Lee
679848a3b7 simplify: Make templates more concise and user-friendly 2025-09-19 13:51:15 -07:00
Andy Lee
da811061f4 feat: Add GitHub PR and issue templates for better contributor experience 2025-09-19 11:56:40 -07:00
Andy Lee
e93c0dec6f [Fix] Enable AST chunking when installed (package chunking utils) (#101)
* fix(core): package chunking utils for AST chunking; re-export in apps; CLI imports packaged utils

* style

* chore: fix ruff warnings (RUF059, F401)

* style
2025-09-17 18:44:00 -07:00
GitHub Actions
c5a29f849a chore: release v0.3.4 2025-09-16 20:45:22 +00:00
Yichuan Wang
3b8dc6368e Ast fork (#92) 2025-09-08 18:43:31 -07:00
Aiden Huang
e309f292de docs(mcp): add root llms.txt for MCP discovery; update MCP README to reference it; refs #76 (#91) 2025-09-07 14:39:58 -07:00
AWS Mcleod
0d9f92ea0f Add grep search functionality - Issue #86 (#87)
* Add grep search functionality to LeannSearcher

- Add use_grep parameter to search method
- Implement grep-based search on .jsonl files
- Add fallback Python regex search
- Support same SearchResult format as semantic search

Addresses issue #86

* fix: resolve linting errors

* docs: add grep search example

* docs: add grep search to README examples

* refactor: remove regex fallback, move grep example to features section

* docs: add grep search to Advanced Features with comprehensive guide
2025-09-05 13:48:07 -07:00
GitHub Actions
b0b353d279 chore: release v0.3.3 2025-09-02 21:29:56 +00:00
Andy Lee
4dffdfedbe feat: Add ARM64 Linux wheel support for leann-backend-hnsw (#83)
* feat: Add ARM64 Linux wheel support for leann-backend-hnsw

* fix: Use OpenBLAS for ARM64 Linux builds instead of Intel MKL

* fix: Configure Faiss with SVE optimization for ARM64 builds

- Set FAISS_OPT_LEVEL to "sve" for ARM64 architecture
- Disable x86-specific SIMD instructions (AVX2, AVX512, SSE4.1)
- Use ARM64-native SVE optimization as per Faiss conda build scripts
- Add architecture detection and proper configuration messages

Fixes compilation error: "xmmintrin.h: No such file or directory"
on ubuntu-24.04-arm runners.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Apply ARM64 compatibility fix directly to Faiss submodule

- Modify faiss/impl/pq.cpp to use x86-specific preprocessor conditions
- Remove patch file approach in favor of direct submodule modification
- Update CMakeLists.txt to reflect the submodule changes
- Fixes ARM64 Linux compilation by preventing x86 SIMD header inclusion

This resolves the "xmmintrin.h: No such file or directory" error
when building ARM64 Linux wheels for Docker compatibility.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* chore: Update Faiss submodule to include ARM64 compatibility fix

- Points to commit ed96ff7d with x86-specific preprocessor conditions
- Enables successful ARM64 Linux wheel builds

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* retrigger ci

* fix: Use different optimization levels for ARM64 based on platform

- Use SVE optimization only for ARM64 Linux
- Use generic optimization for ARM64 macOS to avoid clang SVE issues
- Fixes macOS ARM64 compilation errors with SVE instructions

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* feat: Update DiskANN submodule with OpenBLAS fallback support

- Points to commit 5c396c4 with ARM64 Linux OpenBLAS support
- Enables DiskANN to build on ARM64 Linux using standard BLAS libraries
- Resolves Intel MKL dependency issues for Docker ARM64 deployments

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Update DiskANN submodule with ZeroMQ polling configuration

- Points to commit 3a1016e with explicit polling method setup
- Resolves ZeroMQ autodetection issues on ARM64 Linux
- Ensures stable cross-platform ZeroMQ builds

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* retrigger ci

* fix: Update DiskANN submodule with ARM64 compiler flags fix

- Points to commit a0dc600 with architecture-specific compiler flags
- Removes x86 SIMD flags on ARM64 Linux to fix compilation errors
- Enables successful ARM64 Linux wheel builds

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Update DiskANN submodule with ARM64 compiler flags fix

- Points to commit 0921664 with architecture-specific compiler flags
- Removes x86 SIMD flags on ARM64 Linux to fix compilation errors
- Enables successful ARM64 Linux wheel builds

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* retrigger ci

* fix: Update DiskANN submodule with cross-platform prefetch support

- Points to commit 39192d6 with unified prefetch macros
- Replaces all Intel-specific _mm_prefetch calls with cross-platform macros
- Enables ARM64 Linux compatibility while maintaining x86 performance
- Resolves all remaining compilation errors for ARM64 builds

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Update DiskANN submodule with corrected ARM64 compatibility fixes

- Points to commit 3cb87a8 with proper x86 platform detection
- Includes ARM64 fallback for AVXDistanceInnerProductFloat function
- Resolves all remaining '__m256 was not declared' compilation errors
- Enables successful ARM64 Linux wheel builds for Docker compatibility

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Update DiskANN submodule with template type handling fix

- Points to commit d396bc3 with corrected template type handling
- Fixes DistanceInnerProduct template instantiation for int8_t/uint8_t types
- Resolves 'cannot convert const signed char* to const float*' error
- Completes ARM64 Linux compilation compatibility

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Update DiskANN submodule with DistanceFastL2::norm template fix

- Points to commit 69d9a99 with corrected template type handling
- Fixes DistanceFastL2::norm template instantiation for int8_t/uint8_t types
- Resolves another 'cannot convert const signed char* to const float*' error
- Continues ARM64 Linux compilation compatibility improvements

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Update DiskANN submodule with LAPACKE header detection

- Points to commit 64a9e01 with LAPACKE header path configuration
- Adds pkg-config based detection for LAPACKE include directories
- Resolves 'lapacke.h: No such file or directory' compilation error
- Completes OpenBLAS integration for ARM64 Linux builds

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Update DiskANN submodule with enhanced LAPACKE header detection

- Points to commit 18d0721 with fallback LAPACKE header search paths
- Checks multiple standard locations for lapacke.h on various systems
- Improves ARM64 Linux compatibility for OpenBLAS builds
- Should resolve 'lapacke.h: No such file or directory' errors

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Add liblapacke-dev package for ARM64 Linux builds

- Add liblapacke-dev to ARM64 dependencies alongside libopenblas-dev
- Provides lapacke.h header file needed for LAPACK C interface
- Fixes 'lapacke.h: No such file or directory' compilation error
- Enables complete OpenBLAS + LAPACKE support for ARM64 wheel builds

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Update DiskANN submodule with cosine_similarity.h x86 intrinsics fix

- Points to commit dbb17eb with corrected conditional compilation
- Fixes immintrin.h inclusion for ARM64 compatibility in cosine_similarity.h
- Resolves 'immintrin.h: No such file or directory' error
- Continues systematic ARM64 Linux compilation fixes

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Update DiskANN submodule with LAPACKE library linking fix

- Points to commit 19f9603 with explicit LAPACKE library discovery and linking
- Resolves 'undefined symbol: LAPACKE_sgesdd' runtime error on ARM64 Linux
- Completes ARM64 Linux wheel build compatibility for Docker deployments

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-09-02 14:27:06 -07:00
Yichuan Wang
d41e467df9 [CLI] More robust leann list and leann build (#84)
* chore(submodule): bump faiss to latest storage-efficient build

* [chore] add slack to share use case

* [cli] better gitignore / better leann list

* [cli] fix # 81
2025-09-01 18:36:27 -07:00
28 changed files with 675 additions and 185 deletions

50
.github/ISSUE_TEMPLATE/bug_report.yml vendored Normal file
View File

@@ -0,0 +1,50 @@
name: Bug Report
description: Report a bug in LEANN
labels: ["bug"]
body:
- type: textarea
id: description
attributes:
label: What happened?
description: A clear description of the bug
validations:
required: true
- type: textarea
id: reproduce
attributes:
label: How to reproduce
placeholder: |
1. Install with...
2. Run command...
3. See error
validations:
required: true
- type: textarea
id: error
attributes:
label: Error message
description: Paste any error messages
render: shell
- type: input
id: version
attributes:
label: LEANN Version
placeholder: "0.1.0"
validations:
required: true
- type: dropdown
id: os
attributes:
label: Operating System
options:
- macOS
- Linux
- Windows
- Docker
validations:
required: true

8
.github/ISSUE_TEMPLATE/config.yml vendored Normal file
View File

@@ -0,0 +1,8 @@
blank_issues_enabled: true
contact_links:
- name: Documentation
url: https://github.com/LEANN-RAG/LEANN-RAG/tree/main/docs
about: Read the docs first
- name: Discussions
url: https://github.com/LEANN-RAG/LEANN-RAG/discussions
about: Ask questions and share ideas

View File

@@ -0,0 +1,27 @@
name: Feature Request
description: Suggest a new feature for LEANN
labels: ["enhancement"]
body:
- type: textarea
id: problem
attributes:
label: What problem does this solve?
description: Describe the problem or need
validations:
required: true
- type: textarea
id: solution
attributes:
label: Proposed solution
description: How would you like this to work?
validations:
required: true
- type: textarea
id: example
attributes:
label: Example usage
description: Show how the API might look
render: python

13
.github/pull_request_template.md vendored Normal file
View File

@@ -0,0 +1,13 @@
## What does this PR do?
<!-- Brief description of your changes -->
## Related Issues
Fixes #
## Checklist
- [ ] Tests pass (`uv run pytest`)
- [ ] Code formatted (`ruff format` and `ruff check`)
- [ ] Pre-commit hooks pass (`pre-commit run --all-files`)

View File

@@ -54,6 +54,17 @@ jobs:
python: '3.12' python: '3.12'
- os: ubuntu-22.04 - os: ubuntu-22.04
python: '3.13' python: '3.13'
# ARM64 Linux builds
- os: ubuntu-24.04-arm
python: '3.9'
- os: ubuntu-24.04-arm
python: '3.10'
- os: ubuntu-24.04-arm
python: '3.11'
- os: ubuntu-24.04-arm
python: '3.12'
- os: ubuntu-24.04-arm
python: '3.13'
- os: macos-14 - os: macos-14
python: '3.9' python: '3.9'
- os: macos-14 - os: macos-14
@@ -108,13 +119,46 @@ jobs:
pkg-config libabsl-dev libaio-dev libprotobuf-dev \ pkg-config libabsl-dev libaio-dev libprotobuf-dev \
patchelf patchelf
# Install Intel MKL for DiskANN # Debug: Show system information
wget -q https://registrationcenter-download.intel.com/akdlm/IRC_NAS/79153e0f-74d7-45af-b8c2-258941adf58a/intel-onemkl-2025.0.0.940.sh echo "🔍 System Information:"
sudo sh intel-onemkl-2025.0.0.940.sh -a --components intel.oneapi.lin.mkl.devel --action install --eula accept -s echo "Architecture: $(uname -m)"
source /opt/intel/oneapi/setvars.sh echo "OS: $(uname -a)"
echo "MKLROOT=/opt/intel/oneapi/mkl/latest" >> $GITHUB_ENV echo "CPU info: $(lscpu | head -5)"
echo "LD_LIBRARY_PATH=/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin" >> $GITHUB_ENV
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/oneapi/mkl/latest/lib/intel64" >> $GITHUB_ENV # Install math library based on architecture
ARCH=$(uname -m)
echo "🔍 Setting up math library for architecture: $ARCH"
if [[ "$ARCH" == "x86_64" ]]; then
# Install Intel MKL for DiskANN on x86_64
echo "📦 Installing Intel MKL for x86_64..."
wget -q https://registrationcenter-download.intel.com/akdlm/IRC_NAS/79153e0f-74d7-45af-b8c2-258941adf58a/intel-onemkl-2025.0.0.940.sh
sudo sh intel-onemkl-2025.0.0.940.sh -a --components intel.oneapi.lin.mkl.devel --action install --eula accept -s
source /opt/intel/oneapi/setvars.sh
echo "MKLROOT=/opt/intel/oneapi/mkl/latest" >> $GITHUB_ENV
echo "LD_LIBRARY_PATH=/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin" >> $GITHUB_ENV
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/oneapi/mkl/latest/lib/intel64" >> $GITHUB_ENV
echo "✅ Intel MKL installed for x86_64"
# Debug: Check MKL installation
echo "🔍 MKL Installation Check:"
ls -la /opt/intel/oneapi/mkl/latest/ || echo "MKL directory not found"
ls -la /opt/intel/oneapi/mkl/latest/lib/ || echo "MKL lib directory not found"
elif [[ "$ARCH" == "aarch64" ]]; then
# Use OpenBLAS for ARM64 (MKL installer not compatible with ARM64)
echo "📦 Installing OpenBLAS for ARM64..."
sudo apt-get install -y libopenblas-dev liblapack-dev liblapacke-dev
echo "✅ OpenBLAS installed for ARM64"
# Debug: Check OpenBLAS installation
echo "🔍 OpenBLAS Installation Check:"
dpkg -l | grep openblas || echo "OpenBLAS package not found"
ls -la /usr/lib/aarch64-linux-gnu/openblas/ || echo "OpenBLAS directory not found"
fi
# Debug: Show final library paths
echo "🔍 Final LD_LIBRARY_PATH: $LD_LIBRARY_PATH"
- name: Install system dependencies (macOS) - name: Install system dependencies (macOS)
if: runner.os == 'macOS' if: runner.os == 'macOS'

1
.gitignore vendored
View File

@@ -22,6 +22,7 @@ demo/experiment_results/**/*.json
*.sh *.sh
*.txt *.txt
!CMakeLists.txt !CMakeLists.txt
!llms.txt
latency_breakdown*.json latency_breakdown*.json
experiment_results/eval_results/diskann/*.json experiment_results/eval_results/diskann/*.json
aws/ aws/

4
.gitmodules vendored
View File

@@ -14,3 +14,7 @@
[submodule "packages/leann-backend-hnsw/third_party/libzmq"] [submodule "packages/leann-backend-hnsw/third_party/libzmq"]
path = packages/leann-backend-hnsw/third_party/libzmq path = packages/leann-backend-hnsw/third_party/libzmq
url = https://github.com/zeromq/libzmq.git url = https://github.com/zeromq/libzmq.git
[submodule "packages/astchunk-leann"]
path = packages/astchunk-leann
url = git@github.com:yichuan-w/astchunk-leann.git
branch = main

View File

@@ -656,6 +656,19 @@ results = searcher.search(
📖 **[Complete Metadata filtering guide →](docs/metadata_filtering.md)** 📖 **[Complete Metadata filtering guide →](docs/metadata_filtering.md)**
### 🔍 Grep Search
For exact text matching instead of semantic search, use the `use_grep` parameter:
```python
# Exact text search
results = searcher.search("bananacrocodile", use_grep=True, top_k=1)
```
**Use cases**: Finding specific code patterns, error messages, function names, or exact phrases where semantic similarity isn't needed.
📖 **[Complete grep search guide →](docs/grep_search.md)**
## 🏗️ Architecture & How It Works ## 🏗️ Architecture & How It Works
<p align="center"> <p align="center">

View File

@@ -1,16 +1,38 @@
""" """Unified chunking utilities facade.
Chunking utilities for LEANN RAG applications.
Provides AST-aware and traditional text chunking functionality. This module re-exports the packaged utilities from `leann.chunking_utils` so
that both repo apps (importing `chunking`) and installed wheels share one
single implementation. When running from the repo without installation, it
adds the `packages/leann-core/src` directory to `sys.path` as a fallback.
""" """
from .utils import ( import sys
CODE_EXTENSIONS, from pathlib import Path
create_ast_chunks,
create_text_chunks, try:
create_traditional_chunks, from leann.chunking_utils import (
detect_code_files, CODE_EXTENSIONS,
get_language_from_extension, create_ast_chunks,
) create_text_chunks,
create_traditional_chunks,
detect_code_files,
get_language_from_extension,
)
except Exception: # pragma: no cover - best-effort fallback for dev environment
repo_root = Path(__file__).resolve().parents[2]
leann_src = repo_root / "packages" / "leann-core" / "src"
if leann_src.exists():
sys.path.insert(0, str(leann_src))
from leann.chunking_utils import (
CODE_EXTENSIONS,
create_ast_chunks,
create_text_chunks,
create_traditional_chunks,
detect_code_files,
get_language_from_extension,
)
else:
raise
__all__ = [ __all__ = [
"CODE_EXTENSIONS", "CODE_EXTENSIONS",

View File

@@ -74,7 +74,7 @@ class ChromeHistoryReader(BaseReader):
if count >= max_count and max_count > 0: if count >= max_count and max_count > 0:
break break
last_visit, url, title, visit_count, typed_count, hidden = row last_visit, url, title, visit_count, typed_count, _hidden = row
# Create document content with metadata embedded in text # Create document content with metadata embedded in text
doc_content = f""" doc_content = f"""

View File

@@ -26,6 +26,21 @@ leann build my-code-index --docs ./src --use-ast-chunking
uv pip install -e "." uv pip install -e "."
``` ```
#### For normal users (PyPI install)
- Use `pip install leann` or `uv pip install leann`.
- `astchunk` is pulled automatically from PyPI as a dependency; no extra steps.
#### For developers (from source, editable)
```bash
git clone https://github.com/yichuan-w/LEANN.git leann
cd leann
git submodule update --init --recursive
uv sync
```
- This repo vendors `astchunk` as a git submodule at `packages/astchunk-leann` (our fork).
- `[tool.uv.sources]` maps the `astchunk` package to that path in editable mode.
- You can edit code under `packages/astchunk-leann` and Python will use your changes immediately (no separate `pip install astchunk` needed).
## Best Practices ## Best Practices
### When to Use AST Chunking ### When to Use AST Chunking

149
docs/grep_search.md Normal file
View File

@@ -0,0 +1,149 @@
# LEANN Grep Search Usage Guide
## Overview
LEANN's grep search functionality provides exact text matching for finding specific code patterns, error messages, function names, or exact phrases in your indexed documents.
## Basic Usage
### Simple Grep Search
```python
from leann.api import LeannSearcher
searcher = LeannSearcher("your_index_path")
# Exact text search
results = searcher.search("def authenticate_user", use_grep=True, top_k=5)
for result in results:
print(f"Score: {result.score}")
print(f"Text: {result.text[:100]}...")
print("-" * 40)
```
### Comparison: Semantic vs Grep Search
```python
# Semantic search - finds conceptually similar content
semantic_results = searcher.search("machine learning algorithms", top_k=3)
# Grep search - finds exact text matches
grep_results = searcher.search("def train_model", use_grep=True, top_k=3)
```
## When to Use Grep Search
### Use Cases
- **Code Search**: Finding specific function definitions, class names, or variable references
- **Error Debugging**: Locating exact error messages or stack traces
- **Documentation**: Finding specific API endpoints or exact terminology
### Examples
```python
# Find function definitions
functions = searcher.search("def __init__", use_grep=True)
# Find import statements
imports = searcher.search("from sklearn import", use_grep=True)
# Find specific error types
errors = searcher.search("FileNotFoundError", use_grep=True)
# Find TODO comments
todos = searcher.search("TODO:", use_grep=True)
# Find configuration entries
configs = searcher.search("server_port=", use_grep=True)
```
## Technical Details
### How It Works
1. **File Location**: Grep search operates on the raw text stored in `.jsonl` files
2. **Command Execution**: Uses the system `grep` command with case-insensitive search
3. **Result Processing**: Parses JSON lines and extracts text and metadata
4. **Scoring**: Simple frequency-based scoring based on query term occurrences
### Search Process
```
Query: "def train_model"
grep -i -n "def train_model" documents.leann.passages.jsonl
Parse matching JSON lines
Calculate scores based on term frequency
Return top_k results
```
### Scoring Algorithm
```python
# Term frequency in document
score = text.lower().count(query.lower())
```
Results are ranked by score (highest first), with higher scores indicating more occurrences of the search term.
## Error Handling
### Common Issues
#### Grep Command Not Found
```
RuntimeError: grep command not found. Please install grep or use semantic search.
```
**Solution**: Install grep on your system:
- **Ubuntu/Debian**: `sudo apt-get install grep`
- **macOS**: grep is pre-installed
- **Windows**: Use WSL or install grep via Git Bash/MSYS2
#### No Results Found
```python
# Check if your query exists in the raw data
results = searcher.search("your_query", use_grep=True)
if not results:
print("No exact matches found. Try:")
print("1. Check spelling and case")
print("2. Use partial terms")
print("3. Switch to semantic search")
```
## Complete Example
```python
#!/usr/bin/env python3
"""
Grep Search Example
Demonstrates grep search for exact text matching.
"""
from leann.api import LeannSearcher
def demonstrate_grep_search():
# Initialize searcher
searcher = LeannSearcher("my_index")
print("=== Function Search ===")
functions = searcher.search("def __init__", use_grep=True, top_k=5)
for i, result in enumerate(functions, 1):
print(f"{i}. Score: {result.score}")
print(f" Preview: {result.text[:60]}...")
print()
print("=== Error Search ===")
errors = searcher.search("FileNotFoundError", use_grep=True, top_k=3)
for result in errors:
print(f"Content: {result.text.strip()}")
print("-" * 40)
if __name__ == "__main__":
demonstrate_grep_search()
```

View File

@@ -0,0 +1,35 @@
"""
Grep Search Example
Shows how to use grep-based text search instead of semantic search.
Useful when you need exact text matches rather than meaning-based results.
"""
from leann import LeannSearcher
# Load your index
searcher = LeannSearcher("my-documents.leann")
# Regular semantic search
print("=== Semantic Search ===")
results = searcher.search("machine learning algorithms", top_k=3)
for result in results:
print(f"Score: {result.score:.3f}")
print(f"Text: {result.text[:80]}...")
print()
# Grep-based search for exact text matches
print("=== Grep Search ===")
results = searcher.search("def train_model", top_k=3, use_grep=True)
for result in results:
print(f"Score: {result.score}")
print(f"Text: {result.text[:80]}...")
print()
# Find specific error messages
error_results = searcher.search("FileNotFoundError", use_grep=True)
print(f"Found {len(error_results)} files mentioning FileNotFoundError")
# Search for function definitions
func_results = searcher.search("class SearchResult", use_grep=True, top_k=5)
print(f"Found {len(func_results)} class definitions")

28
llms.txt Normal file
View File

@@ -0,0 +1,28 @@
# llms.txt — LEANN MCP and Agent Integration
product: LEANN
homepage: https://github.com/yichuan-w/LEANN
contact: https://github.com/yichuan-w/LEANN/issues
# Installation
install: uv tool install leann-core --with leann
# MCP Server Entry Point
mcp.server: leann_mcp
mcp.protocol_version: 2024-11-05
# Tools
mcp.tools: leann_list, leann_search
mcp.tool.leann_list.description: List available LEANN indexes
mcp.tool.leann_list.input: {}
mcp.tool.leann_search.description: Semantic search across a named LEANN index
mcp.tool.leann_search.input.index_name: string, required
mcp.tool.leann_search.input.query: string, required
mcp.tool.leann_search.input.top_k: integer, optional, default=5, min=1, max=20
mcp.tool.leann_search.input.complexity: integer, optional, default=32, min=16, max=128
# Notes
note: Build indexes with `leann build <name> --docs <files...>` before searching.
example.add: claude mcp add --scope user leann-server -- leann_mcp
example.verify: claude mcp list | cat

View File

@@ -4,8 +4,8 @@ build-backend = "scikit_build_core.build"
[project] [project]
name = "leann-backend-diskann" name = "leann-backend-diskann"
version = "0.3.2" version = "0.3.4"
dependencies = ["leann-core==0.3.2", "numpy", "protobuf>=3.19.0"] dependencies = ["leann-core==0.3.4", "numpy", "protobuf>=3.19.0"]
[tool.scikit-build] [tool.scikit-build]
# Key: simplified CMake path # Key: simplified CMake path

View File

@@ -49,9 +49,28 @@ set(BUILD_TESTING OFF CACHE BOOL "" FORCE)
set(FAISS_ENABLE_C_API OFF CACHE BOOL "" FORCE) set(FAISS_ENABLE_C_API OFF CACHE BOOL "" FORCE)
set(FAISS_OPT_LEVEL "generic" CACHE STRING "" FORCE) set(FAISS_OPT_LEVEL "generic" CACHE STRING "" FORCE)
# Disable additional SIMD versions to speed up compilation # Disable x86-specific SIMD optimizations (important for ARM64 compatibility)
set(FAISS_ENABLE_AVX2 OFF CACHE BOOL "" FORCE) set(FAISS_ENABLE_AVX2 OFF CACHE BOOL "" FORCE)
set(FAISS_ENABLE_AVX512 OFF CACHE BOOL "" FORCE) set(FAISS_ENABLE_AVX512 OFF CACHE BOOL "" FORCE)
set(FAISS_ENABLE_SSE4_1 OFF CACHE BOOL "" FORCE)
# ARM64-specific configuration
if(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|arm64")
message(STATUS "Configuring Faiss for ARM64 architecture")
if(CMAKE_SYSTEM_NAME STREQUAL "Linux")
# Use SVE optimization level for ARM64 Linux (as seen in Faiss conda build)
set(FAISS_OPT_LEVEL "sve" CACHE STRING "" FORCE)
message(STATUS "Setting FAISS_OPT_LEVEL to 'sve' for ARM64 Linux")
else()
# Use generic optimization for other ARM64 platforms (like macOS)
set(FAISS_OPT_LEVEL "generic" CACHE STRING "" FORCE)
message(STATUS "Setting FAISS_OPT_LEVEL to 'generic' for ARM64 ${CMAKE_SYSTEM_NAME}")
endif()
# ARM64 compatibility: Faiss submodule has been modified to fix x86 header inclusion
message(STATUS "Using ARM64-compatible Faiss submodule")
endif()
# Additional optimization options from INSTALL.md # Additional optimization options from INSTALL.md
set(CMAKE_BUILD_TYPE "Release" CACHE STRING "" FORCE) set(CMAKE_BUILD_TYPE "Release" CACHE STRING "" FORCE)

View File

@@ -6,10 +6,10 @@ build-backend = "scikit_build_core.build"
[project] [project]
name = "leann-backend-hnsw" name = "leann-backend-hnsw"
version = "0.3.2" version = "0.3.4"
description = "Custom-built HNSW (Faiss) backend for the Leann toolkit." description = "Custom-built HNSW (Faiss) backend for the Leann toolkit."
dependencies = [ dependencies = [
"leann-core==0.3.2", "leann-core==0.3.4",
"numpy", "numpy",
"pyzmq>=23.0.0", "pyzmq>=23.0.0",
"msgpack>=1.0.0", "msgpack>=1.0.0",

View File

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
[project] [project]
name = "leann-core" name = "leann-core"
version = "0.3.2" version = "0.3.4"
description = "Core API and plugin system for LEANN" description = "Core API and plugin system for LEANN"
readme = "README.md" readme = "README.md"
requires-python = ">=3.9" requires-python = ">=3.9"

View File

@@ -6,6 +6,8 @@ with the correct, original embedding logic from the user's reference code.
import json import json
import logging import logging
import pickle import pickle
import re
import subprocess
import time import time
import warnings import warnings
from dataclasses import dataclass, field from dataclasses import dataclass, field
@@ -653,6 +655,7 @@ class LeannSearcher:
expected_zmq_port: int = 5557, expected_zmq_port: int = 5557,
metadata_filters: Optional[dict[str, dict[str, Union[str, int, float, bool, list]]]] = None, metadata_filters: Optional[dict[str, dict[str, Union[str, int, float, bool, list]]]] = None,
batch_size: int = 0, batch_size: int = 0,
use_grep: bool = False,
**kwargs, **kwargs,
) -> list[SearchResult]: ) -> list[SearchResult]:
""" """
@@ -679,6 +682,10 @@ class LeannSearcher:
Returns: Returns:
List of SearchResult objects with text, metadata, and similarity scores List of SearchResult objects with text, metadata, and similarity scores
""" """
# Handle grep search
if use_grep:
return self._grep_search(query, top_k)
logger.info("🔍 LeannSearcher.search() called:") logger.info("🔍 LeannSearcher.search() called:")
logger.info(f" Query: '{query}'") logger.info(f" Query: '{query}'")
logger.info(f" Top_k: {top_k}") logger.info(f" Top_k: {top_k}")
@@ -795,9 +802,96 @@ class LeannSearcher:
logger.info(f" {GREEN}✓ Final enriched results: {len(enriched_results)} passages{RESET}") logger.info(f" {GREEN}✓ Final enriched results: {len(enriched_results)} passages{RESET}")
return enriched_results return enriched_results
def _find_jsonl_file(self) -> Optional[str]:
"""Find the .jsonl file containing raw passages for grep search"""
index_path = Path(self.meta_path_str).parent
potential_files = [
index_path / "documents.leann.passages.jsonl",
index_path.parent / "documents.leann.passages.jsonl",
]
for file_path in potential_files:
if file_path.exists():
return str(file_path)
return None
def _grep_search(self, query: str, top_k: int = 5) -> list[SearchResult]:
"""Perform grep-based search on raw passages"""
jsonl_file = self._find_jsonl_file()
if not jsonl_file:
raise FileNotFoundError("No .jsonl passages file found for grep search")
try:
cmd = ["grep", "-i", "-n", query, jsonl_file]
result = subprocess.run(cmd, capture_output=True, text=True, check=False)
if result.returncode == 1:
return []
elif result.returncode != 0:
raise RuntimeError(f"Grep failed: {result.stderr}")
matches = []
for line in result.stdout.strip().split("\n"):
if not line:
continue
parts = line.split(":", 1)
if len(parts) != 2:
continue
try:
data = json.loads(parts[1])
text = data.get("text", "")
score = text.lower().count(query.lower())
matches.append(
SearchResult(
id=data.get("id", parts[0]),
text=text,
metadata=data.get("metadata", {}),
score=float(score),
)
)
except json.JSONDecodeError:
continue
matches.sort(key=lambda x: x.score, reverse=True)
return matches[:top_k]
except FileNotFoundError:
raise RuntimeError(
"grep command not found. Please install grep or use semantic search."
)
def _python_regex_search(self, query: str, top_k: int = 5) -> list[SearchResult]:
"""Fallback regex search"""
jsonl_file = self._find_jsonl_file()
if not jsonl_file:
raise FileNotFoundError("No .jsonl file found")
pattern = re.compile(re.escape(query), re.IGNORECASE)
matches = []
with open(jsonl_file, encoding="utf-8") as f:
for line_num, line in enumerate(f, 1):
if pattern.search(line):
try:
data = json.loads(line.strip())
matches.append(
SearchResult(
id=data.get("id", str(line_num)),
text=data.get("text", ""),
metadata=data.get("metadata", {}),
score=float(len(pattern.findall(data.get("text", "")))),
)
)
except json.JSONDecodeError:
continue
matches.sort(key=lambda x: x.score, reverse=True)
return matches[:top_k]
def cleanup(self): def cleanup(self):
"""Explicitly cleanup embedding server resources. """Explicitly cleanup embedding server resources.
This method should be called after you're done using the searcher, This method should be called after you're done using the searcher,
especially in test environments or batch processing scenarios. especially in test environments or batch processing scenarios.
""" """
@@ -853,6 +947,7 @@ class LeannChat:
expected_zmq_port: int = 5557, expected_zmq_port: int = 5557,
metadata_filters: Optional[dict[str, dict[str, Union[str, int, float, bool, list]]]] = None, metadata_filters: Optional[dict[str, dict[str, Union[str, int, float, bool, list]]]] = None,
batch_size: int = 0, batch_size: int = 0,
use_grep: bool = False,
**search_kwargs, **search_kwargs,
): ):
if llm_kwargs is None: if llm_kwargs is None:

View File

@@ -1,6 +1,6 @@
""" """
Enhanced chunking utilities with AST-aware code chunking support. Enhanced chunking utilities with AST-aware code chunking support.
Provides unified interface for both traditional and AST-based text chunking. Packaged within leann-core so installed wheels can import it reliably.
""" """
import logging import logging
@@ -22,30 +22,9 @@ CODE_EXTENSIONS = {
".jsx": "typescript", ".jsx": "typescript",
} }
# Default chunk parameters for different content types
DEFAULT_CHUNK_PARAMS = {
"code": {
"max_chunk_size": 512,
"chunk_overlap": 64,
},
"text": {
"chunk_size": 256,
"chunk_overlap": 128,
},
}
def detect_code_files(documents, code_extensions=None) -> tuple[list, list]: def detect_code_files(documents, code_extensions=None) -> tuple[list, list]:
""" """Separate documents into code files and regular text files."""
Separate documents into code files and regular text files.
Args:
documents: List of LlamaIndex Document objects
code_extensions: Dict mapping file extensions to languages (defaults to CODE_EXTENSIONS)
Returns:
Tuple of (code_documents, text_documents)
"""
if code_extensions is None: if code_extensions is None:
code_extensions = CODE_EXTENSIONS code_extensions = CODE_EXTENSIONS
@@ -53,16 +32,10 @@ def detect_code_files(documents, code_extensions=None) -> tuple[list, list]:
text_docs = [] text_docs = []
for doc in documents: for doc in documents:
# Get file path from metadata file_path = doc.metadata.get("file_path", "") or doc.metadata.get("file_name", "")
file_path = doc.metadata.get("file_path", "")
if not file_path:
# Fallback to file_name
file_path = doc.metadata.get("file_name", "")
if file_path: if file_path:
file_ext = Path(file_path).suffix.lower() file_ext = Path(file_path).suffix.lower()
if file_ext in code_extensions: if file_ext in code_extensions:
# Add language info to metadata
doc.metadata["language"] = code_extensions[file_ext] doc.metadata["language"] = code_extensions[file_ext]
doc.metadata["is_code"] = True doc.metadata["is_code"] = True
code_docs.append(doc) code_docs.append(doc)
@@ -70,7 +43,6 @@ def detect_code_files(documents, code_extensions=None) -> tuple[list, list]:
doc.metadata["is_code"] = False doc.metadata["is_code"] = False
text_docs.append(doc) text_docs.append(doc)
else: else:
# If no file path, treat as text
doc.metadata["is_code"] = False doc.metadata["is_code"] = False
text_docs.append(doc) text_docs.append(doc)
@@ -79,7 +51,7 @@ def detect_code_files(documents, code_extensions=None) -> tuple[list, list]:
def get_language_from_extension(file_path: str) -> Optional[str]: def get_language_from_extension(file_path: str) -> Optional[str]:
"""Get the programming language from file extension.""" """Return language string from a filename/extension using CODE_EXTENSIONS."""
ext = Path(file_path).suffix.lower() ext = Path(file_path).suffix.lower()
return CODE_EXTENSIONS.get(ext) return CODE_EXTENSIONS.get(ext)
@@ -90,40 +62,26 @@ def create_ast_chunks(
chunk_overlap: int = 64, chunk_overlap: int = 64,
metadata_template: str = "default", metadata_template: str = "default",
) -> list[str]: ) -> list[str]:
""" """Create AST-aware chunks from code documents using astchunk.
Create AST-aware chunks from code documents using astchunk.
Args: Falls back to traditional chunking if astchunk is unavailable.
documents: List of code documents
max_chunk_size: Maximum characters per chunk
chunk_overlap: Number of AST nodes to overlap between chunks
metadata_template: Template for chunk metadata
Returns:
List of text chunks with preserved code structure
""" """
try: try:
from astchunk import ASTChunkBuilder from astchunk import ASTChunkBuilder # optional dependency
except ImportError as e: except ImportError as e:
logger.error(f"astchunk not available: {e}") logger.error(f"astchunk not available: {e}")
logger.info("Falling back to traditional chunking for code files") logger.info("Falling back to traditional chunking for code files")
return create_traditional_chunks(documents, max_chunk_size, chunk_overlap) return create_traditional_chunks(documents, max_chunk_size, chunk_overlap)
all_chunks = [] all_chunks = []
for doc in documents: for doc in documents:
# Get language from metadata (set by detect_code_files)
language = doc.metadata.get("language") language = doc.metadata.get("language")
if not language: if not language:
logger.warning( logger.warning("No language detected; falling back to traditional chunking")
"No language detected for document, falling back to traditional chunking" all_chunks.extend(create_traditional_chunks([doc], max_chunk_size, chunk_overlap))
)
traditional_chunks = create_traditional_chunks([doc], max_chunk_size, chunk_overlap)
all_chunks.extend(traditional_chunks)
continue continue
try: try:
# Configure astchunk
configs = { configs = {
"max_chunk_size": max_chunk_size, "max_chunk_size": max_chunk_size,
"language": language, "language": language,
@@ -131,7 +89,6 @@ def create_ast_chunks(
"chunk_overlap": chunk_overlap if chunk_overlap > 0 else 0, "chunk_overlap": chunk_overlap if chunk_overlap > 0 else 0,
} }
# Add repository-level metadata if available
repo_metadata = { repo_metadata = {
"file_path": doc.metadata.get("file_path", ""), "file_path": doc.metadata.get("file_path", ""),
"file_name": doc.metadata.get("file_name", ""), "file_name": doc.metadata.get("file_name", ""),
@@ -140,17 +97,13 @@ def create_ast_chunks(
} }
configs["repo_level_metadata"] = repo_metadata configs["repo_level_metadata"] = repo_metadata
# Create chunk builder and process
chunk_builder = ASTChunkBuilder(**configs) chunk_builder = ASTChunkBuilder(**configs)
code_content = doc.get_content() code_content = doc.get_content()
if not code_content or not code_content.strip(): if not code_content or not code_content.strip():
logger.warning("Empty code content, skipping") logger.warning("Empty code content, skipping")
continue continue
chunks = chunk_builder.chunkify(code_content) chunks = chunk_builder.chunkify(code_content)
# Extract text content from chunks
for chunk in chunks: for chunk in chunks:
if hasattr(chunk, "text"): if hasattr(chunk, "text"):
chunk_text = chunk.text chunk_text = chunk.text
@@ -159,7 +112,6 @@ def create_ast_chunks(
elif isinstance(chunk, str): elif isinstance(chunk, str):
chunk_text = chunk chunk_text = chunk
else: else:
# Try to convert to string
chunk_text = str(chunk) chunk_text = str(chunk)
if chunk_text and chunk_text.strip(): if chunk_text and chunk_text.strip():
@@ -168,12 +120,10 @@ def create_ast_chunks(
logger.info( logger.info(
f"Created {len(chunks)} AST chunks from {language} file: {doc.metadata.get('file_name', 'unknown')}" f"Created {len(chunks)} AST chunks from {language} file: {doc.metadata.get('file_name', 'unknown')}"
) )
except Exception as e: except Exception as e:
logger.warning(f"AST chunking failed for {language} file: {e}") logger.warning(f"AST chunking failed for {language} file: {e}")
logger.info("Falling back to traditional chunking") logger.info("Falling back to traditional chunking")
traditional_chunks = create_traditional_chunks([doc], max_chunk_size, chunk_overlap) all_chunks.extend(create_traditional_chunks([doc], max_chunk_size, chunk_overlap))
all_chunks.extend(traditional_chunks)
return all_chunks return all_chunks
@@ -181,23 +131,10 @@ def create_ast_chunks(
def create_traditional_chunks( def create_traditional_chunks(
documents, chunk_size: int = 256, chunk_overlap: int = 128 documents, chunk_size: int = 256, chunk_overlap: int = 128
) -> list[str]: ) -> list[str]:
""" """Create traditional text chunks using LlamaIndex SentenceSplitter."""
Create traditional text chunks using LlamaIndex SentenceSplitter.
Args:
documents: List of documents to chunk
chunk_size: Size of each chunk in characters
chunk_overlap: Overlap between chunks
Returns:
List of text chunks
"""
# Handle invalid chunk_size values
if chunk_size <= 0: if chunk_size <= 0:
logger.warning(f"Invalid chunk_size={chunk_size}, using default value of 256") logger.warning(f"Invalid chunk_size={chunk_size}, using default value of 256")
chunk_size = 256 chunk_size = 256
# Ensure chunk_overlap is not negative and not larger than chunk_size
if chunk_overlap < 0: if chunk_overlap < 0:
chunk_overlap = 0 chunk_overlap = 0
if chunk_overlap >= chunk_size: if chunk_overlap >= chunk_size:
@@ -215,12 +152,9 @@ def create_traditional_chunks(
try: try:
nodes = node_parser.get_nodes_from_documents([doc]) nodes = node_parser.get_nodes_from_documents([doc])
if nodes: if nodes:
chunk_texts = [node.get_content() for node in nodes] all_texts.extend(node.get_content() for node in nodes)
all_texts.extend(chunk_texts)
logger.debug(f"Created {len(chunk_texts)} traditional chunks from document")
except Exception as e: except Exception as e:
logger.error(f"Traditional chunking failed for document: {e}") logger.error(f"Traditional chunking failed for document: {e}")
# As last resort, add the raw content
content = doc.get_content() content = doc.get_content()
if content and content.strip(): if content and content.strip():
all_texts.append(content.strip()) all_texts.append(content.strip())
@@ -238,32 +172,13 @@ def create_text_chunks(
code_file_extensions: Optional[list[str]] = None, code_file_extensions: Optional[list[str]] = None,
ast_fallback_traditional: bool = True, ast_fallback_traditional: bool = True,
) -> list[str]: ) -> list[str]:
""" """Create text chunks from documents with optional AST support for code files."""
Create text chunks from documents with optional AST support for code files.
Args:
documents: List of LlamaIndex Document objects
chunk_size: Size for traditional text chunks
chunk_overlap: Overlap for traditional text chunks
use_ast_chunking: Whether to use AST chunking for code files
ast_chunk_size: Size for AST chunks
ast_chunk_overlap: Overlap for AST chunks
code_file_extensions: Custom list of code file extensions
ast_fallback_traditional: Fall back to traditional chunking on AST errors
Returns:
List of text chunks
"""
if not documents: if not documents:
logger.warning("No documents provided for chunking") logger.warning("No documents provided for chunking")
return [] return []
# Create a local copy of supported extensions for this function call
local_code_extensions = CODE_EXTENSIONS.copy() local_code_extensions = CODE_EXTENSIONS.copy()
# Update supported extensions if provided
if code_file_extensions: if code_file_extensions:
# Map extensions to languages (simplified mapping)
ext_mapping = { ext_mapping = {
".py": "python", ".py": "python",
".java": "java", ".java": "java",
@@ -273,47 +188,32 @@ def create_text_chunks(
} }
for ext in code_file_extensions: for ext in code_file_extensions:
if ext.lower() not in local_code_extensions: if ext.lower() not in local_code_extensions:
# Try to guess language from extension
if ext.lower() in ext_mapping: if ext.lower() in ext_mapping:
local_code_extensions[ext.lower()] = ext_mapping[ext.lower()] local_code_extensions[ext.lower()] = ext_mapping[ext.lower()]
else: else:
logger.warning(f"Unsupported extension {ext}, will use traditional chunking") logger.warning(f"Unsupported extension {ext}, will use traditional chunking")
all_chunks = [] all_chunks = []
if use_ast_chunking: if use_ast_chunking:
# Separate code and text documents using local extensions
code_docs, text_docs = detect_code_files(documents, local_code_extensions) code_docs, text_docs = detect_code_files(documents, local_code_extensions)
# Process code files with AST chunking
if code_docs: if code_docs:
logger.info(f"Processing {len(code_docs)} code files with AST chunking")
try: try:
ast_chunks = create_ast_chunks( all_chunks.extend(
code_docs, max_chunk_size=ast_chunk_size, chunk_overlap=ast_chunk_overlap create_ast_chunks(
code_docs, max_chunk_size=ast_chunk_size, chunk_overlap=ast_chunk_overlap
)
) )
all_chunks.extend(ast_chunks)
logger.info(f"Created {len(ast_chunks)} AST chunks from code files")
except Exception as e: except Exception as e:
logger.error(f"AST chunking failed: {e}") logger.error(f"AST chunking failed: {e}")
if ast_fallback_traditional: if ast_fallback_traditional:
logger.info("Falling back to traditional chunking for code files") all_chunks.extend(
traditional_code_chunks = create_traditional_chunks( create_traditional_chunks(code_docs, chunk_size, chunk_overlap)
code_docs, chunk_size, chunk_overlap
) )
all_chunks.extend(traditional_code_chunks)
else: else:
raise raise
# Process text files with traditional chunking
if text_docs: if text_docs:
logger.info(f"Processing {len(text_docs)} text files with traditional chunking") all_chunks.extend(create_traditional_chunks(text_docs, chunk_size, chunk_overlap))
text_chunks = create_traditional_chunks(text_docs, chunk_size, chunk_overlap)
all_chunks.extend(text_chunks)
logger.info(f"Created {len(text_chunks)} traditional chunks from text files")
else: else:
# Use traditional chunking for all files
logger.info(f"Processing {len(documents)} documents with traditional chunking")
all_chunks = create_traditional_chunks(documents, chunk_size, chunk_overlap) all_chunks = create_traditional_chunks(documents, chunk_size, chunk_overlap)
logger.info(f"Total chunks created: {len(all_chunks)}") logger.info(f"Total chunks created: {len(all_chunks)}")

View File

@@ -1,6 +1,5 @@
import argparse import argparse
import asyncio import asyncio
import sys
from pathlib import Path from pathlib import Path
from typing import Any, Optional, Union from typing import Any, Optional, Union
@@ -322,9 +321,17 @@ Examples:
return basic_matches return basic_matches
def _should_exclude_file(self, relative_path: Path, gitignore_matches) -> bool: def _should_exclude_file(self, file_path: Path, gitignore_matches) -> bool:
"""Check if a file should be excluded using gitignore parser.""" """Check if a file should be excluded using gitignore parser.
return gitignore_matches(str(relative_path))
Always match against absolute, posix-style paths for consistency with
gitignore_parser expectations.
"""
try:
absolute_path = file_path.resolve()
except Exception:
absolute_path = Path(str(file_path))
return gitignore_matches(absolute_path.as_posix())
def _is_git_submodule(self, path: Path) -> bool: def _is_git_submodule(self, path: Path) -> bool:
"""Check if a path is a git submodule.""" """Check if a path is a git submodule."""
@@ -396,7 +403,9 @@ Examples:
print(f" {current_path}") print(f" {current_path}")
print(" " + "" * 45) print(" " + "" * 45)
current_indexes = self._discover_indexes_in_project(current_path) current_indexes = self._discover_indexes_in_project(
current_path, exclude_dirs=other_projects
)
if current_indexes: if current_indexes:
for idx in current_indexes: for idx in current_indexes:
total_indexes += 1 total_indexes += 1
@@ -435,9 +444,14 @@ Examples:
print(" leann build my-docs --docs ./documents") print(" leann build my-docs --docs ./documents")
else: else:
# Count only projects that have at least one discoverable index # Count only projects that have at least one discoverable index
projects_count = sum( projects_count = 0
1 for p in valid_projects if len(self._discover_indexes_in_project(p)) > 0 for p in valid_projects:
) if p == current_path:
discovered = self._discover_indexes_in_project(p, exclude_dirs=other_projects)
else:
discovered = self._discover_indexes_in_project(p)
if len(discovered) > 0:
projects_count += 1
print(f"📊 Total: {total_indexes} indexes across {projects_count} projects") print(f"📊 Total: {total_indexes} indexes across {projects_count} projects")
if current_indexes_count > 0: if current_indexes_count > 0:
@@ -454,9 +468,22 @@ Examples:
print("\n💡 Create your first index:") print("\n💡 Create your first index:")
print(" leann build my-docs --docs ./documents") print(" leann build my-docs --docs ./documents")
def _discover_indexes_in_project(self, project_path: Path): def _discover_indexes_in_project(
"""Discover all indexes in a project directory (both CLI and apps formats)""" self, project_path: Path, exclude_dirs: Optional[list[Path]] = None
):
"""Discover all indexes in a project directory (both CLI and apps formats)
exclude_dirs: when provided, skip any APP-format index files that are
located under these directories. This prevents duplicates when the
current project is a parent directory of other registered projects.
"""
indexes = [] indexes = []
exclude_dirs = exclude_dirs or []
# normalize to resolved paths once for comparison
try:
exclude_dirs_resolved = [p.resolve() for p in exclude_dirs]
except Exception:
exclude_dirs_resolved = exclude_dirs
# 1. CLI format: .leann/indexes/index_name/ # 1. CLI format: .leann/indexes/index_name/
cli_indexes_dir = project_path / ".leann" / "indexes" cli_indexes_dir = project_path / ".leann" / "indexes"
@@ -495,6 +522,17 @@ Examples:
continue continue
except Exception: except Exception:
pass pass
# Skip meta files that live under excluded directories
try:
meta_parent_resolved = meta_file.parent.resolve()
if any(
meta_parent_resolved.is_relative_to(ex_dir)
for ex_dir in exclude_dirs_resolved
):
continue
except Exception:
# best effort; if resolve or comparison fails, do not exclude
pass
# Use the parent directory name as the app index display name # Use the parent directory name as the app index display name
display_name = meta_file.parent.name display_name = meta_file.parent.name
# Extract file base used to store files # Extract file base used to store files
@@ -1022,7 +1060,8 @@ Examples:
# Try to use better PDF parsers first, but only if PDFs are requested # Try to use better PDF parsers first, but only if PDFs are requested
documents = [] documents = []
docs_path = Path(docs_dir) # Use resolved absolute paths to avoid mismatches (symlinks, relative vs absolute)
docs_path = Path(docs_dir).resolve()
# Check if we should process PDFs # Check if we should process PDFs
should_process_pdfs = custom_file_types is None or ".pdf" in custom_file_types should_process_pdfs = custom_file_types is None or ".pdf" in custom_file_types
@@ -1031,10 +1070,15 @@ Examples:
for file_path in docs_path.rglob("*.pdf"): for file_path in docs_path.rglob("*.pdf"):
# Check if file matches any exclude pattern # Check if file matches any exclude pattern
try: try:
# Ensure both paths are resolved before computing relativity
file_path_resolved = file_path.resolve()
# Determine directory scope using the non-resolved path to avoid
# misclassifying symlinked entries as outside the docs directory
relative_path = file_path.relative_to(docs_path) relative_path = file_path.relative_to(docs_path)
if not include_hidden and _path_has_hidden_segment(relative_path): if not include_hidden and _path_has_hidden_segment(relative_path):
continue continue
if self._should_exclude_file(relative_path, gitignore_matches): # Use absolute path for gitignore matching
if self._should_exclude_file(file_path_resolved, gitignore_matches):
continue continue
except ValueError: except ValueError:
# Skip files that can't be made relative to docs_path # Skip files that can't be made relative to docs_path
@@ -1077,10 +1121,11 @@ Examples:
) -> bool: ) -> bool:
"""Return True if file should be included (not excluded)""" """Return True if file should be included (not excluded)"""
try: try:
docs_path_obj = Path(docs_dir) docs_path_obj = Path(docs_dir).resolve()
file_path_obj = Path(file_path) file_path_obj = Path(file_path).resolve()
relative_path = file_path_obj.relative_to(docs_path_obj) # Use absolute path for gitignore matching
return not self._should_exclude_file(relative_path, gitignore_matches) _ = file_path_obj.relative_to(docs_path_obj) # validate scope
return not self._should_exclude_file(file_path_obj, gitignore_matches)
except (ValueError, OSError): except (ValueError, OSError):
return True # Include files that can't be processed return True # Include files that can't be processed
@@ -1170,13 +1215,8 @@ Examples:
if use_ast: if use_ast:
print("🧠 Using AST-aware chunking for code files") print("🧠 Using AST-aware chunking for code files")
try: try:
# Import enhanced chunking utilities # Import enhanced chunking utilities from packaged module
# Add apps directory to path to import chunking utilities from .chunking_utils import create_text_chunks
apps_dir = Path(__file__).parent.parent.parent.parent.parent / "apps"
if apps_dir.exists():
sys.path.insert(0, str(apps_dir))
from chunking import create_text_chunks
# Use enhanced chunking with AST support # Use enhanced chunking with AST support
all_texts = create_text_chunks( all_texts = create_text_chunks(
@@ -1191,7 +1231,9 @@ Examples:
) )
except ImportError as e: except ImportError as e:
print(f"⚠️ AST chunking not available ({e}), falling back to traditional chunking") print(
f"⚠️ AST chunking utilities not available in package ({e}), falling back to traditional chunking"
)
use_ast = False use_ast = False
if not use_ast: if not use_ast:

View File

@@ -2,6 +2,8 @@
Transform your development workflow with intelligent code assistance using LEANN's semantic search directly in Claude Code. Transform your development workflow with intelligent code assistance using LEANN's semantic search directly in Claude Code.
For agent-facing discovery details, see `llms.txt` in the repository root.
## Prerequisites ## Prerequisites
Install LEANN globally for MCP integration (with default backend): Install LEANN globally for MCP integration (with default backend):

View File

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
[project] [project]
name = "leann" name = "leann"
version = "0.3.2" version = "0.3.4"
description = "LEANN - The smallest vector index in the world. RAG Everything with LEANN!" description = "LEANN - The smallest vector index in the world. RAG Everything with LEANN!"
readme = "README.md" readme = "README.md"
requires-python = ">=3.9" requires-python = ">=3.9"

View File

@@ -99,6 +99,7 @@ wechat-exporter = "wechat_exporter.main:main"
leann-core = { path = "packages/leann-core", editable = true } leann-core = { path = "packages/leann-core", editable = true }
leann-backend-diskann = { path = "packages/leann-backend-diskann", editable = true } leann-backend-diskann = { path = "packages/leann-backend-diskann", editable = true }
leann-backend-hnsw = { path = "packages/leann-backend-hnsw", editable = true } leann-backend-hnsw = { path = "packages/leann-backend-hnsw", editable = true }
astchunk = { path = "packages/astchunk-leann", editable = true }
[tool.ruff] [tool.ruff]
target-version = "py39" target-version = "py39"

45
uv.lock generated
View File

@@ -1,5 +1,5 @@
version = 1 version = 1
revision = 3 revision = 2
requires-python = ">=3.9" requires-python = ">=3.9"
resolution-markers = [ resolution-markers = [
"python_full_version >= '3.12'", "python_full_version >= '3.12'",
@@ -201,7 +201,7 @@ wheels = [
[[package]] [[package]]
name = "astchunk" name = "astchunk"
version = "0.1.0" version = "0.1.0"
source = { registry = "https://pypi.org/simple" } source = { editable = "packages/astchunk-leann" }
dependencies = [ dependencies = [
{ name = "numpy", version = "2.0.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" }, { name = "numpy", version = "2.0.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" },
{ name = "numpy", version = "2.2.6", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version == '3.10.*'" }, { name = "numpy", version = "2.2.6", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version == '3.10.*'" },
@@ -214,10 +214,31 @@ dependencies = [
{ name = "tree-sitter-python" }, { name = "tree-sitter-python" },
{ name = "tree-sitter-typescript" }, { name = "tree-sitter-typescript" },
] ]
sdist = { url = "https://files.pythonhosted.org/packages/db/2a/7a35e2fac7d550265ae2ee40651425083b37555f921d1a1b77c3f525e0df/astchunk-0.1.0.tar.gz", hash = "sha256:f4dff0ef8b3b3bcfeac363384db1e153f74d4c825dc2e35864abfab027713be4", size = 18093, upload-time = "2025-06-19T04:37:25.34Z" }
wheels = [ [package.metadata]
{ url = "https://files.pythonhosted.org/packages/be/84/5433ab0e933b572750cb16fd7edf3d6c7902b069461a22ec670042752a4d/astchunk-0.1.0-py3-none-any.whl", hash = "sha256:33ada9fc3620807fdda5846fa1948af463f281a60e0d43d4f3782b6dbb416d24", size = 15396, upload-time = "2025-06-19T04:37:23.87Z" }, requires-dist = [
{ name = "black", marker = "extra == 'dev'", specifier = ">=22.0.0" },
{ name = "flake8", marker = "extra == 'dev'", specifier = ">=5.0.0" },
{ name = "isort", marker = "extra == 'dev'", specifier = ">=5.10.0" },
{ name = "mypy", marker = "extra == 'dev'", specifier = ">=1.0.0" },
{ name = "myst-parser", marker = "extra == 'docs'", specifier = ">=0.18.0" },
{ name = "numpy", specifier = ">=1.20.0" },
{ name = "pre-commit", marker = "extra == 'dev'", specifier = ">=2.20.0" },
{ name = "pyrsistent", specifier = ">=0.18.0" },
{ name = "pytest", marker = "extra == 'dev'", specifier = ">=7.0.0" },
{ name = "pytest", marker = "extra == 'test'", specifier = ">=7.0.0" },
{ name = "pytest-cov", marker = "extra == 'dev'", specifier = ">=4.0.0" },
{ name = "pytest-cov", marker = "extra == 'test'", specifier = ">=4.0.0" },
{ name = "pytest-xdist", marker = "extra == 'test'", specifier = ">=2.5.0" },
{ name = "sphinx", marker = "extra == 'docs'", specifier = ">=5.0.0" },
{ name = "sphinx-rtd-theme", marker = "extra == 'docs'", specifier = ">=1.0.0" },
{ name = "tree-sitter", specifier = ">=0.20.0" },
{ name = "tree-sitter-c-sharp", specifier = ">=0.20.0" },
{ name = "tree-sitter-java", specifier = ">=0.20.0" },
{ name = "tree-sitter-python", specifier = ">=0.20.0" },
{ name = "tree-sitter-typescript", specifier = ">=0.20.0" },
] ]
provides-extras = ["dev", "docs", "test"]
[[package]] [[package]]
name = "asttokens" name = "asttokens"
@@ -1564,7 +1585,7 @@ name = "importlib-metadata"
version = "8.7.0" version = "8.7.0"
source = { registry = "https://pypi.org/simple" } source = { registry = "https://pypi.org/simple" }
dependencies = [ dependencies = [
{ name = "zipp" }, { name = "zipp", marker = "python_full_version < '3.10'" },
] ]
sdist = { url = "https://files.pythonhosted.org/packages/76/66/650a33bd90f786193e4de4b3ad86ea60b53c89b669a5c7be931fac31cdb0/importlib_metadata-8.7.0.tar.gz", hash = "sha256:d13b81ad223b890aa16c5471f2ac3056cf76c5f10f82d6f9292f0b415f389000", size = 56641, upload-time = "2025-04-27T15:29:01.736Z" } sdist = { url = "https://files.pythonhosted.org/packages/76/66/650a33bd90f786193e4de4b3ad86ea60b53c89b669a5c7be931fac31cdb0/importlib_metadata-8.7.0.tar.gz", hash = "sha256:d13b81ad223b890aa16c5471f2ac3056cf76c5f10f82d6f9292f0b415f389000", size = 56641, upload-time = "2025-04-27T15:29:01.736Z" }
wheels = [ wheels = [
@@ -2117,7 +2138,7 @@ wheels = [
[[package]] [[package]]
name = "leann-backend-diskann" name = "leann-backend-diskann"
version = "0.3.2" version = "0.3.3"
source = { editable = "packages/leann-backend-diskann" } source = { editable = "packages/leann-backend-diskann" }
dependencies = [ dependencies = [
{ name = "leann-core" }, { name = "leann-core" },
@@ -2129,14 +2150,14 @@ dependencies = [
[package.metadata] [package.metadata]
requires-dist = [ requires-dist = [
{ name = "leann-core", specifier = "==0.3.2" }, { name = "leann-core", specifier = "==0.3.3" },
{ name = "numpy" }, { name = "numpy" },
{ name = "protobuf", specifier = ">=3.19.0" }, { name = "protobuf", specifier = ">=3.19.0" },
] ]
[[package]] [[package]]
name = "leann-backend-hnsw" name = "leann-backend-hnsw"
version = "0.3.2" version = "0.3.3"
source = { editable = "packages/leann-backend-hnsw" } source = { editable = "packages/leann-backend-hnsw" }
dependencies = [ dependencies = [
{ name = "leann-core" }, { name = "leann-core" },
@@ -2149,7 +2170,7 @@ dependencies = [
[package.metadata] [package.metadata]
requires-dist = [ requires-dist = [
{ name = "leann-core", specifier = "==0.3.2" }, { name = "leann-core", specifier = "==0.3.3" },
{ name = "msgpack", specifier = ">=1.0.0" }, { name = "msgpack", specifier = ">=1.0.0" },
{ name = "numpy" }, { name = "numpy" },
{ name = "pyzmq", specifier = ">=23.0.0" }, { name = "pyzmq", specifier = ">=23.0.0" },
@@ -2157,7 +2178,7 @@ requires-dist = [
[[package]] [[package]]
name = "leann-core" name = "leann-core"
version = "0.3.2" version = "0.3.3"
source = { editable = "packages/leann-core" } source = { editable = "packages/leann-core" }
dependencies = [ dependencies = [
{ name = "accelerate" }, { name = "accelerate" },
@@ -2297,7 +2318,7 @@ test = [
[package.metadata] [package.metadata]
requires-dist = [ requires-dist = [
{ name = "astchunk", specifier = ">=0.1.0" }, { name = "astchunk", editable = "packages/astchunk-leann" },
{ name = "beautifulsoup4", marker = "extra == 'documents'", specifier = ">=4.13.0" }, { name = "beautifulsoup4", marker = "extra == 'documents'", specifier = ">=4.13.0" },
{ name = "black", marker = "extra == 'dev'", specifier = ">=23.0" }, { name = "black", marker = "extra == 'dev'", specifier = ">=23.0" },
{ name = "boto3" }, { name = "boto3" },