Compare commits

..

6 Commits

Author SHA1 Message Date
yichuan520030910320
72c22e93e1 chore(submodule): bump astchunk-leann to latest main 2025-09-08 15:04:43 -07:00
yichuan520030910320
b2eb34dc17 submodule(packages/astchunk-leann): track main branch in .gitmodules 2025-09-08 15:02:36 -07:00
yichuan520030910320
2d58650b04 finish ast fork 2025-09-08 14:34:41 -07:00
yichuan520030910320
950728bd25 Move astchunk submodule under packages/ and update path in .gitmodules 2025-09-08 14:26:01 -07:00
Aiden Huang
e309f292de docs(mcp): add root llms.txt for MCP discovery; update MCP README to reference it; refs #76 (#91) 2025-09-07 14:39:58 -07:00
AWS Mcleod
0d9f92ea0f Add grep search functionality - Issue #86 (#87)
* Add grep search functionality to LeannSearcher

- Add use_grep parameter to search method
- Implement grep-based search on .jsonl files
- Add fallback Python regex search
- Support same SearchResult format as semantic search

Addresses issue #86

* fix: resolve linting errors

* docs: add grep search example

* docs: add grep search to README examples

* refactor: remove regex fallback, move grep example to features section

* docs: add grep search to Advanced Features with comprehensive guide
2025-09-05 13:48:07 -07:00
12 changed files with 378 additions and 13 deletions

1
.gitignore vendored
View File

@@ -22,6 +22,7 @@ demo/experiment_results/**/*.json
*.sh *.sh
*.txt *.txt
!CMakeLists.txt !CMakeLists.txt
!llms.txt
latency_breakdown*.json latency_breakdown*.json
experiment_results/eval_results/diskann/*.json experiment_results/eval_results/diskann/*.json
aws/ aws/

4
.gitmodules vendored
View File

@@ -14,3 +14,7 @@
[submodule "packages/leann-backend-hnsw/third_party/libzmq"] [submodule "packages/leann-backend-hnsw/third_party/libzmq"]
path = packages/leann-backend-hnsw/third_party/libzmq path = packages/leann-backend-hnsw/third_party/libzmq
url = https://github.com/zeromq/libzmq.git url = https://github.com/zeromq/libzmq.git
[submodule "packages/astchunk-leann"]
path = packages/astchunk-leann
url = git@github.com:yichuan-w/astchunk-leann.git
branch = main

View File

@@ -656,6 +656,19 @@ results = searcher.search(
📖 **[Complete Metadata filtering guide →](docs/metadata_filtering.md)** 📖 **[Complete Metadata filtering guide →](docs/metadata_filtering.md)**
### 🔍 Grep Search
For exact text matching instead of semantic search, use the `use_grep` parameter:
```python
# Exact text search
results = searcher.search("bananacrocodile", use_grep=True, top_k=1)
```
**Use cases**: Finding specific code patterns, error messages, function names, or exact phrases where semantic similarity isn't needed.
📖 **[Complete grep search guide →](docs/grep_search.md)**
## 🏗️ Architecture & How It Works ## 🏗️ Architecture & How It Works
<p align="center"> <p align="center">

View File

@@ -26,6 +26,21 @@ leann build my-code-index --docs ./src --use-ast-chunking
uv pip install -e "." uv pip install -e "."
``` ```
#### For normal users (PyPI install)
- Use `pip install leann` or `uv pip install leann`.
- `astchunk` is pulled automatically from PyPI as a dependency; no extra steps.
#### For developers (from source, editable)
```bash
git clone https://github.com/yichuan-w/LEANN.git leann
cd leann
git submodule update --init --recursive
uv sync
```
- This repo vendors `astchunk` as a git submodule at `packages/astchunk-leann` (our fork).
- `[tool.uv.sources]` maps the `astchunk` package to that path in editable mode.
- You can edit code under `packages/astchunk-leann` and Python will use your changes immediately (no separate `pip install astchunk` needed).
## Best Practices ## Best Practices
### When to Use AST Chunking ### When to Use AST Chunking

149
docs/grep_search.md Normal file
View File

@@ -0,0 +1,149 @@
# LEANN Grep Search Usage Guide
## Overview
LEANN's grep search functionality provides exact text matching for finding specific code patterns, error messages, function names, or exact phrases in your indexed documents.
## Basic Usage
### Simple Grep Search
```python
from leann.api import LeannSearcher
searcher = LeannSearcher("your_index_path")
# Exact text search
results = searcher.search("def authenticate_user", use_grep=True, top_k=5)
for result in results:
print(f"Score: {result.score}")
print(f"Text: {result.text[:100]}...")
print("-" * 40)
```
### Comparison: Semantic vs Grep Search
```python
# Semantic search - finds conceptually similar content
semantic_results = searcher.search("machine learning algorithms", top_k=3)
# Grep search - finds exact text matches
grep_results = searcher.search("def train_model", use_grep=True, top_k=3)
```
## When to Use Grep Search
### Use Cases
- **Code Search**: Finding specific function definitions, class names, or variable references
- **Error Debugging**: Locating exact error messages or stack traces
- **Documentation**: Finding specific API endpoints or exact terminology
### Examples
```python
# Find function definitions
functions = searcher.search("def __init__", use_grep=True)
# Find import statements
imports = searcher.search("from sklearn import", use_grep=True)
# Find specific error types
errors = searcher.search("FileNotFoundError", use_grep=True)
# Find TODO comments
todos = searcher.search("TODO:", use_grep=True)
# Find configuration entries
configs = searcher.search("server_port=", use_grep=True)
```
## Technical Details
### How It Works
1. **File Location**: Grep search operates on the raw text stored in `.jsonl` files
2. **Command Execution**: Uses the system `grep` command with case-insensitive search
3. **Result Processing**: Parses JSON lines and extracts text and metadata
4. **Scoring**: Simple frequency-based scoring based on query term occurrences
### Search Process
```
Query: "def train_model"
grep -i -n "def train_model" documents.leann.passages.jsonl
Parse matching JSON lines
Calculate scores based on term frequency
Return top_k results
```
### Scoring Algorithm
```python
# Term frequency in document
score = text.lower().count(query.lower())
```
Results are ranked by score (highest first), with higher scores indicating more occurrences of the search term.
## Error Handling
### Common Issues
#### Grep Command Not Found
```
RuntimeError: grep command not found. Please install grep or use semantic search.
```
**Solution**: Install grep on your system:
- **Ubuntu/Debian**: `sudo apt-get install grep`
- **macOS**: grep is pre-installed
- **Windows**: Use WSL or install grep via Git Bash/MSYS2
#### No Results Found
```python
# Check if your query exists in the raw data
results = searcher.search("your_query", use_grep=True)
if not results:
print("No exact matches found. Try:")
print("1. Check spelling and case")
print("2. Use partial terms")
print("3. Switch to semantic search")
```
## Complete Example
```python
#!/usr/bin/env python3
"""
Grep Search Example
Demonstrates grep search for exact text matching.
"""
from leann.api import LeannSearcher
def demonstrate_grep_search():
# Initialize searcher
searcher = LeannSearcher("my_index")
print("=== Function Search ===")
functions = searcher.search("def __init__", use_grep=True, top_k=5)
for i, result in enumerate(functions, 1):
print(f"{i}. Score: {result.score}")
print(f" Preview: {result.text[:60]}...")
print()
print("=== Error Search ===")
errors = searcher.search("FileNotFoundError", use_grep=True, top_k=3)
for result in errors:
print(f"Content: {result.text.strip()}")
print("-" * 40)
if __name__ == "__main__":
demonstrate_grep_search()
```

View File

@@ -0,0 +1,35 @@
"""
Grep Search Example
Shows how to use grep-based text search instead of semantic search.
Useful when you need exact text matches rather than meaning-based results.
"""
from leann import LeannSearcher
# Load your index
searcher = LeannSearcher("my-documents.leann")
# Regular semantic search
print("=== Semantic Search ===")
results = searcher.search("machine learning algorithms", top_k=3)
for result in results:
print(f"Score: {result.score:.3f}")
print(f"Text: {result.text[:80]}...")
print()
# Grep-based search for exact text matches
print("=== Grep Search ===")
results = searcher.search("def train_model", top_k=3, use_grep=True)
for result in results:
print(f"Score: {result.score}")
print(f"Text: {result.text[:80]}...")
print()
# Find specific error messages
error_results = searcher.search("FileNotFoundError", use_grep=True)
print(f"Found {len(error_results)} files mentioning FileNotFoundError")
# Search for function definitions
func_results = searcher.search("class SearchResult", use_grep=True, top_k=5)
print(f"Found {len(func_results)} class definitions")

28
llms.txt Normal file
View File

@@ -0,0 +1,28 @@
# llms.txt — LEANN MCP and Agent Integration
product: LEANN
homepage: https://github.com/yichuan-w/LEANN
contact: https://github.com/yichuan-w/LEANN/issues
# Installation
install: uv tool install leann-core --with leann
# MCP Server Entry Point
mcp.server: leann_mcp
mcp.protocol_version: 2024-11-05
# Tools
mcp.tools: leann_list, leann_search
mcp.tool.leann_list.description: List available LEANN indexes
mcp.tool.leann_list.input: {}
mcp.tool.leann_search.description: Semantic search across a named LEANN index
mcp.tool.leann_search.input.index_name: string, required
mcp.tool.leann_search.input.query: string, required
mcp.tool.leann_search.input.top_k: integer, optional, default=5, min=1, max=20
mcp.tool.leann_search.input.complexity: integer, optional, default=32, min=16, max=128
# Notes
note: Build indexes with `leann build <name> --docs <files...>` before searching.
example.add: claude mcp add --scope user leann-server -- leann_mcp
example.verify: claude mcp list | cat

View File

@@ -6,6 +6,8 @@ with the correct, original embedding logic from the user's reference code.
import json import json
import logging import logging
import pickle import pickle
import re
import subprocess
import time import time
import warnings import warnings
from dataclasses import dataclass, field from dataclasses import dataclass, field
@@ -653,6 +655,7 @@ class LeannSearcher:
expected_zmq_port: int = 5557, expected_zmq_port: int = 5557,
metadata_filters: Optional[dict[str, dict[str, Union[str, int, float, bool, list]]]] = None, metadata_filters: Optional[dict[str, dict[str, Union[str, int, float, bool, list]]]] = None,
batch_size: int = 0, batch_size: int = 0,
use_grep: bool = False,
**kwargs, **kwargs,
) -> list[SearchResult]: ) -> list[SearchResult]:
""" """
@@ -679,6 +682,10 @@ class LeannSearcher:
Returns: Returns:
List of SearchResult objects with text, metadata, and similarity scores List of SearchResult objects with text, metadata, and similarity scores
""" """
# Handle grep search
if use_grep:
return self._grep_search(query, top_k)
logger.info("🔍 LeannSearcher.search() called:") logger.info("🔍 LeannSearcher.search() called:")
logger.info(f" Query: '{query}'") logger.info(f" Query: '{query}'")
logger.info(f" Top_k: {top_k}") logger.info(f" Top_k: {top_k}")
@@ -795,9 +802,96 @@ class LeannSearcher:
logger.info(f" {GREEN}✓ Final enriched results: {len(enriched_results)} passages{RESET}") logger.info(f" {GREEN}✓ Final enriched results: {len(enriched_results)} passages{RESET}")
return enriched_results return enriched_results
def _find_jsonl_file(self) -> Optional[str]:
"""Find the .jsonl file containing raw passages for grep search"""
index_path = Path(self.meta_path_str).parent
potential_files = [
index_path / "documents.leann.passages.jsonl",
index_path.parent / "documents.leann.passages.jsonl",
]
for file_path in potential_files:
if file_path.exists():
return str(file_path)
return None
def _grep_search(self, query: str, top_k: int = 5) -> list[SearchResult]:
"""Perform grep-based search on raw passages"""
jsonl_file = self._find_jsonl_file()
if not jsonl_file:
raise FileNotFoundError("No .jsonl passages file found for grep search")
try:
cmd = ["grep", "-i", "-n", query, jsonl_file]
result = subprocess.run(cmd, capture_output=True, text=True, check=False)
if result.returncode == 1:
return []
elif result.returncode != 0:
raise RuntimeError(f"Grep failed: {result.stderr}")
matches = []
for line in result.stdout.strip().split("\n"):
if not line:
continue
parts = line.split(":", 1)
if len(parts) != 2:
continue
try:
data = json.loads(parts[1])
text = data.get("text", "")
score = text.lower().count(query.lower())
matches.append(
SearchResult(
id=data.get("id", parts[0]),
text=text,
metadata=data.get("metadata", {}),
score=float(score),
)
)
except json.JSONDecodeError:
continue
matches.sort(key=lambda x: x.score, reverse=True)
return matches[:top_k]
except FileNotFoundError:
raise RuntimeError(
"grep command not found. Please install grep or use semantic search."
)
def _python_regex_search(self, query: str, top_k: int = 5) -> list[SearchResult]:
"""Fallback regex search"""
jsonl_file = self._find_jsonl_file()
if not jsonl_file:
raise FileNotFoundError("No .jsonl file found")
pattern = re.compile(re.escape(query), re.IGNORECASE)
matches = []
with open(jsonl_file, encoding="utf-8") as f:
for line_num, line in enumerate(f, 1):
if pattern.search(line):
try:
data = json.loads(line.strip())
matches.append(
SearchResult(
id=data.get("id", str(line_num)),
text=data.get("text", ""),
metadata=data.get("metadata", {}),
score=float(len(pattern.findall(data.get("text", "")))),
)
)
except json.JSONDecodeError:
continue
matches.sort(key=lambda x: x.score, reverse=True)
return matches[:top_k]
def cleanup(self): def cleanup(self):
"""Explicitly cleanup embedding server resources. """Explicitly cleanup embedding server resources.
This method should be called after you're done using the searcher, This method should be called after you're done using the searcher,
especially in test environments or batch processing scenarios. especially in test environments or batch processing scenarios.
""" """
@@ -853,6 +947,7 @@ class LeannChat:
expected_zmq_port: int = 5557, expected_zmq_port: int = 5557,
metadata_filters: Optional[dict[str, dict[str, Union[str, int, float, bool, list]]]] = None, metadata_filters: Optional[dict[str, dict[str, Union[str, int, float, bool, list]]]] = None,
batch_size: int = 0, batch_size: int = 0,
use_grep: bool = False,
**search_kwargs, **search_kwargs,
): ):
if llm_kwargs is None: if llm_kwargs is None:

View File

@@ -2,6 +2,8 @@
Transform your development workflow with intelligent code assistance using LEANN's semantic search directly in Claude Code. Transform your development workflow with intelligent code assistance using LEANN's semantic search directly in Claude Code.
For agent-facing discovery details, see `llms.txt` in the repository root.
## Prerequisites ## Prerequisites
Install LEANN globally for MCP integration (with default backend): Install LEANN globally for MCP integration (with default backend):

View File

@@ -99,6 +99,7 @@ wechat-exporter = "wechat_exporter.main:main"
leann-core = { path = "packages/leann-core", editable = true } leann-core = { path = "packages/leann-core", editable = true }
leann-backend-diskann = { path = "packages/leann-backend-diskann", editable = true } leann-backend-diskann = { path = "packages/leann-backend-diskann", editable = true }
leann-backend-hnsw = { path = "packages/leann-backend-hnsw", editable = true } leann-backend-hnsw = { path = "packages/leann-backend-hnsw", editable = true }
astchunk = { path = "packages/astchunk-leann", editable = true }
[tool.ruff] [tool.ruff]
target-version = "py39" target-version = "py39"

45
uv.lock generated
View File

@@ -1,5 +1,5 @@
version = 1 version = 1
revision = 3 revision = 2
requires-python = ">=3.9" requires-python = ">=3.9"
resolution-markers = [ resolution-markers = [
"python_full_version >= '3.12'", "python_full_version >= '3.12'",
@@ -201,7 +201,7 @@ wheels = [
[[package]] [[package]]
name = "astchunk" name = "astchunk"
version = "0.1.0" version = "0.1.0"
source = { registry = "https://pypi.org/simple" } source = { editable = "packages/astchunk-leann" }
dependencies = [ dependencies = [
{ name = "numpy", version = "2.0.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" }, { name = "numpy", version = "2.0.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.10'" },
{ name = "numpy", version = "2.2.6", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version == '3.10.*'" }, { name = "numpy", version = "2.2.6", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version == '3.10.*'" },
@@ -214,10 +214,31 @@ dependencies = [
{ name = "tree-sitter-python" }, { name = "tree-sitter-python" },
{ name = "tree-sitter-typescript" }, { name = "tree-sitter-typescript" },
] ]
sdist = { url = "https://files.pythonhosted.org/packages/db/2a/7a35e2fac7d550265ae2ee40651425083b37555f921d1a1b77c3f525e0df/astchunk-0.1.0.tar.gz", hash = "sha256:f4dff0ef8b3b3bcfeac363384db1e153f74d4c825dc2e35864abfab027713be4", size = 18093, upload-time = "2025-06-19T04:37:25.34Z" }
wheels = [ [package.metadata]
{ url = "https://files.pythonhosted.org/packages/be/84/5433ab0e933b572750cb16fd7edf3d6c7902b069461a22ec670042752a4d/astchunk-0.1.0-py3-none-any.whl", hash = "sha256:33ada9fc3620807fdda5846fa1948af463f281a60e0d43d4f3782b6dbb416d24", size = 15396, upload-time = "2025-06-19T04:37:23.87Z" }, requires-dist = [
{ name = "black", marker = "extra == 'dev'", specifier = ">=22.0.0" },
{ name = "flake8", marker = "extra == 'dev'", specifier = ">=5.0.0" },
{ name = "isort", marker = "extra == 'dev'", specifier = ">=5.10.0" },
{ name = "mypy", marker = "extra == 'dev'", specifier = ">=1.0.0" },
{ name = "myst-parser", marker = "extra == 'docs'", specifier = ">=0.18.0" },
{ name = "numpy", specifier = ">=1.20.0" },
{ name = "pre-commit", marker = "extra == 'dev'", specifier = ">=2.20.0" },
{ name = "pyrsistent", specifier = ">=0.18.0" },
{ name = "pytest", marker = "extra == 'dev'", specifier = ">=7.0.0" },
{ name = "pytest", marker = "extra == 'test'", specifier = ">=7.0.0" },
{ name = "pytest-cov", marker = "extra == 'dev'", specifier = ">=4.0.0" },
{ name = "pytest-cov", marker = "extra == 'test'", specifier = ">=4.0.0" },
{ name = "pytest-xdist", marker = "extra == 'test'", specifier = ">=2.5.0" },
{ name = "sphinx", marker = "extra == 'docs'", specifier = ">=5.0.0" },
{ name = "sphinx-rtd-theme", marker = "extra == 'docs'", specifier = ">=1.0.0" },
{ name = "tree-sitter", specifier = ">=0.20.0" },
{ name = "tree-sitter-c-sharp", specifier = ">=0.20.0" },
{ name = "tree-sitter-java", specifier = ">=0.20.0" },
{ name = "tree-sitter-python", specifier = ">=0.20.0" },
{ name = "tree-sitter-typescript", specifier = ">=0.20.0" },
] ]
provides-extras = ["dev", "docs", "test"]
[[package]] [[package]]
name = "asttokens" name = "asttokens"
@@ -1564,7 +1585,7 @@ name = "importlib-metadata"
version = "8.7.0" version = "8.7.0"
source = { registry = "https://pypi.org/simple" } source = { registry = "https://pypi.org/simple" }
dependencies = [ dependencies = [
{ name = "zipp" }, { name = "zipp", marker = "python_full_version < '3.10'" },
] ]
sdist = { url = "https://files.pythonhosted.org/packages/76/66/650a33bd90f786193e4de4b3ad86ea60b53c89b669a5c7be931fac31cdb0/importlib_metadata-8.7.0.tar.gz", hash = "sha256:d13b81ad223b890aa16c5471f2ac3056cf76c5f10f82d6f9292f0b415f389000", size = 56641, upload-time = "2025-04-27T15:29:01.736Z" } sdist = { url = "https://files.pythonhosted.org/packages/76/66/650a33bd90f786193e4de4b3ad86ea60b53c89b669a5c7be931fac31cdb0/importlib_metadata-8.7.0.tar.gz", hash = "sha256:d13b81ad223b890aa16c5471f2ac3056cf76c5f10f82d6f9292f0b415f389000", size = 56641, upload-time = "2025-04-27T15:29:01.736Z" }
wheels = [ wheels = [
@@ -2117,7 +2138,7 @@ wheels = [
[[package]] [[package]]
name = "leann-backend-diskann" name = "leann-backend-diskann"
version = "0.3.2" version = "0.3.3"
source = { editable = "packages/leann-backend-diskann" } source = { editable = "packages/leann-backend-diskann" }
dependencies = [ dependencies = [
{ name = "leann-core" }, { name = "leann-core" },
@@ -2129,14 +2150,14 @@ dependencies = [
[package.metadata] [package.metadata]
requires-dist = [ requires-dist = [
{ name = "leann-core", specifier = "==0.3.2" }, { name = "leann-core", specifier = "==0.3.3" },
{ name = "numpy" }, { name = "numpy" },
{ name = "protobuf", specifier = ">=3.19.0" }, { name = "protobuf", specifier = ">=3.19.0" },
] ]
[[package]] [[package]]
name = "leann-backend-hnsw" name = "leann-backend-hnsw"
version = "0.3.2" version = "0.3.3"
source = { editable = "packages/leann-backend-hnsw" } source = { editable = "packages/leann-backend-hnsw" }
dependencies = [ dependencies = [
{ name = "leann-core" }, { name = "leann-core" },
@@ -2149,7 +2170,7 @@ dependencies = [
[package.metadata] [package.metadata]
requires-dist = [ requires-dist = [
{ name = "leann-core", specifier = "==0.3.2" }, { name = "leann-core", specifier = "==0.3.3" },
{ name = "msgpack", specifier = ">=1.0.0" }, { name = "msgpack", specifier = ">=1.0.0" },
{ name = "numpy" }, { name = "numpy" },
{ name = "pyzmq", specifier = ">=23.0.0" }, { name = "pyzmq", specifier = ">=23.0.0" },
@@ -2157,7 +2178,7 @@ requires-dist = [
[[package]] [[package]]
name = "leann-core" name = "leann-core"
version = "0.3.2" version = "0.3.3"
source = { editable = "packages/leann-core" } source = { editable = "packages/leann-core" }
dependencies = [ dependencies = [
{ name = "accelerate" }, { name = "accelerate" },
@@ -2297,7 +2318,7 @@ test = [
[package.metadata] [package.metadata]
requires-dist = [ requires-dist = [
{ name = "astchunk", specifier = ">=0.1.0" }, { name = "astchunk", editable = "packages/astchunk-leann" },
{ name = "beautifulsoup4", marker = "extra == 'documents'", specifier = ">=4.13.0" }, { name = "beautifulsoup4", marker = "extra == 'documents'", specifier = ">=4.13.0" },
{ name = "black", marker = "extra == 'dev'", specifier = ">=23.0" }, { name = "black", marker = "extra == 'dev'", specifier = ">=23.0" },
{ name = "boto3" }, { name = "boto3" },