Metadata filtering feature (#75)

* Metadata filtering initial version * Metadata filtering initial version * Fixes linter issues * Cleanup code * Clean up and readme * Fix after review * Use UV in example * Merge main into feature/metadata-filtering
2025-08-21 04:57:56 +02:00
parent dde2221513
commit 31b4973141
8 changed files with 2770 additions and 1231 deletions
--- a/docs/metadata_filtering.md
+++ b/docs/metadata_filtering.md
@@ -0,0 +1,300 @@
+# LEANN Metadata Filtering Usage Guide
+
+## Overview
+
+LEANN supports metadata filtering: you can restrict search results using arbitrary metadata fields attached to each chunk at indexing time. This enables use cases such as spoiler-free book search, document filtering by date or type, and code search by file type.
+
+## Basic Usage
+
+### Adding Metadata to Your Documents
+
+When building your index, add metadata to each text chunk:
+
+```python
+from leann.api import LeannBuilder
+
+builder = LeannBuilder("hnsw")
+
+# Add text with metadata
+builder.add_text(
+    text="Chapter 1: Alice falls down the rabbit hole",
+    metadata={
+        "chapter": 1,
+        "character": "Alice",
+        "themes": ["adventure", "curiosity"],
+        "word_count": 150,
+        "spoiler_level": "low"  # Used by the search example below
+    }
+)
+
+builder.build_index("alice_in_wonderland_index")
+```
+
+### Searching with Metadata Filters
+
+Use the `metadata_filters` parameter in search calls:
+
+```python
+from leann.api import LeannSearcher
+
+searcher = LeannSearcher("alice_in_wonderland_index")
+
+# Search with filters
+results = searcher.search(
+    query="What happens to Alice?",
+    top_k=10,
+    metadata_filters={
+        "chapter": {"<=": 5},           # Only chapters 1-5
+        "spoiler_level": {"!=": "high"} # No high spoilers
+    }
+)
+```
+
+## Filter Syntax
+
+### Basic Structure
+
+```python
+metadata_filters = {
+    "field_name": {"operator": value},
+    "another_field": {"operator": value}
+}
+```
+
+### Supported Operators
+
+#### Comparison Operators
+- `"=="`: Equal to
+- `"!="`: Not equal to
+- `"<"`: Less than
+- `"<="`: Less than or equal
+- `">"`: Greater than
+- `">="`: Greater than or equal
+
+```python
+# Examples
+{"chapter": {"==": 1}}           # Exactly chapter 1
+{"page": {">": 100}}            # Pages after 100
+{"rating": {">=": 4.0}}         # Rating 4.0 or higher
+{"word_count": {"<": 500}}      # Short passages
+```
+
+#### Membership Operators
+- `"in"`: Value is in list
+- `"not_in"`: Value is not in list
+
+```python
+# Examples
+{"character": {"in": ["Alice", "Bob"]}}      # Alice OR Bob
+{"genre": {"not_in": ["horror", "thriller"]}} # Exclude genres
+{"tags": {"in": ["fiction", "adventure"]}}   # Any of these tags
+```
+
+#### String Operators
+- `"contains"`: String contains substring
+- `"starts_with"`: String starts with prefix
+- `"ends_with"`: String ends with suffix
+
+```python
+# Examples
+{"title": {"contains": "alice"}}        # Title contains "alice"
+{"filename": {"ends_with": ".py"}}      # Python files
+{"author": {"starts_with": "Dr."}}      # Authors with "Dr." prefix
+```
+
+#### Boolean Operators
+- `"is_true"`: Field is truthy
+- `"is_false"`: Field is falsy
+
+```python
+# Examples
+{"is_published": {"is_true": True}}     # Published content
+{"is_draft": {"is_false": False}}       # Not drafts
+```
+
+### Multiple Operators on Same Field
+
+You can apply multiple operators to the same field (AND logic):
+
+```python
+metadata_filters = {
+    "word_count": {
+        ">=": 100,    # At least 100 words
+        "<=": 500     # At most 500 words
+    }
+}
+```
+
+### Compound Filters
+
+Multiple fields are combined with AND logic:
+
+```python
+metadata_filters = {
+    "chapter": {"<=": 10},              # Up to chapter 10
+    "character": {"==": "Alice"},       # About Alice
+    "spoiler_level": {"!=": "high"}     # No major spoilers
+}
+```
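+
+To make the AND semantics concrete, here is a minimal sketch of how such a filter dictionary could be evaluated against one chunk's metadata. It is an illustration only, not LEANN's actual implementation: the `matches` function, the `OPERATORS` table, and the decision to treat a missing field as a non-match are all assumptions.
+
+```python
+import operator
+
+# Hypothetical operator table; the keys mirror the operators documented above.
+OPERATORS = {
+    "==": operator.eq,
+    "!=": operator.ne,
+    "<": operator.lt,
+    "<=": operator.le,
+    ">": operator.gt,
+    ">=": operator.ge,
+    "in": lambda value, expected: value in expected,
+    "not_in": lambda value, expected: value not in expected,
+    "contains": lambda value, expected: expected in value,
+    "starts_with": lambda value, expected: value.startswith(expected),
+    "ends_with": lambda value, expected: value.endswith(expected),
+    "is_true": lambda value, _expected: bool(value),
+    "is_false": lambda value, _expected: not value,
+}
+
+
+def matches(metadata, metadata_filters):
+    """Return True only if every condition on every field holds (AND logic)."""
+    for field, conditions in metadata_filters.items():
+        if field not in metadata:
+            return False  # Assumption: a missing field never matches.
+        value = metadata[field]
+        for op_name, expected in conditions.items():
+            if not OPERATORS[op_name](value, expected):
+                return False
+    return True
+
+
+chunk = {"chapter": 3, "character": "Alice", "spoiler_level": "low"}
+filters = {
+    "chapter": {">=": 1, "<=": 10},
+    "character": {"==": "Alice"},
+    "spoiler_level": {"!=": "high"},
+}
+print(matches(chunk, filters))                  # True
+print(matches(chunk, {"chapter": {">": 5}}))    # False
+```
+
+Every condition must pass for the chunk to be kept, so adding fields or operators can only narrow the result set.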
+
+## Use Case Examples
+
+### 1. Spoiler-Free Book Search
+
+```python
+# Reader has only read up to chapter 5
+def search_spoiler_free(query, max_chapter):
+    return searcher.search(
+        query=query,
+        metadata_filters={
+            "chapter": {"<=": max_chapter},
+            "spoiler_level": {"in": ["none", "low"]}
+        }
+    )
+
+results = search_spoiler_free("What happens to Alice?", max_chapter=5)
+```
+
+### 2. Document Management by Date
+
+```python
+# Find recent documents
+recent_docs = searcher.search(
+    query="project updates",
+    metadata_filters={
+        "date": {">=": "2024-01-01"},
+        "document_type": {"==": "report"}
+    }
+)
+```
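+
+If string values are compared with ordinary Python string ordering (an assumption; check the implementation before relying on it), a date filter like the one above is only reliable when dates are stored as zero-padded ISO 8601 strings, because their lexicographic order matches chronological order. A quick self-contained check:
+
+```python
+from datetime import date
+
+# Zero-padded ISO 8601 strings sort the same way as the dates they represent.
+iso_dates = ["2023-12-31", "2024-01-01", "2024-10-05", "2024-02-15"]
+by_string = sorted(iso_dates)
+by_date = [d.isoformat() for d in sorted(map(date.fromisoformat, iso_dates))]
+print(by_string == by_date)                    # True
+
+# Without zero padding, string order no longer matches date order.
+print("2024-2-15" >= "2024-10-05")             # True, although February precedes October
+print("2024-02-15" >= "2024-10-05")            # False, as expected
+```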
+
+### 3. Code Search by File Type
+
+```python
+# Search only Python files
+python_code = searcher.search(
+    query="authentication function",
+    metadata_filters={
+        "file_extension": {"==": ".py"},
+        "lines_of_code": {"<": 100}
+    }
+)
+```
+
+### 4. Content Filtering by Audience
+
+```python
+# Age-appropriate content
+family_content = searcher.search(
+    query="adventure stories",
+    metadata_filters={
+        "age_rating": {"in": ["G", "PG"]},
+        "content_warnings": {"not_in": ["violence", "adult_themes"]}
+    }
+)
+```
+
+### 5. Multi-Book Series Management
+
+```python
+# Search across first 3 books only
+early_series = searcher.search(
+    query="character development",
+    metadata_filters={
+        "series": {"==": "Harry Potter"},
+        "book_number": {"<=": 3}
+    }
+)
+```
+
+## Running the Example
+
+You can see metadata filtering in action with our spoiler-free book RAG example:
+
+```bash
+# Don't forget to set up the environment
+uv venv
+source .venv/bin/activate
+
+# Set your OpenAI API key (required for embeddings; alternatively, edit the example locally to use Ollama)
+export OPENAI_API_KEY="your-api-key-here"
+
+# Run the spoiler-free book RAG example
+uv run examples/spoiler_free_book_rag.py
+```
+
+This example demonstrates:
+- Building an index with metadata (chapter numbers, characters, themes, locations)
+- Searching with filters to avoid spoilers (e.g., only show results up to chapter 5)
+- Different scenarios for readers at various points in the book
+
+The example uses Alice's Adventures in Wonderland as sample data and shows how you can search for information without revealing plot points from later chapters.
+
+## Advanced Patterns
+
+### Custom Chunking with Metadata
+
+```python
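+# Note: parse_chapters, extract_characters, classify_themes, assess_spoiler_level,
+# split_paragraphs, and calculate_reading_level are placeholders for your own helpers.
+def chunk_book_with_metadata(book_text, book_info):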
+    chunks = []
+
+    for chapter_num, chapter_text in parse_chapters(book_text):
+        # Extract entities, themes, etc.
+        characters = extract_characters(chapter_text)
+        themes = classify_themes(chapter_text)
+        spoiler_level = assess_spoiler_level(chapter_text, chapter_num)
+
+        # Create chunks with rich metadata
+        for paragraph in split_paragraphs(chapter_text):
+            chunks.append({
+                "text": paragraph,
+                "metadata": {
+                    "book_title": book_info["title"],
+                    "chapter": chapter_num,
+                    "characters": characters,
+                    "themes": themes,
+                    "spoiler_level": spoiler_level,
+                    "word_count": len(paragraph.split()),
+                    "reading_level": calculate_reading_level(paragraph)
+                }
+            })
+
+    return chunks
+```
+
+## Performance Considerations
+
+### Efficient Filtering Strategies
+
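+1. **Post-search filtering**: Filters are applied after the vector search, so their cost scales with the number of candidates returned rather than with the size of the index. This should be efficient for typical result sets (10-100 results).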
+
+2. **Metadata design**: Keep metadata fields simple and avoid deeply nested structures.
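+
+Because filters run after the vector search, a restrictive filter may, in principle, leave fewer results than requested. One common workaround is to over-fetch candidates and truncate after filtering. The sketch below illustrates the idea with plain Python stand-ins; `filtered_search`, `search_fn`, `predicate`, and `overfetch_factor` are hypothetical names, and nothing here describes how LEANN handles this internally.
+
+```python
+def filtered_search(search_fn, query, top_k, predicate, overfetch_factor=5):
+    """Fetch extra candidates, keep those passing the predicate, return at most top_k."""
+    candidates = search_fn(query, top_k * overfetch_factor)
+    kept = [c for c in candidates if predicate(c["metadata"])]
+    return kept[:top_k]
+
+
+# Stand-in for a vector search: returns the first k of 20 pre-ranked fake passages.
+def fake_search(query, k):
+    corpus = [{"text": f"passage {i}", "metadata": {"chapter": i}} for i in range(1, 21)]
+    return corpus[:k]
+
+
+results = filtered_search(
+    fake_search,
+    query="What happens to Alice?",
+    top_k=3,
+    predicate=lambda md: md["chapter"] % 2 == 0,  # Keep even chapters only
+)
+print([r["metadata"]["chapter"] for r in results])  # [2, 4, 6]
+```
+
+A larger `overfetch_factor` makes it more likely that enough candidates survive the filter, at the cost of a more expensive vector search.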
+
+### Best Practices
+
+1. **Consistent metadata schema**: Use consistent field names and value types across your documents.
+
+2. **Compact metadata**: Store only the fields you intend to filter on or display; large values add storage overhead.
+
+3. **Type consistency**: Use consistent data types for the same fields (e.g., always integers for chapter numbers).
+
+4. **Index multiple granularities**: Consider chunking at different levels (paragraph, section, chapter) with appropriate metadata.
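+
+One way to enforce the schema and type consistency recommended above is to validate chunks before indexing. The helper below is a hypothetical sketch and not part of LEANN: `find_type_conflicts` reports every metadata field that appears with more than one Python type.
+
+```python
+from collections import defaultdict
+
+
+def find_type_conflicts(chunks):
+    """Collect the types seen for each metadata field; return only conflicting fields."""
+    seen = defaultdict(set)
+    for chunk in chunks:
+        for field, value in chunk["metadata"].items():
+            seen[field].add(type(value).__name__)
+    return {field: sorted(types) for field, types in seen.items() if len(types) > 1}
+
+
+chunks = [
+    {"text": "first passage", "metadata": {"chapter": 1, "character": "Alice"}},
+    {"text": "second passage", "metadata": {"chapter": "2", "character": "Alice"}},  # str, not int
+]
+print(find_type_conflicts(chunks))  # {'chapter': ['int', 'str']}
+```
+
+Running a check like this before `build_index` catches mixed types early, before they cause comparison operators such as `<=` to behave unexpectedly.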
+
+### Adding Metadata to Existing Indices
+
+To use metadata filtering with an existing index, you need to rebuild it with metadata attached to each chunk:
+
+```python
+from leann.api import LeannBuilder
+
+# Attach metadata to existing passages.
+# `existing_chunks` and `extract_metadata` are placeholders for your own data and logic.
+def add_metadata_to_existing_chunks(chunks):
+    for chunk in chunks:
+        # Extract or assign metadata based on content
+        chunk["metadata"] = extract_metadata(chunk["text"])
+    return chunks
+
+# Rebuild index with metadata
+enhanced_chunks = add_metadata_to_existing_chunks(existing_chunks)
+builder = LeannBuilder("hnsw")
+for chunk in enhanced_chunks:
+    builder.add_text(text=chunk["text"], metadata=chunk["metadata"])
+builder.build_index("enhanced_index")
+```