Gabriel Dehan 31b4973141 Metadata filtering feature (#75 )

* Metadata filtering initial version

* Metadata filtering initial version

* Fixes linter issues

* Cleanup code

* Clean up and readme

* Fix after review

* Use UV in example

* Merge main into feature/metadata-filtering

2025-08-20 19:57:56 -07:00

LEANN Metadata Filtering Usage Guide

Overview

LEANN supports metadata filtering: you can restrict search results by arbitrary metadata fields attached to each chunk when the index is built. This enables use cases such as spoiler-free book search, filtering documents by date or type, and code search by file type.

Basic Usage

Adding Metadata to Your Documents

When building your index, add metadata to each text chunk:

from leann.api import LeannBuilder

builder = LeannBuilder("hnsw")

# Add text with metadata
builder.add_text(
    text="Chapter 1: Alice falls down the rabbit hole",
    metadata={
        "chapter": 1,
        "character": "Alice",
        "themes": ["adventure", "curiosity"],
        "word_count": 150,
        "spoiler_level": "low"  # Used by the search filter in the next example
    }
)

builder.build_index("alice_in_wonderland_index")

Searching with Metadata Filters

Use the metadata_filters parameter in search calls:

from leann.api import LeannSearcher

searcher = LeannSearcher("alice_in_wonderland_index")

# Search with filters
results = searcher.search(
    query="What happens to Alice?",
    top_k=10,
    metadata_filters={
        "chapter": {"<=": 5},           # Only chapters 1-5
        "spoiler_level": {"!=": "high"} # No high spoilers
    }
)

Filter Syntax

Basic Structure

metadata_filters = {
    "field_name": {"operator": value},
    "another_field": {"operator": value}
}

Supported Operators

Comparison Operators

"==": Equal to
"!=": Not equal to
"<": Less than
"<=": Less than or equal
">": Greater than
">=": Greater than or equal

# Examples
{"chapter": {"==": 1}}           # Exactly chapter 1
{"page": {">": 100}}            # Pages after 100
{"rating": {">=": 4.0}}         # Rating 4.0 or higher
{"word_count": {"<": 500}}      # Short passages

Membership Operators

"in": Value is in list
"not_in": Value is not in list

# Examples
{"character": {"in": ["Alice", "Bob"]}}      # Alice OR Bob
{"genre": {"not_in": ["horror", "thriller"]}} # Exclude genres
{"tags": {"in": ["fiction", "adventure"]}}   # Any of these tags

String Operators

"contains": String contains substring
"starts_with": String starts with prefix
"ends_with": String ends with suffix

# Examples
{"title": {"contains": "alice"}}        # Title contains "alice"
{"filename": {"ends_with": ".py"}}      # Python files
{"author": {"starts_with": "Dr."}}      # Authors with "Dr." prefix

Boolean Operators

"is_true": Field is truthy
"is_false": Field is falsy

# Examples
{"is_published": {"is_true": True}}     # Published content
{"is_draft": {"is_false": False}}       # Not drafts
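
To make the operator semantics concrete, the sketch below maps each operator string to a plain Python predicate. It is an illustration only, not LEANN's internal code: `OPERATORS` and `check` are hypothetical names, string matching is assumed to be case-sensitive, and `is_true` / `is_false` are assumed to ignore their operand and test only the field's truthiness.

```python
# Illustrative operator table; this is NOT LEANN's implementation.
OPERATORS = {
    "==": lambda value, expected: value == expected,
    "!=": lambda value, expected: value != expected,
    "<": lambda value, expected: value < expected,
    "<=": lambda value, expected: value <= expected,
    ">": lambda value, expected: value > expected,
    ">=": lambda value, expected: value >= expected,
    "in": lambda value, expected: value in expected,
    "not_in": lambda value, expected: value not in expected,
    # Assumption: string operators are case-sensitive.
    "contains": lambda value, expected: expected in str(value),
    "starts_with": lambda value, expected: str(value).startswith(expected),
    "ends_with": lambda value, expected: str(value).endswith(expected),
    # Assumption: the operand is ignored; only the field's truthiness matters.
    "is_true": lambda value, _expected: bool(value),
    "is_false": lambda value, _expected: not bool(value),
}


def check(value, op, expected):
    """Evaluate one operator against one metadata value (hypothetical helper)."""
    if op not in OPERATORS:
        raise ValueError(f"Unsupported operator: {op}")
    return OPERATORS[op](value, expected)


print(check(3, "<=", 5))                       # True
print(check("Alice", "in", ["Alice", "Bob"]))  # True
print(check("auth.py", "ends_with", ".py"))    # True
print(check(False, "is_false", False))         # True
```

Under these assumptions, `{"title": {"contains": "alice"}}` would not match a title spelled "Alice", so it may be worth storing a lowercased copy of any field you intend to search by substring.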

Multiple Operators on Same Field

You can apply multiple operators to the same field (AND logic):

metadata_filters = {
    "word_count": {
        ">=": 100,    # At least 100 words
        "<=": 500     # At most 500 words
    }
}

Compound Filters

Multiple fields are combined with AND logic:

metadata_filters = {
    "chapter": {"<=": 10},              # Up to chapter 10
    "character": {"==": "Alice"},       # About Alice
    "spoiler_level": {"!=": "high"}     # No major spoilers
}
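
The AND semantics described for multiple operators and multiple fields can be sketched as a nested loop that fails as soon as any condition fails. This is a minimal illustration, not the library's code: `COMPARE` and `matches` are hypothetical names, only the comparison operators are covered, and a chunk that lacks a filtered field is assumed not to match.

```python
import operator

# Small operator subset for illustration; keys mirror the filter syntax.
COMPARE = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def matches(metadata, metadata_filters):
    """Return True only if every operator on every field passes (AND logic).

    Assumption: a chunk that lacks a filtered field does not match.
    """
    for field, conditions in metadata_filters.items():
        if field not in metadata:
            return False
        for op, expected in conditions.items():
            if not COMPARE[op](metadata[field], expected):
                return False
    return True


chunk = {"chapter": 3, "character": "Alice", "word_count": 250}
filters = {
    "chapter": {"<=": 10},
    "character": {"==": "Alice"},
    "word_count": {">=": 100, "<=": 500},
}
print(matches(chunk, filters))                      # True
print(matches(chunk, {"word_count": {">=": 300}}))  # False
```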

Use Case Examples

1. Spoiler-Free Book Search

# Reader has only read up to chapter 5
def search_spoiler_free(query, max_chapter):
    return searcher.search(
        query=query,
        metadata_filters={
            "chapter": {"<=": max_chapter},
            "spoiler_level": {"in": ["none", "low"]}
        }
    )

results = search_spoiler_free("What happens to Alice?", max_chapter=5)

2. Document Management by Date

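# Find recent documents
# Dates stored as zero-padded ISO 8601 strings (YYYY-MM-DD) sort chronologically
# when compared as strings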
recent_docs = searcher.search(
    query="project updates",
    metadata_filters={
        "date": {">=": "2024-01-01"},
        "document_type": {"==": "report"}
    }
)
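
A date filter like the one above depends on how dates are stored. Assuming the comparison operators fall back to ordinary Python string comparison, zero-padded ISO 8601 strings order correctly while other formats do not. The sketch below shows a hypothetical `to_iso_date` helper for normalizing dates before they are added as metadata.

```python
from datetime import date, datetime


def to_iso_date(value):
    """Normalize a date, datetime, or ISO date string to 'YYYY-MM-DD'."""
    # datetime is a subclass of date, so it must be checked first.
    if isinstance(value, datetime):
        return value.date().isoformat()
    if isinstance(value, date):
        return value.isoformat()
    # Raises ValueError if the string is not a valid ISO date.
    return date.fromisoformat(value).isoformat()


print(to_iso_date(date(2024, 3, 9)))  # 2024-03-09
print("2024-03-09" >= "2024-01-01")   # True
print("2023-12-31" >= "2024-01-01")   # False
# Without zero padding the string ordering is wrong:
print("2024-3-9" >= "2024-10-01")     # True, although March precedes October
```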

3. Code Search by File Type

# Search only Python files
python_code = searcher.search(
    query="authentication function",
    metadata_filters={
        "file_extension": {"==": ".py"},
        "lines_of_code": {"<": 100}
    }
)

4. Content Filtering by Audience

# Age-appropriate content
family_content = searcher.search(
    query="adventure stories",
    metadata_filters={
        "age_rating": {"in": ["G", "PG"]},
        "content_warnings": {"not_in": ["violence", "adult_themes"]}
    }
)

5. Multi-Book Series Management

# Search across first 3 books only
early_series = searcher.search(
    query="character development",
    metadata_filters={
        "series": {"==": "Harry Potter"},
        "book_number": {"<=": 3}
    }
)

Running the Example

You can see metadata filtering in action with our spoiler-free book RAG example:

# Don't forget to set up the environment
uv venv
source .venv/bin/activate

# Set your OpenAI API key (required for embeddings; to use Ollama instead, edit the example locally)
export OPENAI_API_KEY="your-api-key-here"

# Run the spoiler-free book RAG example
uv run examples/spoiler_free_book_rag.py

This example demonstrates:

Building an index with metadata (chapter numbers, characters, themes, locations)
Searching with filters to avoid spoilers (e.g., only show results up to chapter 5)
Different scenarios for readers at various points in the book

The example uses Alice's Adventures in Wonderland as sample data and shows how you can search for information without revealing plot points from later chapters.

Advanced Patterns

Custom Chunking with Metadata

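# parse_chapters, extract_characters, classify_themes, assess_spoiler_level,
# split_paragraphs, and calculate_reading_level are placeholders for your own helpers.
def chunk_book_with_metadata(book_text, book_info):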
    chunks = []

    for chapter_num, chapter_text in parse_chapters(book_text):
        # Extract entities, themes, etc.
        characters = extract_characters(chapter_text)
        themes = classify_themes(chapter_text)
        spoiler_level = assess_spoiler_level(chapter_text, chapter_num)

        # Create chunks with rich metadata
        for paragraph in split_paragraphs(chapter_text):
            chunks.append({
                "text": paragraph,
                "metadata": {
                    "book_title": book_info["title"],
                    "chapter": chapter_num,
                    "characters": characters,
                    "themes": themes,
                    "spoiler_level": spoiler_level,
                    "word_count": len(paragraph.split()),
                    "reading_level": calculate_reading_level(paragraph)
                }
            })

    return chunks
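
For a version of this pattern that runs without any custom helpers, here is a minimal sketch using only the standard library. It assumes chapters begin with a line such as "Chapter 1" and that paragraphs are separated by blank lines. `chunk_book` is a hypothetical name, and the entity and theme extraction steps are left out.

```python
import re


def chunk_book(book_text, book_title):
    """Split a book into paragraph chunks tagged with chapter metadata."""
    chunks = []
    # Assumption: each chapter starts with a line such as "Chapter 1".
    parts = re.split(r"^Chapter (\d+)\s*$", book_text, flags=re.MULTILINE)
    # With one capture group, re.split yields [preamble, num, body, num, body, ...]
    for i in range(1, len(parts) - 1, 2):
        chapter_num = int(parts[i])
        body = parts[i + 1]
        for paragraph in body.split("\n\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            chunks.append({
                "text": paragraph,
                "metadata": {
                    "book_title": book_title,
                    "chapter": chapter_num,
                    "word_count": len(paragraph.split()),
                },
            })
    return chunks


sample = """Chapter 1
Alice was beginning to get very tired.

Down the rabbit hole she went.

Chapter 2
Curiouser and curiouser!
"""
for chunk in chunk_book(sample, "Alice's Adventures in Wonderland"):
    meta = chunk["metadata"]
    print(meta["chapter"], meta["word_count"], chunk["text"])
```

Each resulting dictionary has the same `text` / `metadata` shape as the chunks built above, so it can be passed to `builder.add_text(text=..., metadata=...)` in a loop.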

Performance Considerations

Efficient Filtering Strategies

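Post-search filtering: Filters are applied to the results of the vector search rather than during it, so the filtering cost scales with the number of candidates returned. This should be efficient for typical result sets (10-100 results).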
Metadata design: Keep metadata fields simple and avoid deeply nested structures.
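
One general consequence of post-search filtering is that a selective filter can leave fewer hits than requested unless more candidates are fetched first. The sketch below illustrates that trade-off with made-up scores; `filter_after_search` is a hypothetical helper, and nothing here describes how LEANN itself chooses the candidate count.

```python
def filter_after_search(candidates, predicate, top_k):
    """Keep the best top_k candidates whose metadata passes predicate.

    candidates: (score, metadata) pairs, already sorted best-first.
    """
    kept = [(score, meta) for score, meta in candidates if predicate(meta)]
    return kept[:top_k]


# Hypothetical candidates as a vector search might return them, best first.
candidates = [
    (0.91, {"chapter": 12}),
    (0.88, {"chapter": 2}),
    (0.80, {"chapter": 9}),
    (0.75, {"chapter": 4}),
    (0.60, {"chapter": 1}),
]


def is_early(meta):
    return meta["chapter"] <= 5


# Filtering only the top 3 candidates leaves a single hit...
print(len(filter_after_search(candidates[:3], is_early, top_k=3)))  # 1
# ...while over-fetching all 5 candidates fills the requested top_k.
print(len(filter_after_search(candidates, is_early, top_k=3)))      # 3
```

If filtered searches return fewer results than expected, requesting a larger `top_k` is a simple thing to try.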

Best Practices

Consistent metadata schema: Use consistent field names and value types across your documents.
Reasonable metadata size: Keep metadata reasonably sized to avoid storage overhead.
Type consistency: Use consistent data types for the same fields (e.g., always integers for chapter numbers).
Index multiple granularities: Consider chunking at different levels (paragraph, section, chapter) with appropriate metadata.
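
The schema and type-consistency practices above can be enforced with a small check before each chunk is added. This is a sketch with a hypothetical `SCHEMA` and `validate_metadata` helper; adapt the field names and types to your own documents.

```python
# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"chapter": int, "character": str, "themes": list, "word_count": int}


def validate_metadata(metadata, schema=SCHEMA):
    """Return a list of problems; an empty list means the metadata conforms."""
    problems = []
    for field, expected_type in schema.items():
        if field not in metadata:
            problems.append(f"missing field: {field}")
        elif not isinstance(metadata[field], expected_type):
            actual = type(metadata[field]).__name__
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )
    for field in metadata:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


good = {"chapter": 1, "character": "Alice", "themes": ["adventure"], "word_count": 150}
bad = {"chapter": "1", "character": "Alice", "themes": ["adventure"]}
print(validate_metadata(good))  # []
print(validate_metadata(bad))   # ['chapter: expected int, got str', 'missing field: word_count']
```

Catching a string chapter number such as `"1"` at build time matters because a filter like `{"chapter": {"<=": 5}}` compares against an integer.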

Adding Metadata to Existing Indices

To add metadata filtering to existing indices, you'll need to rebuild them with metadata:

from leann.api import LeannBuilder

# Read existing passages and add metadata
# existing_chunks (a list of {"text": ...} dicts) and extract_metadata
# are placeholders for your own data and logic.
def add_metadata_to_existing_chunks(chunks):
    for chunk in chunks:
        # Extract or assign metadata based on content
        chunk["metadata"] = extract_metadata(chunk["text"])
    return chunks

# Rebuild index with metadata
enhanced_chunks = add_metadata_to_existing_chunks(existing_chunks)
builder = LeannBuilder("hnsw")
for chunk in enhanced_chunks:
    builder.add_text(text=chunk["text"], metadata=chunk["metadata"])
builder.build_index("enhanced_index")
