# LEANN Metadata Filtering Usage Guide

## Overview

LEANN supports metadata filtering, which lets you restrict search results using arbitrary metadata fields set during chunking. This enables use cases such as spoiler-free book search, document filtering by date or type, and code search by file type.

## Basic Usage

### Adding Metadata to Your Documents

When building your index, add metadata to each text chunk:

```python
from leann.api import LeannBuilder

builder = LeannBuilder("hnsw")

# Add text with metadata
builder.add_text(
    text="Chapter 1: Alice falls down the rabbit hole",
    metadata={
        "chapter": 1,
        "character": "Alice",
        "themes": ["adventure", "curiosity"],
        "spoiler_level": "none",
        "word_count": 150
    }
)

builder.build_index("alice_in_wonderland_index")
```

### Searching with Metadata Filters

Use the `metadata_filters` parameter in search calls:

```python
from leann.api import LeannSearcher

searcher = LeannSearcher("alice_in_wonderland_index")

# Search with filters
results = searcher.search(
    query="What happens to Alice?",
    top_k=10,
    metadata_filters={
        "chapter": {"<=": 5},           # Only chapters 1-5
        "spoiler_level": {"!=": "high"} # No high spoilers
    }
)
```

## Filter Syntax

### Basic Structure

```python
metadata_filters = {
    "field_name": {"operator": value},
    "another_field": {"operator": value}
}
```

### Supported Operators

#### Comparison Operators
- `"=="`: Equal to
- `"!="`: Not equal to
- `"<"`: Less than
- `"<="`: Less than or equal
- `">"`: Greater than
- `">="`: Greater than or equal

```python
# Examples
{"chapter": {"==": 1}}           # Exactly chapter 1
{"page": {">": 100}}            # Pages after 100
{"rating": {">=": 4.0}}         # Rating 4.0 or higher
{"word_count": {"<": 500}}      # Short passages
```

#### Membership Operators
- `"in"`: Value is in list
- `"not_in"`: Value is not in list

```python
# Examples
{"character": {"in": ["Alice", "Bob"]}}      # Alice OR Bob
{"genre": {"not_in": ["horror", "thriller"]}} # Exclude genres
{"tags": {"in": ["fiction", "adventure"]}}   # Any of these tags
```

#### String Operators
- `"contains"`: String contains substring
- `"starts_with"`: String starts with prefix
- `"ends_with"`: String ends with suffix

```python
# Examples
{"title": {"contains": "alice"}}        # Title contains "alice"
{"filename": {"ends_with": ".py"}}      # Python files
{"author": {"starts_with": "Dr."}}      # Authors with "Dr." prefix
```

#### Boolean Operators
- `"is_true"`: Field is truthy
- `"is_false"`: Field is falsy

```python
# Examples
{"is_published": {"is_true": True}}     # Published content
{"is_draft": {"is_false": False}}       # Not drafts
```
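
Assuming `is_true` and `is_false` follow Python's own truthiness rules (an assumption; this guide only says "truthy" and "falsy"), non-empty strings such as `"false"` count as truthy. The snippet below uses only the built-in `bool` to show which stored values would pass; it makes no LEANN calls.

```python
# Plain Python truthiness; no LEANN calls involved.
samples = [True, False, 1, 0, "yes", "false", "", [], ["tag"], None]

# Map each sample's repr to its truthiness.
truthiness = {repr(value): bool(value) for value in samples}

for label, is_truthy in truthiness.items():
    print(f"{label:>8} -> {'truthy' if is_truthy else 'falsy'}")
```

Storing real booleans rather than strings like `"false"` avoids this surprise.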

### Multiple Operators on Same Field

You can apply multiple operators to the same field (AND logic):

```python
metadata_filters = {
    "word_count": {
        ">=": 100,    # At least 100 words
        "<=": 500     # At most 500 words
    }
}
```

### Compound Filters

Multiple fields are combined with AND logic:

```python
metadata_filters = {
    "chapter": {"<=": 10},              # Up to chapter 10
    "character": {"==": "Alice"},       # About Alice
    "spoiler_level": {"!=": "high"}     # No major spoilers
}
```
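
To make the AND semantics concrete, here is a minimal pure-Python sketch of how a filter dictionary could be evaluated against one chunk's metadata. It illustrates the syntax described above and is not LEANN's actual implementation: the `matches` function, the `OPERATORS` table, and the choice to treat a missing field as a non-match are all assumptions.

```python
import operator
from typing import Any, Callable, Dict

# Hypothetical operator table mirroring the operator names in this guide.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda field, expected: field in expected,
    "not_in": lambda field, expected: field not in expected,
    "contains": lambda field, expected: expected in str(field),
    "starts_with": lambda field, expected: str(field).startswith(expected),
    "ends_with": lambda field, expected: str(field).endswith(expected),
    "is_true": lambda field, _expected: bool(field),
    "is_false": lambda field, _expected: not bool(field),
}

_MISSING = object()  # Sentinel so that a stored None is not mistaken for "absent".


def matches(metadata: Dict[str, Any], filters: Dict[str, Dict[str, Any]]) -> bool:
    """Return True only if every operator on every field passes (AND logic)."""
    for field_name, conditions in filters.items():
        value = metadata.get(field_name, _MISSING)
        if value is _MISSING:
            return False  # Assumption: a missing field never matches.
        for op_name, expected in conditions.items():
            if not OPERATORS[op_name](value, expected):
                return False
    return True


chunk = {"chapter": 3, "character": "Alice", "spoiler_level": "low", "word_count": 150}
filters = {
    "chapter": {"<=": 10},
    "character": {"==": "Alice"},
    "spoiler_level": {"!=": "high"},
    "word_count": {">=": 100, "<=": 500},
}
print(matches(chunk, filters))                 # True
print(matches(chunk, {"chapter": {"<=": 2}}))  # False
```

A predicate like this is also handy in unit tests for your metadata pipeline, since it lets you check which chunks a filter would keep without building an index.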

## Use Case Examples

### 1. Spoiler-Free Book Search

```python
# Reader has only read up to chapter 5
def search_spoiler_free(query, max_chapter):
    return searcher.search(
        query=query,
        metadata_filters={
            "chapter": {"<=": max_chapter},
            "spoiler_level": {"in": ["none", "low"]}
        }
    )

results = search_spoiler_free("What happens to Alice?", max_chapter=5)
```

### 2. Document Management by Date

```python
# Find recent documents
recent_docs = searcher.search(
    query="project updates",
    metadata_filters={
        "date": {">=": "2024-01-01"},
        "document_type": {"==": "report"}
    }
)
```
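
The date filter above compares strings, which only orders correctly when every date is stored in zero-padded ISO 8601 form (`YYYY-MM-DD`). The sketch below uses only the standard library to show the pitfall and one way to normalize dates before passing them as metadata. `to_iso` is a hypothetical helper, not part of LEANN, and the `%m/%d/%Y` input format is just an example.

```python
from datetime import date, datetime


def to_iso(raw: str, fmt: str = "%m/%d/%Y") -> str:
    """Convert a date string in the given format to zero-padded YYYY-MM-DD."""
    return datetime.strptime(raw, fmt).date().isoformat()


# Zero-padded ISO strings sort the same way as the dates they represent.
assert "2024-09-01" < "2024-10-01"
assert date(2024, 9, 1) < date(2024, 10, 1)

# Without zero padding, string order and date order disagree.
assert "2024-9-1" > "2024-10-01"

print(to_iso("9/1/2024"))  # 2024-09-01
```

Normalizing once at indexing time keeps every later `">="` or `"<"` comparison on the `date` field consistent.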

### 3. Code Search by File Type

```python
# Search only Python files
python_code = searcher.search(
    query="authentication function",
    metadata_filters={
        "file_extension": {"==": ".py"},
        "lines_of_code": {"<": 100}
    }
)
```

### 4. Content Filtering by Audience

```python
# Age-appropriate content
family_content = searcher.search(
    query="adventure stories",
    metadata_filters={
        "age_rating": {"in": ["G", "PG"]},
        "content_warnings": {"not_in": ["violence", "adult_themes"]}
    }
)
```

### 5. Multi-Book Series Management

```python
# Search across first 3 books only
early_series = searcher.search(
    query="character development",
    metadata_filters={
        "series": {"==": "Harry Potter"},
        "book_number": {"<=": 3}
    }
)
```

## Running the Example

You can see metadata filtering in action with our spoiler-free book RAG example:

```bash
# Don't forget to set up the environment
uv venv
source .venv/bin/activate

# Set your OpenAI API key (required for embeddings; to use Ollama instead, edit the example locally)
export OPENAI_API_KEY="your-api-key-here"

# Run the spoiler-free book RAG example
uv run examples/spoiler_free_book_rag.py
```

This example demonstrates:
- Building an index with metadata (chapter numbers, characters, themes, locations)
- Searching with filters to avoid spoilers (e.g., only show results up to chapter 5)
- Different scenarios for readers at various points in the book

The example uses Alice's Adventures in Wonderland as sample data and shows how you can search for information without revealing plot points from later chapters.

## Advanced Patterns

### Custom Chunking with Metadata

```python
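def chunk_book_with_metadata(book_text, book_info):
    # parse_chapters, extract_characters, classify_themes, assess_spoiler_level,
    # split_paragraphs, and calculate_reading_level are placeholders for your own helpers.
    chunks = []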

    for chapter_num, chapter_text in parse_chapters(book_text):
        # Extract entities, themes, etc.
        characters = extract_characters(chapter_text)
        themes = classify_themes(chapter_text)
        spoiler_level = assess_spoiler_level(chapter_text, chapter_num)

        # Create chunks with rich metadata
        for paragraph in split_paragraphs(chapter_text):
            chunks.append({
                "text": paragraph,
                "metadata": {
                    "book_title": book_info["title"],
                    "chapter": chapter_num,
                    "characters": characters,
                    "themes": themes,
                    "spoiler_level": spoiler_level,
                    "word_count": len(paragraph.split()),
                    "reading_level": calculate_reading_level(paragraph)
                }
            })

    return chunks
```
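
The helper functions used in the block above (`parse_chapters`, `extract_characters`, and so on) are left for you to implement. As a starting point, here is a minimal runnable sketch that covers only the chapter-splitting and word-count parts, using the standard library. The function name `chunk_by_chapter` and the assumption that each chapter begins with a line like `Chapter 3` are illustrative, not part of LEANN.

```python
import re
from typing import Dict, List

# Assumption: chapters start with a line such as "Chapter 3" or "Chapter 3: Title".
CHAPTER_HEADING = re.compile(r"^Chapter\s+(\d+)\b.*$", re.MULTILINE)


def chunk_by_chapter(book_text: str, book_title: str) -> List[Dict]:
    """Split text on 'Chapter N' headings, then into paragraphs with metadata."""
    chunks = []
    headings = list(CHAPTER_HEADING.finditer(book_text))
    for index, heading in enumerate(headings):
        # A chapter body runs from the end of its heading to the next heading.
        end = headings[index + 1].start() if index + 1 < len(headings) else len(book_text)
        body = book_text[heading.end():end]
        for paragraph in re.split(r"\n\s*\n", body):
            paragraph = paragraph.strip()
            if not paragraph:
                continue  # Skip empty pieces left by leading or trailing blank lines.
            chunks.append({
                "text": paragraph,
                "metadata": {
                    "book_title": book_title,
                    "chapter": int(heading.group(1)),
                    "word_count": len(paragraph.split()),
                },
            })
    return chunks


sample = "Chapter 1\n\nAlice was tired.\n\nShe saw a rabbit.\n\nChapter 2\n\nThe pool of tears."
for chunk in chunk_by_chapter(sample, "Alice's Adventures in Wonderland"):
    print(chunk["metadata"]["chapter"], chunk["metadata"]["word_count"], chunk["text"])
```

Each returned dictionary has the same `text` and `metadata` keys as the chunks above, so it can be passed to `builder.add_text(text=..., metadata=...)` in a loop.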

## Performance Considerations

### Efficient Filtering Strategies

1. **Post-search filtering**: Filters are applied after the vector search, so the filtering cost depends on the number of results returned (typically 10-100), not on the size of the index.

2. **Metadata design**: Keep metadata fields simple and avoid deeply nested structures.
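
Because filtering happens after the vector search, a restrictive filter can leave fewer results than `top_k` requested. One common workaround, sketched below in plain Python, is to request more candidates than you need and truncate after filtering. The `Result` type, the `post_filter` function, and the `overfetch` factor are hypothetical names for illustration; whether LEANN over-fetches internally is not stated in this guide.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Result:
    """Stand-in for a search hit: a similarity score plus chunk metadata."""
    score: float
    metadata: Dict[str, Any] = field(default_factory=dict)


def post_filter(
    candidates: List[Result],
    predicate: Callable[[Dict[str, Any]], bool],
    top_k: int,
) -> List[Result]:
    """Keep candidates whose metadata passes the predicate, best score first."""
    kept = [hit for hit in candidates if predicate(hit.metadata)]
    kept.sort(key=lambda hit: hit.score, reverse=True)
    return kept[:top_k]


# Pretend the vector search returned top_k * overfetch candidates.
top_k, overfetch = 2, 3
candidates = [
    Result(score=1.0 - i * 0.1, metadata={"chapter": i + 1})
    for i in range(top_k * overfetch)
]

later = post_filter(candidates, lambda meta: meta["chapter"] >= 3, top_k)
print([hit.metadata["chapter"] for hit in later])  # [3, 4]
```

With only `top_k` candidates (chapters 1 and 2 here), the same filter would have returned nothing, which is the case over-fetching is meant to avoid.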

### Best Practices

1. **Consistent metadata schema**: Use consistent field names and value types across your documents.

2. **Reasonable metadata size**: Metadata is attached to every chunk, so keep values small (numbers, short strings, short lists) to limit storage overhead.

3. **Type consistency**: Use consistent data types for the same fields (e.g., always integers for chapter numbers).

4. **Index multiple granularities**: Consider chunking at different levels (paragraph, section, chapter) with appropriate metadata.
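
One way to enforce a consistent schema is to validate each metadata dictionary before handing it to `add_text`. The sketch below is a hypothetical helper in plain Python; the `SCHEMA` mapping and the `validate_metadata` function are assumptions for illustration and not part of LEANN.

```python
from typing import Any, Dict, List

# Hypothetical schema: field name -> expected Python type.
SCHEMA: Dict[str, type] = {
    "chapter": int,
    "character": str,
    "themes": list,
    "word_count": int,
}


def validate_metadata(metadata: Dict[str, Any], schema: Dict[str, type] = SCHEMA) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, value in metadata.items():
        expected = schema.get(name)
        if expected is None:
            problems.append(f"unknown field: {name}")
        elif isinstance(value, bool) and expected is not bool:
            # bool is a subclass of int, so reject it explicitly for non-bool fields.
            problems.append(f"{name}: expected {expected.__name__}, got bool")
        elif not isinstance(value, expected):
            problems.append(f"{name}: expected {expected.__name__}, got {type(value).__name__}")
    return problems


print(validate_metadata({"chapter": 1, "character": "Alice"}))  # []
print(validate_metadata({"chapter": "1"}))  # ['chapter: expected int, got str']
```

Catching a string `"1"` where an integer chapter is expected matters because comparison filters such as `{"chapter": {"<=": 5}}` rely on the stored value having a comparable type.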

### Adding Metadata to Existing Indices

To add metadata filtering to existing indices, you'll need to rebuild them with metadata:

```python
from leann.api import LeannBuilder

# Attach metadata to each existing passage.
# `existing_chunks` (a list of {"text": ...} dicts) and `extract_metadata`
# are placeholders for your own data and logic.
def add_metadata_to_existing_chunks(chunks):
    for chunk in chunks:
        # Extract or assign metadata based on content
        chunk["metadata"] = extract_metadata(chunk["text"])
    return chunks

# Rebuild index with metadata
enhanced_chunks = add_metadata_to_existing_chunks(existing_chunks)
builder = LeannBuilder("hnsw")
for chunk in enhanced_chunks:
    builder.add_text(text=chunk["text"], metadata=chunk["metadata"])
builder.build_index("enhanced_index")
```