Normalized Embeddings Support in LEANN

LEANN now automatically detects normalized embedding models and sets the appropriate distance metric for optimal performance.

What are Normalized Embeddings?

Normalized embeddings are vectors with L2 norm = 1 (unit vectors). These embeddings are optimized for cosine similarity rather than Maximum Inner Product Search (MIPS).
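
A quick numpy sketch of the definition: once vectors are L2-normalized, their inner product and their cosine similarity coincide, which is why cosine is the natural metric for these models.

import numpy as np

# Any vector can be L2-normalized into a unit vector
v = np.array([3.0, 4.0])
unit_v = v / np.linalg.norm(v)
print(np.linalg.norm(unit_v))  # 1.0

# For unit vectors, cosine similarity and inner product give the same score
q = np.array([1.0, 2.0])
unit_q = q / np.linalg.norm(q)
cosine = np.dot(unit_q, unit_v) / (np.linalg.norm(unit_q) * np.linalg.norm(unit_v))
inner = np.dot(unit_q, unit_v)
print(cosine, inner)  # identical values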

Automatic Detection

When you create a LeannBuilder instance with a normalized embedding model, LEANN will:

  1. Automatically set distance_metric="cosine" if not specified
  2. Show a warning if you manually specify a different distance metric
  3. Provide optimal search performance with the correct metric
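
Conceptually, the detection amounts to checking the model name against the families of known-normalized models listed below and defaulting the metric accordingly. A hypothetical sketch of that idea (not LEANN's actual internal implementation):

# Hypothetical sketch only -- not LEANN's internal code.
NORMALIZED_MODELS = {
    "text-embedding-ada-002", "text-embedding-3-small", "text-embedding-3-large",
    "voyage-2", "voyage-3", "voyage-large-2", "voyage-multilingual-2", "voyage-code-2",
    "embed-english-v3.0", "embed-multilingual-v3.0",
    "embed-english-light-v3.0", "embed-multilingual-light-v3.0",
}

def default_distance_metric(embedding_model: str) -> str:
    return "cosine" if embedding_model in NORMALIZED_MODELS else "mips"

print(default_distance_metric("text-embedding-3-small"))  # cosine
print(default_distance_metric("facebook/contriever"))     # mips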

Supported Normalized Embedding Models

OpenAI

All OpenAI text embedding models are normalized:

  • text-embedding-ada-002
  • text-embedding-3-small
  • text-embedding-3-large

Voyage AI

All Voyage AI embedding models are normalized:

  • voyage-2
  • voyage-3
  • voyage-large-2
  • voyage-multilingual-2
  • voyage-code-2

Cohere

All Cohere embedding models are normalized:

  • embed-english-v3.0
  • embed-multilingual-v3.0
  • embed-english-light-v3.0
  • embed-multilingual-light-v3.0

Example Usage

from leann.api import LeannBuilder

# Automatic detection - will use cosine distance
builder = LeannBuilder(
    backend_name="hnsw",
    embedding_model="text-embedding-3-small",
    embedding_mode="openai"
)
# Warning: Detected normalized embeddings model 'text-embedding-3-small'...
# Automatically setting distance_metric='cosine'

# Manual override (not recommended)
builder = LeannBuilder(
    backend_name="hnsw",
    embedding_model="text-embedding-3-small",
    embedding_mode="openai",
    distance_metric="mips"  # Will show warning
)
# Warning: Using 'mips' distance metric with normalized embeddings...

Non-Normalized Embeddings

Models that do not produce normalized embeddings, such as facebook/contriever and many other sentence-transformers models, continue to use MIPS by default, which is the appropriate metric for them.
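
If you are unsure whether a model produces normalized vectors, you can check the L2 norms of a few embeddings yourself. A rough sketch, assuming sentence-transformers is installed (any embedding function works the same way):

import numpy as np
from sentence_transformers import SentenceTransformer

# facebook/contriever is mean-pooled and does not normalize its outputs
model = SentenceTransformer("facebook/contriever")
emb = model.encode(["How tall is the Eiffel Tower?"])
print(np.linalg.norm(emb, axis=1))  # not ~1.0, so MIPS is the right default here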

Why This Matters

Using the wrong distance metric with normalized embeddings can lead to:

  • Poor search quality due to HNSW's early termination with narrow score ranges
  • Incorrect ranking of search results
  • Suboptimal performance compared to using the correct metric

For more details on why this happens, see our analysis of OpenAI embeddings with MIPS.
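
A small numpy illustration of the "narrow score range" point above: once vectors are L2-normalized, every inner product is bounded by [-1, 1], so MIPS scores are tightly compressed compared with non-normalized vectors.

import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 128))
query = rng.normal(size=128)

# Raw (non-normalized) vectors: inner-product scores span a wide range
raw_scores = docs @ query
print(raw_scores.min(), raw_scores.max())

# Normalized vectors: every score is squeezed into [-1, 1], the narrow range
# that interacts badly with HNSW's early termination under MIPS
docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)
norm_scores = docs_n @ query_n
print(norm_scores.min(), norm_scores.max())  # bounded by [-1, 1]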