- Add automatic detection for normalized embedding models (OpenAI, Voyage AI, Cohere)
- Automatically set distance_metric='cosine' for normalized embeddings
- Add warnings when using non-optimal distance metrics
- Implement manual L2 normalization in HNSW backend (custom Faiss build lacks normalize_L2)
- Fix DiskANN zmq_port compatibility with lazy loading strategy
- Add documentation for normalized embeddings feature

This fixes the low accuracy issue when using OpenAI text-embedding-3-small model with default MIPS metric.
Normalized Embeddings Support in LEANN
LEANN now automatically detects normalized embedding models and sets the appropriate distance metric for optimal performance.
What are Normalized Embeddings?
Normalized embeddings are vectors with L2 norm = 1 (unit vectors). These embeddings are optimized for cosine similarity rather than Maximum Inner Product Search (MIPS).
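To make the definition concrete, here is a minimal sketch of L2 normalization in plain Python. The helper names (l2_normalize, dot, cosine) are illustrative only and are not part of LEANN's API. It shows that a normalized vector has norm 1, and that the inner product of two unit vectors equals the cosine similarity of the original vectors.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Return vec scaled to unit L2 norm; eps guards against a zero vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
ua, ub = l2_normalize(a), l2_normalize(b)

unit_norm = math.sqrt(dot(ua, ua))  # 1.0 up to float rounding
inner_of_units = dot(ua, ub)        # agrees with cosine(a, b) up to rounding
```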
Automatic Detection
When you create a LeannBuilder instance with a normalized embedding model, LEANN will:
- Automatically set distance_metric="cosine" if not specified
- Show a warning if you manually specify a different distance metric
- Provide optimal search performance with the correct metric
Supported Normalized Embedding Models
OpenAI
All OpenAI text embedding models are normalized:
- text-embedding-ada-002
- text-embedding-3-small
- text-embedding-3-large
Voyage AI
All Voyage AI embedding models are normalized:
- voyage-2
- voyage-3
- voyage-large-2
- voyage-multilingual-2
- voyage-code-2
Cohere
All Cohere embedding models are normalized:
- embed-english-v3.0
- embed-multilingual-v3.0
- embed-english-light-v3.0
- embed-multilingual-light-v3.0
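The detection behavior described above can be sketched as a lookup against the model lists. This is an illustration under assumptions, not LEANN's actual implementation: NORMALIZED_MODELS and resolve_distance_metric are hypothetical names, and for brevity the sketch warns only on a manual override, not when it picks cosine automatically.

```python
import warnings

# Hypothetical registry built from the model lists above; LEANN's internal
# table may be organized differently.
NORMALIZED_MODELS = {
    "openai": {
        "text-embedding-ada-002",
        "text-embedding-3-small",
        "text-embedding-3-large",
    },
    "voyage": {
        "voyage-2",
        "voyage-3",
        "voyage-large-2",
        "voyage-multilingual-2",
        "voyage-code-2",
    },
    "cohere": {
        "embed-english-v3.0",
        "embed-multilingual-v3.0",
        "embed-english-light-v3.0",
        "embed-multilingual-light-v3.0",
    },
}

def resolve_distance_metric(model_name, requested=None):
    """Pick a distance metric for model_name (hypothetical helper).

    Known normalized models default to "cosine"; everything else
    defaults to "mips". An explicit choice is always respected.
    """
    is_normalized = any(model_name in names for names in NORMALIZED_MODELS.values())
    if not is_normalized:
        return requested or "mips"
    if requested is None:
        return "cosine"
    if requested != "cosine":
        warnings.warn(
            f"Using '{requested}' distance metric with normalized embeddings "
            f"model '{model_name}'; 'cosine' is recommended."
        )
    return requested
```

With this sketch, resolve_distance_metric("text-embedding-3-small") returns "cosine", while passing requested="mips" for the same model returns "mips" and emits a warning.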
Example Usage
from leann.api import LeannBuilder

# Automatic detection - will use cosine distance
builder = LeannBuilder(
    backend_name="hnsw",
    embedding_model="text-embedding-3-small",
    embedding_mode="openai"
)
# Warning: Detected normalized embeddings model 'text-embedding-3-small'...
# Automatically setting distance_metric='cosine'

# Manual override (not recommended)
builder = LeannBuilder(
    backend_name="hnsw",
    embedding_model="text-embedding-3-small",
    embedding_mode="openai",
    distance_metric="mips"  # Will show warning
)
# Warning: Using 'mips' distance metric with normalized embeddings...
Non-Normalized Embeddings
Models that do not produce normalized embeddings, such as facebook/contriever and other non-normalized sentence-transformers models, continue to use MIPS by default, which is optimal for them.
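If you are unsure whether a model outside the lists above produces normalized embeddings, one way to check is to compute the norms of a few sample vectors. The sketch below assumes you already have some embeddings as lists of floats; looks_normalized and its tolerance are hypothetical and not part of LEANN.

```python
import math

def looks_normalized(vectors, tol=1e-3):
    """Return True if every vector has an L2 norm within tol of 1.

    Hypothetical helper; an empty input trivially returns True.
    """
    return all(
        abs(math.sqrt(sum(x * x for x in v)) - 1.0) <= tol
        for v in vectors
    )

unit_like = [[0.6, 0.8], [1.0, 0.0]]  # norms are 1
raw_like = [[3.0, 4.0], [0.5, 0.5]]   # norms are 5 and about 0.707

print(looks_normalized(unit_like))
print(looks_normalized(raw_like))
```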
Why This Matters
Using the wrong distance metric with normalized embeddings can lead to:
- Poor search quality due to HNSW's early termination with narrow score ranges
- Incorrect ranking of search results
- Suboptimal performance compared to using the correct metric
For more details on why this happens, see our analysis of OpenAI embeddings with MIPS.