* Add prompt template support and LM Studio SDK integration

  Features:
  - Prompt template support for embedding models (via --embedding-prompt-template)
  - LM Studio SDK integration for automatic context length detection
  - Hybrid token limit discovery (Ollama → LM Studio → Registry → Default)
  - Client-side token truncation to prevent silent failures
  - Automatic persistence of embedding_options to .meta.json

  Implementation:
  - Added _query_lmstudio_context_limit() with Node.js subprocess bridge
  - Modified compute_embeddings_openai() to apply prompt templates before truncation
  - Extended CLI with --embedding-prompt-template flag for build and search
  - URL detection for LM Studio (port 1234 or lmstudio/lm.studio keywords)
  - HTTP→WebSocket URL conversion for SDK compatibility

  Tests:
  - 60 passing tests across 5 test files
  - Comprehensive coverage of prompt templates, LM Studio integration, and token handling
  - Parametrized tests for maintainability and clarity

* Add integration tests and fix LM Studio SDK bridge

  Features:
  - End-to-end integration tests for prompt template with EmbeddingGemma
  - Integration tests for hybrid token limit discovery mechanism
  - Tests verify real-world functionality with live services (LM Studio, Ollama)

  Fixes:
  - LM Studio SDK bridge now uses client.embedding.load() for embedding models
  - Fixed NODE_PATH resolution to include npm global modules
  - Fixed integration test to use WebSocket URL (ws://) for SDK bridge

  Tests:
  - test_prompt_template_e2e.py: 8 integration tests covering:
    - Prompt template prepending with LM Studio (EmbeddingGemma)
    - LM Studio SDK bridge for context length detection
    - Ollama dynamic token limit detection
    - Hybrid discovery fallback mechanism (registry, default)
  - All tests marked with @pytest.mark.integration for selective execution
  - Tests gracefully skip when services are unavailable

  Documentation:
  - Updated tests/README.md with integration test section
  - Added prerequisites and running instructions
  - Documented that prompt templates are ONLY for EmbeddingGemma
  - Added integration marker to pyproject.toml

  Test Results:
  - All 8 integration tests passing with live services
  - Confirmed prompt templates work correctly with EmbeddingGemma
  - Verified LM Studio SDK bridge auto-detects context length (2048)
  - Validated hybrid token limit discovery across all backends

* Add prompt template support to Ollama mode

  Extends prompt template functionality from OpenAI mode to Ollama for backend consistency.

  Changes:
  - Add provider_options parameter to compute_embeddings_ollama()
  - Apply prompt template before token truncation (lines 1005-1011)
  - Pass provider_options through the compute_embeddings() call chain

  Tests:
  - test_ollama_embedding_with_prompt_template: verifies templates work with Ollama
  - test_ollama_prompt_template_affects_embeddings: confirms embeddings differ with/without a template
  - Both tests pass with live Ollama service (2/2 passing)

  Usage:
  leann build --embedding-mode ollama --embedding-prompt-template "query: " ...

* Fix LM Studio SDK bridge to respect JIT auto-evict settings

  Problem: the SDK bridge called client.embedding.load(), which loaded models into LM Studio memory and bypassed JIT auto-evict settings, causing duplicate model instances to accumulate.

  Root cause analysis (from Perplexity research):
  - Explicit SDK load() commands are treated as "pinned" models
  - JIT auto-evict only applies to models loaded reactively via API requests
  - SDK-loaded models remain in memory until explicitly unloaded

  Solutions implemented:
  1. Add model.unload() after the metadata query (line 243)
     - Load the model temporarily to get its context length
     - Unload immediately to hand control back to the JIT system
     - Subsequent API requests trigger a JIT load with auto-evict
  2. Add token limit caching to prevent repeated SDK calls
     - Cache discovered limits in the _token_limit_cache dict (line 48)
     - Key: (model_name, base_url); Value: token_limit
     - Prevents duplicate load/unload cycles within the same process
     - Cache shared across all discovery methods (Ollama, SDK, registry)

  Tests:
  - TestTokenLimitCaching: 5 tests for cache behavior (integrated into test_token_truncation.py)
  - Manual testing confirmed no duplicate models in LM Studio after the fix
  - All existing tests pass

  Impact:
  - Respects the user's LM Studio JIT and auto-evict settings
  - Reduces model memory footprint
  - Faster subsequent builds (cached limits)

* Document prompt template and LM Studio SDK features

  Added comprehensive documentation for the new optional embedding features.

  Configuration Guide (docs/configuration-guide.md):
  - New section: "Optional Embedding Features"
  - Task-Specific Prompt Templates subsection:
    - Explains the EmbeddingGemma use case with document/query prompts
    - CLI and Python API examples
    - Clear warnings about compatible vs incompatible models
    - References to GitHub issue #155 and the HuggingFace blog
  - LM Studio Auto-Detection subsection:
    - Prerequisites (Node.js + @lmstudio/sdk)
    - How auto-detection works (4-step process)
    - Benefits and optional nature clearly stated

  FAQ (docs/faq.md):
  - FAQ #2: When should I use prompt templates?
    - DO/DON'T guidance with examples
    - Links to the detailed configuration guide
  - FAQ #3: Why is LM Studio loading multiple copies?
    - Explains the JIT auto-evict fix
    - Troubleshooting steps if still seeing issues
  - FAQ #4: Do I need Node.js and @lmstudio/sdk?
    - Clarifies that it is completely optional
    - Lists benefits if installed
    - Installation instructions

  Cross-references between documents for easy navigation between quick reference and detailed guides.

* Add separate build/query template support for task-specific models

  Task-specific models like EmbeddingGemma require different templates for indexing vs searching. Store both templates at build time and auto-apply the query template during search, with backward compatibility.

* Consolidate prompt template tests from 44 to 37 tests

  Merged redundant no-op tests, removed low-value implementation tests, consolidated parameterized CLI tests, and removed a hanging over-mocked test. All tests pass, with improved focus on behavioral testing.

* Fix query template application in compute_query_embedding

  Query templates were only applied in the fallback code path, not when using the embedding server (the default path). This meant stored query templates in index metadata were ignored during MCP and CLI searches.

  Changes:
  - Move template application to before any computation path (searcher_base.py:109-110)
  - Add comprehensive tests for both the server and fallback paths
  - Consolidate tests into test_prompt_template_persistence.py

  Tests verify:
  - Template applied when using the embedding server
  - Template applied in the fallback path
  - Consistent behavior between both paths

* Apply ruff formatting and fix linting issues

  - Remove unused imports
  - Fix import ordering
  - Remove unused variables
  - Apply code formatting

* Fix CI test failures: mock OPENAI_API_KEY in tests

  Tests were failing in CI because compute_embeddings_openai() checks for OPENAI_API_KEY before using the mocked client. Added monkeypatch to set a fake API key in the test fixture.
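The "template before truncation" ordering described above matters because the template's tokens count against the model's context limit. The sketch below illustrates that ordering only; it assumes a tiktoken-style tokenizer and uses hypothetical helper names, not the actual leann internals inside compute_embeddings_openai().

# Sketch only: apply the prompt template, then truncate client-side so
# over-long inputs are cut deterministically instead of failing silently
# server-side. Helper names are hypothetical.
import tiktoken  # assumption: a tiktoken-compatible tokenizer is available


def apply_prompt_template(texts: list[str], template: str | None) -> list[str]:
    """Prepend a task-specific prefix such as "search_query: " to each text."""
    return [template + t for t in texts] if template else texts


def truncate_to_token_limit(texts: list[str], token_limit: int) -> list[str]:
    """Client-side truncation against the discovered token limit."""
    enc = tiktoken.get_encoding("cl100k_base")
    return [enc.decode(enc.encode(text)[:token_limit]) for text in texts]


# Template first, truncation second, so the prefix is never silently dropped.
prepared = truncate_to_token_limit(
    apply_prompt_template(["machine learning"], "search_query: "), token_limit=2048
)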
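The hybrid token limit discovery and the (model_name, base_url) cache can be pictured as below. This is a sketch under assumptions: the real get_model_token_limit() lives in leann.embedding_compute, the lookup bodies here are placeholders (the real code queries Ollama's /api/show and the Node.js @lmstudio/sdk bridge), the registry entry shown is only the value the tests assert, and the default of 512 mirrors what the integration test passes explicitly.

# Sketch of the discovery order: Ollama -> LM Studio SDK -> registry -> default,
# with results cached per (model_name, base_url) to avoid repeated SDK
# load/unload cycles. Placeholder bodies, not the leann implementation.
_token_limit_cache: dict[tuple[str, str], int] = {}
_REGISTRY = {"text-embedding-3-small": 8192}  # known OpenAI models (illustrative subset)


def _query_ollama_context_limit(model_name: str, base_url: str) -> int | None:
    """Placeholder for the dynamic Ollama /api/show lookup."""
    return None


def _query_lmstudio_context_limit(model_name: str, base_url: str) -> int | None:
    """Placeholder for the Node.js @lmstudio/sdk bridge (load, read context length, unload)."""
    return None


def get_model_token_limit(model_name: str, base_url: str, default: int = 512) -> int:
    key = (model_name, base_url)
    if key in _token_limit_cache:
        return _token_limit_cache[key]  # prevents duplicate SDK load/unload cycles
    limit = (
        _query_ollama_context_limit(model_name, base_url)
        or _query_lmstudio_context_limit(model_name, base_url)
        or _REGISTRY.get(model_name)
        or default
    )
    _token_limit_cache[key] = limit
    return limit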
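The URL handling for the SDK bridge mentioned above (detect an LM Studio endpoint via port 1234 or an lmstudio/lm.studio keyword, then convert the HTTP base URL to the WebSocket form the SDK expects) can be sketched as follows; the helper names are illustrative, not leann's.

# Sketch: LM Studio endpoint detection and HTTP -> WebSocket URL conversion.
from urllib.parse import urlparse


def is_lmstudio_url(base_url: str) -> bool:
    """Heuristic: port 1234 or an lmstudio / lm.studio keyword in the URL."""
    lowered = base_url.lower()
    return urlparse(base_url).port == 1234 or "lmstudio" in lowered or "lm.studio" in lowered


def to_websocket_url(base_url: str) -> str:
    """e.g. http://localhost:1234/v1 -> ws://localhost:1234 (drop the /v1 path)."""
    parsed = urlparse(base_url)
    scheme = "wss" if parsed.scheme == "https" else "ws"
    return f"{scheme}://{parsed.netloc}"


assert is_lmstudio_url("http://localhost:1234/v1")
assert to_websocket_url("http://localhost:1234/v1") == "ws://localhost:1234"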
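The build/query template split works by persisting both templates with the index metadata and applying the query template once, before either computation path (embedding server or local fallback) runs. A sketch of that flow: the .meta.json location and the embedding_options key come from the commits above, but the nested key names and function names here are assumptions for illustration.

# Sketch: persist both templates at build time, apply the query template at search time.
import json
from pathlib import Path


def save_embedding_options(index_dir: Path, build_template: str, query_template: str) -> None:
    meta_path = index_dir / ".meta.json"
    meta = json.loads(meta_path.read_text()) if meta_path.exists() else {}
    meta["embedding_options"] = {
        "build_prompt_template": build_template,   # assumed key name
        "query_prompt_template": query_template,   # assumed key name
    }
    meta_path.write_text(json.dumps(meta, indent=2))


def prepare_query(index_dir: Path, query: str) -> str:
    meta = json.loads((index_dir / ".meta.json").read_text())
    template = meta.get("embedding_options", {}).get("query_prompt_template", "")
    # Applied before any computation path, so server and fallback behave the same.
    return template + query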
tests/test_prompt_template_e2e.py — 401 lines, 15 KiB, Python
"""End-to-end integration tests for prompt template and token limit features.
|
|
|
|
These tests verify real-world functionality with live services:
|
|
- OpenAI-compatible APIs (OpenAI, LM Studio) with prompt template support
|
|
- Ollama with dynamic token limit detection
|
|
- Hybrid token limit discovery mechanism
|
|
|
|
Run with: pytest tests/test_prompt_template_e2e.py -v -s
|
|
Skip if services unavailable: pytest tests/test_prompt_template_e2e.py -m "not integration"
|
|
|
|
Prerequisites:
|
|
1. LM Studio running with embedding model: http://localhost:1234
|
|
2. [Optional] Ollama running: ollama serve
|
|
3. [Optional] Ollama model: ollama pull nomic-embed-text
|
|
4. [Optional] Node.js + @lmstudio/sdk for context length detection
|
|
"""
|
|
|
|
import logging
|
|
import socket
|
|
|
|
import numpy as np
|
|
import pytest
|
|
import requests
|
|
from leann.embedding_compute import (
|
|
compute_embeddings_ollama,
|
|
compute_embeddings_openai,
|
|
get_model_token_limit,
|
|
)
|
|
|
|
# Test markers for conditional execution
|
|
pytestmark = pytest.mark.integration
|
|
|
|
logger = logging.getLogger(__name__)
|
|
|
|
|
|
def check_service_available(host: str, port: int, timeout: float = 2.0) -> bool:
|
|
"""Check if a service is available on the given host:port."""
|
|
try:
|
|
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
|
|
sock.settimeout(timeout)
|
|
result = sock.connect_ex((host, port))
|
|
sock.close()
|
|
return result == 0
|
|
except Exception:
|
|
return False
|
|
|
|
|
|
def check_ollama_available() -> bool:
|
|
"""Check if Ollama service is available."""
|
|
if not check_service_available("localhost", 11434):
|
|
return False
|
|
try:
|
|
response = requests.get("http://localhost:11434/api/tags", timeout=2.0)
|
|
return response.status_code == 200
|
|
except Exception:
|
|
return False
|
|
|
|
|
|
def check_lmstudio_available() -> bool:
|
|
"""Check if LM Studio service is available."""
|
|
if not check_service_available("localhost", 1234):
|
|
return False
|
|
try:
|
|
response = requests.get("http://localhost:1234/v1/models", timeout=2.0)
|
|
return response.status_code == 200
|
|
except Exception:
|
|
return False
|
|
|
|
|
|
def get_lmstudio_first_model() -> str:
|
|
"""Get the first available model from LM Studio."""
|
|
try:
|
|
response = requests.get("http://localhost:1234/v1/models", timeout=5.0)
|
|
data = response.json()
|
|
models = data.get("data", [])
|
|
if models:
|
|
return models[0]["id"]
|
|
except Exception:
|
|
pass
|
|
return None
|
|
|
|
|
|
class TestPromptTemplateOpenAI:
    """End-to-end tests for prompt template with OpenAI-compatible APIs (LM Studio)."""

    @pytest.mark.skipif(
        not check_lmstudio_available(), reason="LM Studio service not available on localhost:1234"
    )
    def test_lmstudio_embedding_with_prompt_template(self):
        """Test prompt templates with LM Studio using OpenAI-compatible API."""
        model_name = get_lmstudio_first_model()
        if not model_name:
            pytest.skip("No models loaded in LM Studio")

        texts = ["artificial intelligence", "machine learning"]
        prompt_template = "search_query: "

        # Get embeddings with prompt template via provider_options
        provider_options = {"prompt_template": prompt_template}
        embeddings = compute_embeddings_openai(
            texts=texts,
            model_name=model_name,
            base_url="http://localhost:1234/v1",
            api_key="lm-studio",  # LM Studio doesn't require real key
            provider_options=provider_options,
        )

        assert embeddings is not None
        assert len(embeddings) == 2
        assert all(isinstance(emb, np.ndarray) for emb in embeddings)
        assert all(len(emb) > 0 for emb in embeddings)

        logger.info(
            f"✓ LM Studio embeddings with prompt template: {len(embeddings)} vectors, {len(embeddings[0])} dimensions"
        )

    @pytest.mark.skipif(not check_lmstudio_available(), reason="LM Studio service not available")
    def test_lmstudio_prompt_template_affects_embeddings(self):
        """Verify that prompt templates actually change embedding values."""
        model_name = get_lmstudio_first_model()
        if not model_name:
            pytest.skip("No models loaded in LM Studio")

        text = "machine learning"
        base_url = "http://localhost:1234/v1"
        api_key = "lm-studio"

        # Get embeddings without template
        embeddings_no_template = compute_embeddings_openai(
            texts=[text],
            model_name=model_name,
            base_url=base_url,
            api_key=api_key,
            provider_options={},
        )

        # Get embeddings with template
        embeddings_with_template = compute_embeddings_openai(
            texts=[text],
            model_name=model_name,
            base_url=base_url,
            api_key=api_key,
            provider_options={"prompt_template": "search_query: "},
        )

        # Embeddings should be different when template is applied
        assert not np.allclose(embeddings_no_template[0], embeddings_with_template[0])

        logger.info("✓ Prompt template changes embedding values as expected")


class TestPromptTemplateOllama:
    """End-to-end tests for prompt template with Ollama."""

    @pytest.mark.skipif(
        not check_ollama_available(), reason="Ollama service not available on localhost:11434"
    )
    def test_ollama_embedding_with_prompt_template(self):
        """Test prompt templates with Ollama using any available embedding model."""
        # Get any available embedding model
        try:
            response = requests.get("http://localhost:11434/api/tags", timeout=2.0)
            models = response.json().get("models", [])

            embedding_models = []
            for model in models:
                name = model["name"]
                base_name = name.split(":")[0]
                if any(emb in base_name for emb in ["embed", "bge", "minilm", "e5", "nomic"]):
                    embedding_models.append(name)

            if not embedding_models:
                pytest.skip("No embedding models available in Ollama")

            model_name = embedding_models[0]

            texts = ["artificial intelligence", "machine learning"]
            prompt_template = "search_query: "

            # Get embeddings with prompt template via provider_options
            provider_options = {"prompt_template": prompt_template}
            embeddings = compute_embeddings_ollama(
                texts=texts,
                model_name=model_name,
                is_build=False,
                host="http://localhost:11434",
                provider_options=provider_options,
            )

            assert embeddings is not None
            assert len(embeddings) == 2
            assert all(isinstance(emb, np.ndarray) for emb in embeddings)
            assert all(len(emb) > 0 for emb in embeddings)

            logger.info(
                f"✓ Ollama embeddings with prompt template: {len(embeddings)} vectors, {len(embeddings[0])} dimensions"
            )

        except Exception as e:
            pytest.skip(f"Could not test Ollama prompt template: {e}")

    @pytest.mark.skipif(not check_ollama_available(), reason="Ollama service not available")
    def test_ollama_prompt_template_affects_embeddings(self):
        """Verify that prompt templates actually change embedding values with Ollama."""
        # Get any available embedding model
        try:
            response = requests.get("http://localhost:11434/api/tags", timeout=2.0)
            models = response.json().get("models", [])

            embedding_models = []
            for model in models:
                name = model["name"]
                base_name = name.split(":")[0]
                if any(emb in base_name for emb in ["embed", "bge", "minilm", "e5", "nomic"]):
                    embedding_models.append(name)

            if not embedding_models:
                pytest.skip("No embedding models available in Ollama")

            model_name = embedding_models[0]
            text = "machine learning"
            host = "http://localhost:11434"

            # Get embeddings without template
            embeddings_no_template = compute_embeddings_ollama(
                texts=[text], model_name=model_name, is_build=False, host=host, provider_options={}
            )

            # Get embeddings with template
            embeddings_with_template = compute_embeddings_ollama(
                texts=[text],
                model_name=model_name,
                is_build=False,
                host=host,
                provider_options={"prompt_template": "search_query: "},
            )

            # Embeddings should be different when template is applied
            assert not np.allclose(embeddings_no_template[0], embeddings_with_template[0])

            logger.info("✓ Ollama prompt template changes embedding values as expected")

        except Exception as e:
            pytest.skip(f"Could not test Ollama prompt template: {e}")


class TestLMStudioSDK:
    """End-to-end tests for LM Studio SDK integration."""

    @pytest.mark.skipif(not check_lmstudio_available(), reason="LM Studio service not available")
    def test_lmstudio_model_listing(self):
        """Test that we can list models from LM Studio."""
        try:
            response = requests.get("http://localhost:1234/v1/models", timeout=5.0)
            assert response.status_code == 200

            data = response.json()
            assert "data" in data

            models = data["data"]
            logger.info(f"✓ LM Studio models available: {len(models)}")

            if models:
                logger.info(f" First model: {models[0].get('id', 'unknown')}")
        except Exception as e:
            pytest.skip(f"LM Studio API error: {e}")

    @pytest.mark.skipif(not check_lmstudio_available(), reason="LM Studio service not available")
    def test_lmstudio_sdk_context_length_detection(self):
        """Test context length detection via LM Studio SDK bridge (requires Node.js + SDK)."""
        model_name = get_lmstudio_first_model()
        if not model_name:
            pytest.skip("No models loaded in LM Studio")

        try:
            from leann.embedding_compute import _query_lmstudio_context_limit

            # SDK requires WebSocket URL (ws://)
            context_length = _query_lmstudio_context_limit(
                model_name=model_name, base_url="ws://localhost:1234"
            )

            if context_length is None:
                logger.warning(
                    "⚠ LM Studio SDK bridge returned None (Node.js or SDK may not be available)"
                )
                pytest.skip("Node.js or @lmstudio/sdk not available - SDK bridge unavailable")
            else:
                assert context_length > 0
                logger.info(
                    f"✓ LM Studio context length detected via SDK: {context_length} for {model_name}"
                )

        except ImportError:
            pytest.skip("_query_lmstudio_context_limit not implemented yet")
        except Exception as e:
            logger.error(f"LM Studio SDK test error: {e}")
            raise


class TestOllamaTokenLimit:
    """End-to-end tests for Ollama token limit discovery."""

    @pytest.mark.skipif(not check_ollama_available(), reason="Ollama service not available")
    def test_ollama_token_limit_detection(self):
        """Test dynamic token limit detection from Ollama /api/show endpoint."""
        # Get any available embedding model
        try:
            response = requests.get("http://localhost:11434/api/tags", timeout=2.0)
            models = response.json().get("models", [])

            embedding_models = []
            for model in models:
                name = model["name"]
                base_name = name.split(":")[0]
                if any(emb in base_name for emb in ["embed", "bge", "minilm", "e5", "nomic"]):
                    embedding_models.append(name)

            if not embedding_models:
                pytest.skip("No embedding models available in Ollama")

            test_model = embedding_models[0]

            # Test token limit detection
            limit = get_model_token_limit(model_name=test_model, base_url="http://localhost:11434")

            assert limit > 0
            logger.info(f"✓ Ollama token limit detected: {limit} for {test_model}")

        except Exception as e:
            pytest.skip(f"Could not test Ollama token detection: {e}")


class TestHybridTokenLimit:
    """End-to-end tests for hybrid token limit discovery mechanism."""

    def test_hybrid_discovery_registry_fallback(self):
        """Test fallback to static registry for known OpenAI models."""
        # Use a known OpenAI model (should be in registry)
        limit = get_model_token_limit(
            model_name="text-embedding-3-small",
            base_url="http://fake-server:9999",  # Fake URL to force registry lookup
        )

        # text-embedding-3-small should have 8192 in registry
        assert limit == 8192
        logger.info(f"✓ Hybrid discovery (registry fallback): {limit} tokens")

    def test_hybrid_discovery_default_fallback(self):
        """Test fallback to safe default for completely unknown models."""
        limit = get_model_token_limit(
            model_name="completely-unknown-model-xyz-12345",
            base_url="http://fake-server:9999",
            default=512,
        )

        # Should get the specified default
        assert limit == 512
        logger.info(f"✓ Hybrid discovery (default fallback): {limit} tokens")

    @pytest.mark.skipif(not check_ollama_available(), reason="Ollama service not available")
    def test_hybrid_discovery_ollama_dynamic_first(self):
        """Test that Ollama models use dynamic discovery first."""
        # Get any available embedding model
        try:
            response = requests.get("http://localhost:11434/api/tags", timeout=2.0)
            models = response.json().get("models", [])

            embedding_models = []
            for model in models:
                name = model["name"]
                base_name = name.split(":")[0]
                if any(emb in base_name for emb in ["embed", "bge", "minilm", "e5", "nomic"]):
                    embedding_models.append(name)

            if not embedding_models:
                pytest.skip("No embedding models available in Ollama")

            test_model = embedding_models[0]

            # Should query Ollama /api/show dynamically
            limit = get_model_token_limit(model_name=test_model, base_url="http://localhost:11434")

            assert limit > 0
            logger.info(f"✓ Hybrid discovery (Ollama dynamic): {limit} tokens for {test_model}")

        except Exception as e:
            pytest.skip(f"Could not test hybrid Ollama discovery: {e}")


if __name__ == "__main__":
    print("\n" + "=" * 70)
    print("INTEGRATION TEST SUITE - Real Service Testing")
    print("=" * 70)
    print("\nThese tests require live services:")
    print(" • LM Studio: http://localhost:1234 (with embedding model loaded)")
    print(" • [Optional] Ollama: http://localhost:11434")
    print(" • [Optional] Node.js + @lmstudio/sdk for SDK bridge tests")
    print("\nRun with: pytest tests/test_prompt_template_e2e.py -v -s")
    print("=" * 70 + "\n")