2 changed files with 3 additions and 164 deletions
--- a/ISSUE_159_CONCLUSION.md
+++ b/ISSUE_159_CONCLUSION.md
@@ -1,110 +0,0 @@
-# Issue #159 Performance Analysis - Conclusion
-
-## Problem Summary
-User reported search times of 15-30 seconds instead of the ~2 seconds mentioned in the paper.
-
-**Configuration:**
-- GPU: 4090×1
-- Embedding Model: BAAI/bge-large-zh-v1.5 (~300M parameters)
-- Data Size: 180MB text (~90K chunks)
-- Backend: HNSW
-- beam_width: 10
-- Other parameters: Default values
-
-## Root Cause Analysis
-
-### 1. **Search Complexity Parameter**
-The **default `complexity` parameter is 64**, which is too high for achieving ~2 second search times with this configuration.
-
-**Test Results (Reproduced):**
-- **Complexity 64 (default)**: **36.17 seconds** ❌
-- **Complexity 32**: **2.49 seconds** ✅
-- **Complexity 16**: **2.24 seconds** ✅ (Close to paper's ~2 seconds)
-- **Complexity 8**: **1.67 seconds** ✅
-
-### 2. **beam_width Parameter**
-The `beam_width` parameter is **mainly for DiskANN backend**, not HNSW. Setting it to 10 has minimal/no effect on HNSW search performance.
-
-### 3. **Embedding Model Size**
-The paper uses a smaller embedding model (~100M parameters), while the user is using `BAAI/bge-large-zh-v1.5` (~300M parameters). This contributes to slower embedding computation during search, but the main bottleneck is the search complexity parameter.
-
-## Solution
-
-### **Recommended Fix: Reduce Search Complexity**
-
-To achieve search times close to ~2 seconds, use:
-
-```python
-from leann.api import LeannSearcher
-
-searcher = LeannSearcher(INDEX_PATH)
-results = searcher.search(
-    query="your query",
-    top_k=10,
-    complexity=16,  # or complexity=32 for slightly better accuracy
-    # beam_width parameter doesn't affect HNSW, can be ignored
-)
-```
-
-Or via CLI:
-```bash
-leann search your-index "your query" --complexity 16
-```
-
-### **Alternative Solutions**
-
-1. **Use DiskANN Backend** (Recommended by maintainer)
-   - DiskANN is faster for large datasets
-   - Better performance scaling
-   - `beam_width` parameter is relevant here
-   ```python
-   builder = LeannBuilder(backend_name="diskann")
-   ```
-
-2. **Use Smaller Embedding Model**
-   - Switch to a smaller model (~100M parameters) like the paper
-   - Faster embedding computation
-   - Example: `BAAI/bge-base-zh-v1.5` instead of `bge-large-zh-v1.5`
-
-3. **Disable Recomputation** (Trade storage for speed)
-   - Use `--no-recompute` flag
-   - Stores all embeddings (much larger storage)
-   - Faster search (no embedding recomputation)
-   ```bash
-   leann build your-index --no-recompute --no-compact
-   leann search your-index "query" --no-recompute
-   ```
-
-## Performance Comparison
-
-| Complexity | Search Time | Accuracy | Recommendation |
-|------------|-------------|----------|---------------|
-| 64 (default) | ~36s | Highest | ❌ Too slow |
-| 32 | ~2.5s | High | ✅ Good balance |
-| 16 | ~2.2s | Good | ✅ **Recommended** (matches paper) |
-| 8 | ~1.7s | Lower | ⚠️ May sacrifice accuracy |
-
-## Key Takeaways
-
-1. **The default `complexity=64` is optimized for accuracy, not speed**
-2. **For ~2 second search times, use `complexity=16` or `complexity=32`**
-3. **`beam_width` parameter is for DiskANN, not HNSW**
-4. **The paper's ~2 second results likely used:**
-   - Smaller embedding model (~100M params)
-   - Lower complexity (16-32)
-   - Possibly DiskANN backend
-
-## Verification
-
-The issue has been reproduced and verified. The test script `test_issue_159.py` demonstrates:
-- Default complexity (64) results in ~36 second search times
-- Reducing complexity to 16-32 achieves ~2 second search times
-- This matches the user's reported issue and provides a clear solution
-
-## Next Steps
-
-1. ✅ Issue reproduced and root cause identified
-2. ✅ Solution provided (reduce complexity parameter)
-3. ⏳ User should test with `complexity=16` or `complexity=32`
-4. ⏳ Consider updating documentation to clarify complexity parameter trade-offs
-
--- a/benchmarks/issue_159.py
+++ b/benchmarks/issue_159.py
@@ -2,10 +2,9 @@
 """
 Test script to reproduce issue #159: Slow search performance
 Configuration:
-- GPU: 4090×1
+- GPU: A10
 - embedding_model: BAAI/bge-large-zh-v1.5
 - data size: 180M text (~90K chunks)
-- beam_width: 10 (though this is mainly for DiskANN, not HNSW)
 - backend: hnsw
 """

@@ -13,7 +12,7 @@ import os
 import time
 from pathlib import Path

-from leann.api import LeannBuilder, LeannSearcher, SearchResult
+from leann.api import LeannBuilder, LeannSearcher

 os.environ["LEANN_LOG_LEVEL"] = "DEBUG"

@@ -29,7 +28,7 @@ def generate_test_data(num_chunks=90000, chunk_size=2000):
    # 90K chunks * 2000 chars ≈ 180MB
    chunks = []
    base_text = (
-        "这是一个测试文档。LEANN是一个创新的向量数据库，通过图基选择性重计算实现97%的存储节省。"
+        "这是一个测试文档。LEANN是一个创新的向量数据库, 通过图基选择性重计算实现97%的存储节省。"
    )

    for i in range(num_chunks):
@@ -83,42 +82,6 @@ def test_search_performance():

    test_query = "LEANN向量数据库存储优化"

-    # Test with default complexity (64)
-    print("\n  Test 1: Default complexity (64) `1 ")
-    print(f"    Query: '{test_query}'")
-    start_time = time.time()
-    results: list[SearchResult] = searcher.search(test_query, top_k=10, complexity=64)
-    search_time = time.time() - start_time
-    print(f"    ✓ Search completed in {search_time:.2f} seconds")
-    print(f"    Results: {len(results)} items")
-
-    # Test with default complexity (64)
-    print("\n  Test 1: Default complexity (64)")
-    print(f"    Query: '{test_query}'")
-    start_time = time.time()
-    results = searcher.search(test_query, top_k=10, complexity=64)
-    search_time = time.time() - start_time
-    print(f"    ✓ Search completed in {search_time:.2f} seconds")
-    print(f"    Results: {len(results)} items")
-
-    # Test with lower complexity (32)
-    print("\n  Test 2: Lower complexity (32)")
-    print(f"    Query: '{test_query}'")
-    start_time = time.time()
-    results = searcher.search(test_query, top_k=10, complexity=32)
-    search_time = time.time() - start_time
-    print(f"    ✓ Search completed in {search_time:.2f} seconds")
-    print(f"    Results: {len(results)} items")
-
-    # Test with even lower complexity (16)
-    print("\n  Test 3: Lower complexity (16)")
-    print(f"    Query: '{test_query}'")
-    start_time = time.time()
-    results = searcher.search(test_query, top_k=10, complexity=16)
-    search_time = time.time() - start_time
-    print(f"    ✓ Search completed in {search_time:.2f} seconds")
-    print(f"    Results: {len(results)} items")
-
    # Test with minimal complexity (8)
    print("\n  Test 4: Minimal complexity (8)")
    print(f"    Query: '{test_query}'")
@@ -129,20 +92,6 @@ def test_search_performance():
    print(f"    Results: {len(results)} items")

    print("\n" + "=" * 80)
-    print("Performance Analysis:")
-    print("=" * 80)
-    print("\nKey Findings:")
-    print("1. beam_width parameter is mainly for DiskANN backend, not HNSW")
-    print("2. For HNSW, the main parameter affecting search speed is 'complexity'")
-    print("3. Lower complexity values (16-32) should provide faster search")
-    print("4. The paper mentions ~2 seconds, which likely uses:")
-    print("   - Smaller embedding model (~100M params vs 300M for bge-large)")
-    print("   - Lower complexity (16-32)")
-    print("   - Possibly DiskANN backend for better performance")
-    print("\nRecommendations:")
-    print("- Try complexity=16 or complexity=32 for faster search")
-    print("- Consider using DiskANN backend for better performance on large datasets")
-    print("- Or use a smaller embedding model if speed is critical")


 if __name__ == "__main__":
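
Review note: the per-complexity blocks removed from `test_search_performance()` all repeat the same time-and-print pattern. If the comparison across complexity values is ever wanted again, it could be written as one loop. The sketch below is only an illustration and is not part of this change: `time_search` and `fake_search` are hypothetical names, and `fake_search` is a stand-in that mimics the `search(query, top_k=..., complexity=...)` call shape used in the benchmark so the example runs without an index.

```python
import time
from typing import Callable, Sequence


def time_search(
    search_fn: Callable[..., Sequence],
    query: str,
    complexities: Sequence[int],
    top_k: int = 10,
) -> dict[int, float]:
    """Return wall-clock seconds per complexity value for a single query."""
    timings: dict[int, float] = {}
    for complexity in complexities:
        start = time.perf_counter()
        results = search_fn(query, top_k=top_k, complexity=complexity)
        timings[complexity] = time.perf_counter() - start
        print(
            f"complexity={complexity}: "
            f"{timings[complexity]:.2f}s, {len(results)} results"
        )
    return timings


def fake_search(query: str, top_k: int = 10, complexity: int = 64) -> list[str]:
    # Hypothetical stand-in for searcher.search; it only simulates a cost
    # that grows with complexity and returns top_k placeholder results.
    time.sleep(complexity / 10_000)
    return [f"result-{i}" for i in range(top_k)]


timings = time_search(fake_search, "LEANN向量数据库存储优化", [64, 32, 16, 8])
```

In the benchmark itself, `searcher.search` would be passed in place of `fake_search`; the timings printed by this stub say nothing about real search latency.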
Author	SHA1	Message	Date
Andy Lee	ed15776564	push	2025-11-24 08:07:51 +00:00
Andy Lee	8d202b8b0e	chore	2025-11-24 08:05:51 +00:00
Andy Lee	9ac9eab48d	fix	2025-11-24 08:05:29 +00:00
Andy Lee	cd1d853a46	fix	2025-11-24 08:01:42 +00:00