Fixed the actual root cause instead of just masking it in tests:
1. Root Problem:
- The C++-side ZmqDistanceComputer creates ZMQ connections but never cleans them up
- Python 3.9/3.13 are more sensitive to cleanup timing during shutdown
2. Core Fixes in SearcherBase and LeannSearcher:
- Added cleanup() method to BaseSearcher that cleans ZMQ and embedding server
- LeannSearcher.cleanup() now also handles ZMQ context cleanup
- Both HNSW and DiskANN searchers now properly delete C++ index objects
3. Backend-Specific Cleanup:
- HNSWSearcher.cleanup(): Deletes self.index to trigger C++ destructors
- DiskannSearcher.cleanup(): Deletes self._index and resets state
- Both force garbage collection after deletion
4. Test Infrastructure:
- Added auto_cleanup_searcher fixture for explicit resource management
- Global cleanup now more aggressive with ZMQ context destruction
This is the proper fix - cleaning up resources at the source, not just
working around the issue in tests. The hanging was caused by C++ side
ZMQ connections not being properly terminated when is_recompute=True.
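The cleanup pattern described above can be sketched as follows; the class skeleton and attribute names (`self.index`, `self._zmq_context`) are simplified stand-ins for illustration, not the project's actual code:

```python
import gc


class HNSWSearcher:
    """Illustrative skeleton showing the explicit-cleanup pattern."""

    def __init__(self):
        self.index = object()      # stands in for the C++ index object
        self._zmq_context = None   # stands in for a zmq.Context, if one is held

    def cleanup(self):
        """Release C++ and ZMQ resources deterministically."""
        # Dropping the last Python reference triggers the C++ destructor,
        # which closes the ZMQ connections held on the C++ side.
        if getattr(self, "index", None) is not None:
            del self.index
        if self._zmq_context is not None:
            self._zmq_context.term()
            self._zmq_context = None
        # Force collection so destructors run now, not during the less
        # predictable interpreter shutdown that 3.9/3.13 are sensitive to.
        gc.collect()
```

Calling `cleanup()` explicitly (e.g. from a test fixture's teardown) avoids relying on shutdown-time finalization entirely.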
Based on an excellent analysis from a user, implemented comprehensive fixes:
1. ZMQ Socket Cleanup:
- Set LINGER=0 on all ZMQ sockets (client and server)
- Use try-finally blocks to ensure socket.close() and context.term()
- Prevents blocking on exit when ZMQ contexts have pending operations
2. Global Test Cleanup:
- Added tests/conftest.py with session-scoped cleanup fixture
- Cleans up leftover ZMQ contexts and child processes after all tests
- Lists remaining threads for debugging
3. CI Improvements:
- Apply timeout to ALL Python versions on Linux (not just 3.13)
- Increased timeout to 180s for better reliability
- Added process cleanup (pkill) on timeout
4. Dependencies:
- Added psutil>=5.9.0 to test dependencies for process management
Root cause: Python 3.9/3.13 are more sensitive to cleanup timing during
interpreter shutdown. ZMQ's default LINGER=-1 was blocking exit, and
atexit handlers were unreliable for cleanup.
This should resolve the 'all tests pass but CI hangs' issue.
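A minimal sketch of the LINGER=0 plus try/finally pattern on the client side, assuming pyzmq; the endpoint, function name, and PUSH socket type are illustrative:

```python
import zmq


def send_request(payload: bytes, endpoint: str = "tcp://127.0.0.1:5555") -> bool:
    """Send one message and tear the socket down without lingering."""
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PUSH)
    # LINGER=0: on close(), drop any unsent messages immediately instead of
    # blocking until delivery (the default LINGER=-1 can hang at exit).
    sock.setsockopt(zmq.LINGER, 0)
    try:
        sock.connect(endpoint)
        try:
            sock.send(payload, flags=zmq.NOBLOCK)
        except zmq.Again:
            pass  # no peer listening; acceptable for this sketch
        return True
    finally:
        sock.close()
        ctx.term()  # returns promptly because the socket does not linger
```

The try/finally guarantees `close()` and `term()` run even when the send fails, which is exactly what the atexit handlers failed to guarantee.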
- Add flush=True to all print statements in convert_to_csr.py to prevent buffer deadlock
- Redirect embedding server stdout/stderr to DEVNULL in CI environment (CI=true)
- Fix timeout in embedding_server_manager.stop_server() final wait call
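The CI output redirection could look roughly like this; the function name is hypothetical, and the `CI=true` check follows the environment-variable convention GitHub Actions sets:

```python
import os
import subprocess
import sys


def start_embedding_server(cmd):
    """Launch the server, silencing its output when running in CI."""
    in_ci = os.environ.get("CI") == "true"
    # In CI nothing drains the child's pipes; once a pipe buffer fills, the
    # child blocks on write and the run deadlocks. DEVNULL avoids this.
    sink = subprocess.DEVNULL if in_ci else None  # None inherits parent streams
    return subprocess.Popen(cmd, stdout=sink, stderr=sink)
```

The same reasoning motivates `flush=True` on the prints in convert_to_csr.py: unflushed output sitting in a full buffer is what produces the deadlock.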
- Replace 'int | None' with 'Optional[int]' everywhere
- Replace 'subprocess.Popen | None' with 'Optional[subprocess.Popen]'
- Add Optional import to all affected files
- Update ruff target-version from py310 to py39
- The '|' syntax for Union types was introduced in Python 3.10 (PEP 604)
Fixes TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
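The shape of the change, with a made-up helper for illustration: on Python 3.9, evaluating an annotation like `int | None` at function-definition time raises the TypeError above, while `Optional[int]` works on every supported version.

```python
from typing import Optional


def parse_port(value: str) -> Optional[int]:
    """Return the port as an int, or None for non-numeric input."""
    # On Python 3.9, writing the return annotation as `int | None` would
    # raise: TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
    return int(value) if value.isdigit() else None
```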
* fix: auto-detect normalized embeddings and use cosine distance
- Add automatic detection for normalized embedding models (OpenAI, Voyage AI, Cohere)
- Automatically set distance_metric='cosine' for normalized embeddings
- Add warnings when using non-optimal distance metrics
- Implement manual L2 normalization in HNSW backend (custom Faiss build lacks normalize_L2)
- Fix DiskANN zmq_port compatibility with lazy loading strategy
- Add documentation for normalized embeddings feature
This fixes the low accuracy issue when using OpenAI text-embedding-3-small model with default MIPS metric.
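The manual L2 normalization can be sketched with numpy as follows; it is a stand-in for `faiss.normalize_L2`, and the function name and the assumption of a 2-D float array are mine:

```python
import numpy as np


def l2_normalize(embeddings: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Row-wise L2 normalization; afterwards, inner product == cosine."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)  # eps guards all-zero rows
```

This is why cosine distance is the right default for already-normalized models: once every vector has unit norm, maximum inner product search and cosine similarity rank results identically, whereas raw MIPS on unnormalized assumptions degrades accuracy.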
* style: format
* feat: add OpenAI embeddings support to google_history_reader_leann.py
- Add --embedding-model and --embedding-mode arguments
- Support automatic detection of normalized embeddings
- Works correctly with cosine distance for OpenAI embeddings
* feat: add --use-existing-index option to google_history_reader_leann.py
- Allow using existing index without rebuilding
- Useful for testing pre-built indices
* fix: Improve OpenAI embeddings handling in HNSW backend
- Fix ambiguous fullwidth characters (commas, parentheses) in strings and comments
- Replace Chinese comments with English equivalents
- Fix unused imports with proper noqa annotations for intentional imports
- Fix bare except clauses with specific exception types
- Fix redefined variables and undefined names
- Add ruff noqa annotations for generated protobuf files
- Add lint and format check to GitHub Actions CI pipeline
- Convert relative paths to absolute paths based on metadata file location
- Fixes FileNotFoundError when starting embedding server
- Resolves issue with passages file not found in different working directories
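The path resolution described above might look like this sketch; the function name and the assumption that the metadata file sits next to the passages file are mine:

```python
from pathlib import Path


def resolve_passages_path(meta_path, passages_file):
    """Resolve a passages path stored in index metadata.

    Anchors relative paths at the metadata file's directory rather than the
    current working directory, so the embedding server finds the file no
    matter where the process was launched from.
    """
    p = Path(passages_file)
    if p.is_absolute():
        return p
    return (Path(meta_path).parent / p).resolve()
```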
* chore: shorter build time
* chore: update faiss
* fix: no longer reuse the embedding server
* fix: do not reuse emb_server and close it properly
* feat: cli tool
* feat: cli more args
* fix: same embedding logic
* fix: diskann zmq port and passages
* feat: auto-discovery of packages and fix passage gen for diskann
* docs: embedding pruning
* refactor: passage structure
* feat: reproducible research datasets, rpj_wiki & dpr
* refactor: chat and base searcher
* feat: chat on mps