fix: comprehensive ZMQ timeout and cleanup fixes based on detailed analysis
Based on excellent diagnostic suggestions, implemented multiple fixes: 1. Diagnostics: - Added faulthandler to dump stack traces 10s before CI timeout - Enhanced CI script with trap handler to show processes/network on timeout - Added diag() function to capture pstree, processes, network listeners 2. ZMQ Socket Timeouts (critical fix): - Added RCVTIMEO=1000ms and SNDTIMEO=1000ms to all client sockets - Added IMMEDIATE=1 to avoid connection blocking - Reduced searcher timeout from 30s to 5s - This prevents infinite blocking on recv/send operations 3. Context.instance() Fix (major issue): - NEVER call term() or destroy() on Context.instance() - This was causing blocking as it waits for ALL sockets to close - Now only set linger=0 without terminating 4. Enhanced Process Cleanup: - Added _reap_children fixture for aggressive session-end cleanup - Better recursive child process termination - Added final wait to ensure cleanup completes The 180s timeout was happening because: - ZMQ recv() was blocking indefinitely without timeout - Context.instance().term() was waiting for all sockets - Child processes weren't being fully cleaned up These changes should prevent the hanging completely.
This commit is contained in:
@@ -88,6 +88,9 @@ def compute_embeddings_via_server(chunks: list[str], model_name: str, port: int)
|
||||
context = zmq.Context()
|
||||
socket = context.socket(zmq.REQ)
|
||||
socket.setsockopt(zmq.LINGER, 0) # Don't block on close
|
||||
socket.setsockopt(zmq.RCVTIMEO, 1000) # 1s timeout on receive
|
||||
socket.setsockopt(zmq.SNDTIMEO, 1000) # 1s timeout on send
|
||||
socket.setsockopt(zmq.IMMEDIATE, 1) # Don't wait for connection
|
||||
socket.connect(f"tcp://localhost:{port}")
|
||||
|
||||
try:
|
||||
@@ -623,14 +626,15 @@ class LeannSearcher:
|
||||
if hasattr(self.backend_impl, "embedding_server_manager"):
|
||||
self.backend_impl.embedding_server_manager.stop_server()
|
||||
|
||||
# Force cleanup of ZMQ connections (especially for C++ side)
|
||||
# Set ZMQ linger but don't terminate global context
|
||||
try:
|
||||
import zmq
|
||||
|
||||
# Aggressively terminate all ZMQ contexts to prevent hanging
|
||||
# Just set linger on the global instance
|
||||
ctx = zmq.Context.instance()
|
||||
ctx.linger = 0
|
||||
# Don't call destroy() here as it might affect other components
|
||||
# NEVER call ctx.term() or destroy() on the global instance
|
||||
# That would block waiting for all sockets to close
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
|
||||
@@ -137,8 +137,10 @@ class BaseSearcher(LeannBackendSearcherInterface, ABC):
|
||||
try:
|
||||
context = zmq.Context()
|
||||
socket = context.socket(zmq.REQ)
|
||||
socket.setsockopt(zmq.RCVTIMEO, 30000) # 30 second timeout
|
||||
socket.setsockopt(zmq.LINGER, 0) # Don't block on close
|
||||
socket.setsockopt(zmq.RCVTIMEO, 5000) # 5 second timeout
|
||||
socket.setsockopt(zmq.SNDTIMEO, 5000) # 5 second timeout
|
||||
socket.setsockopt(zmq.IMMEDIATE, 1) # Don't wait for connection
|
||||
socket.connect(f"tcp://localhost:{zmq_port}")
|
||||
|
||||
# Send embedding request
|
||||
@@ -202,13 +204,14 @@ class BaseSearcher(LeannBackendSearcherInterface, ABC):
|
||||
if hasattr(self, "embedding_server_manager"):
|
||||
self.embedding_server_manager.stop_server()
|
||||
|
||||
# Force cleanup of ZMQ connections (especially for C++ side in HNSW/DiskANN)
|
||||
# Set ZMQ linger but don't terminate global context
|
||||
try:
|
||||
import zmq
|
||||
|
||||
# Set short linger to prevent blocking
|
||||
# Just set linger on the global instance
|
||||
ctx = zmq.Context.instance()
|
||||
ctx.linger = 0
|
||||
# NEVER call ctx.term() on the global instance
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
|
||||
Reference in New Issue
Block a user