fix: comprehensive ZMQ timeout and cleanup fixes based on detailed analysis

Based on diagnostic suggestions, this commit implements multiple fixes:

1. Diagnostics:
   - Added faulthandler to dump stack traces 10s before CI timeout
   - Enhanced CI script with trap handler to show processes/network on timeout
   - Added diag() function to capture pstree, processes, network listeners
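
A minimal sketch of the faulthandler watchdog described above, using only the standard library. The 180s limit and 10s margin come from this message; the helper names arm_watchdog and disarm_watchdog are hypothetical and are not taken from the changed files:

```python
import faulthandler
import sys

CI_TIMEOUT_S = 180   # CI job limit mentioned in this message
DUMP_MARGIN_S = 10   # dump stacks this many seconds before the limit


def arm_watchdog(ci_timeout_s=CI_TIMEOUT_S, margin_s=DUMP_MARGIN_S, file=None):
    """Schedule a dump of every thread's stack shortly before the CI limit."""
    delay = max(1, ci_timeout_s - margin_s)
    target = file if file is not None else sys.stderr
    # exit=False: only report, let the CI runner decide when to kill the job.
    faulthandler.dump_traceback_later(delay, repeat=False, file=target, exit=False)
    return delay


def disarm_watchdog():
    """Cancel the pending dump, e.g. when the test session ends normally."""
    faulthandler.cancel_dump_traceback_later()
```

Calling arm_watchdog() once at session start and disarm_watchdog() at session end would be enough; the dump only appears if the run is still alive after the delay.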

2. ZMQ Socket Timeouts (critical fix):
   - Added RCVTIMEO=1000ms and SNDTIMEO=1000ms to the embedding client socket
   - Added IMMEDIATE=1 so messages are queued only on completed connections
   - Reduced searcher RCVTIMEO from 30s to 5s and added SNDTIMEO=5s
   - This prevents infinite blocking on recv/send operations
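
A sketch of how these socket options combine, assuming pyzmq is installed. request_with_timeout is a hypothetical helper, not a function from this repository. With no server reachable, send() or recv() raises zmq.Again once the timeout expires instead of blocking forever:

```python
import zmq


def request_with_timeout(port, payload, timeout_ms=1000):
    """Send one REQ message and return the reply, or None on timeout."""
    ctx = zmq.Context()  # private context, so term() below is safe
    sock = ctx.socket(zmq.REQ)
    sock.setsockopt(zmq.LINGER, 0)             # drop unsent messages on close
    sock.setsockopt(zmq.RCVTIMEO, timeout_ms)  # recv() raises zmq.Again after this
    sock.setsockopt(zmq.SNDTIMEO, timeout_ms)  # send() raises zmq.Again after this
    sock.setsockopt(zmq.IMMEDIATE, 1)          # queue only on completed connections
    sock.connect(f"tcp://localhost:{port}")
    try:
        sock.send(payload)
        return sock.recv()
    except zmq.Again:
        return None  # timed out: the server is down or too slow
    finally:
        sock.close()
        ctx.term()   # returns promptly: the only socket is closed, linger is 0
```

Returning None is one possible design; a caller that needs to tell a send timeout from a receive timeout could catch zmq.Again separately around each call.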

3. Context.instance() Fix (major issue):
   - NEVER call term() or destroy() on Context.instance()
   - term() blocks until ALL sockets on the shared context are closed, which caused the hang
   - Now only set linger=0 without terminating
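
A sketch of the cleanup pattern described above, assuming pyzmq is installed; relax_shared_context is a hypothetical name. To my understanding of pyzmq, assigning ctx.linger sets the default for sockets created from that context afterwards, so sockets that already exist still need their own LINGER set:

```python
import zmq


def relax_shared_context():
    """Set linger=0 on the process-wide context without terminating it."""
    ctx = zmq.Context.instance()  # shared singleton, other components may use it
    ctx.linger = 0                # default LINGER for sockets created from now on
    # Deliberately no ctx.term() / ctx.destroy(): term() would block until
    # every socket created from this shared context has been closed.
    return ctx
```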

4. Enhanced Process Cleanup:
   - Added _reap_children fixture for aggressive session-end cleanup
   - Better recursive child process termination
   - Added final wait to ensure cleanup completes
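
A stdlib-only sketch of the terminate, escalate, wait sequence. The helper reap is hypothetical and is not the _reap_children fixture itself; it handles only direct subprocess.Popen children, whereas recursive cleanup of grandchildren would need a process-tree walk that is not shown here:

```python
import subprocess


def reap(procs, grace_s=2.0):
    """Terminate each child, kill it if it ignores the request, then wait."""
    for p in procs:
        if p.poll() is None:      # still running
            p.terminate()         # polite request first
    for p in procs:
        try:
            p.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            p.kill()              # escalate after the grace period
            p.wait()              # final wait so the child is fully reaped
    return [p.returncode for p in procs]
```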

The 180s timeout was happening because:
- ZMQ recv() was blocking indefinitely without timeout
- Context.instance().term() was waiting for all sockets
- Child processes weren't being fully cleaned up

Together, these changes should prevent the hang.
Andy Lee
2025-08-08 18:29:09 -07:00
parent a6dad47280
commit a35bfb0354
4 changed files with 87 additions and 25 deletions


@@ -88,6 +88,9 @@ def compute_embeddings_via_server(chunks: list[str], model_name: str, port: int)
     context = zmq.Context()
     socket = context.socket(zmq.REQ)
     socket.setsockopt(zmq.LINGER, 0)  # Don't block on close
+    socket.setsockopt(zmq.RCVTIMEO, 1000)  # 1s timeout on receive
+    socket.setsockopt(zmq.SNDTIMEO, 1000)  # 1s timeout on send
+    socket.setsockopt(zmq.IMMEDIATE, 1)  # Don't wait for connection
     socket.connect(f"tcp://localhost:{port}")
     try:
@@ -623,14 +626,15 @@ class LeannSearcher:
         if hasattr(self.backend_impl, "embedding_server_manager"):
             self.backend_impl.embedding_server_manager.stop_server()
-        # Force cleanup of ZMQ connections (especially for C++ side)
+        # Set ZMQ linger but don't terminate global context
         try:
             import zmq
-            # Aggressively terminate all ZMQ contexts to prevent hanging
+            # Just set linger on the global instance
             ctx = zmq.Context.instance()
             ctx.linger = 0
-            # Don't call destroy() here as it might affect other components
+            # NEVER call ctx.term() or destroy() on the global instance
+            # That would block waiting for all sockets to close
         except Exception:
             pass


@@ -137,8 +137,10 @@ class BaseSearcher(LeannBackendSearcherInterface, ABC):
         try:
             context = zmq.Context()
             socket = context.socket(zmq.REQ)
-            socket.setsockopt(zmq.RCVTIMEO, 30000)  # 30 second timeout
             socket.setsockopt(zmq.LINGER, 0)  # Don't block on close
+            socket.setsockopt(zmq.RCVTIMEO, 5000)  # 5 second timeout
+            socket.setsockopt(zmq.SNDTIMEO, 5000)  # 5 second timeout
+            socket.setsockopt(zmq.IMMEDIATE, 1)  # Don't wait for connection
             socket.connect(f"tcp://localhost:{zmq_port}")
             # Send embedding request
@@ -202,13 +204,14 @@ class BaseSearcher(LeannBackendSearcherInterface, ABC):
         if hasattr(self, "embedding_server_manager"):
             self.embedding_server_manager.stop_server()
-        # Force cleanup of ZMQ connections (especially for C++ side in HNSW/DiskANN)
+        # Set ZMQ linger but don't terminate global context
         try:
             import zmq
-            # Set short linger to prevent blocking
+            # Just set linger on the global instance
             ctx = zmq.Context.instance()
             ctx.linger = 0
+            # NEVER call ctx.term() on the global instance
         except Exception:
             pass