fix: comprehensive ZMQ timeout and cleanup fixes based on detailed analysis

Based on excellent diagnostic suggestions, implemented multiple fixes:

1. Diagnostics:
   - Added faulthandler to dump stack traces 10s before CI timeout
   - Enhanced CI script with trap handler to show processes/network on timeout
   - Added diag() function to capture pstree, processes, network listeners

2. ZMQ Socket Timeouts (critical fix):
   - Added RCVTIMEO=1000ms and SNDTIMEO=1000ms to all client sockets
   - Added IMMEDIATE=1 to avoid connection blocking
   - Reduced searcher timeout from 30s to 5s
   - This prevents infinite blocking on recv/send operations

3. Context.instance() Fix (major issue):
   - NEVER call term() or destroy() on Context.instance()
   - This was causing blocking as it waits for ALL sockets to close
   - Now only set linger=0 without terminating

4. Enhanced Process Cleanup:
   - Added _reap_children fixture for aggressive session-end cleanup
   - Better recursive child process termination
   - Added final wait to ensure cleanup completes

The 180s timeout was happening because:
- ZMQ recv() was blocking indefinitely without timeout
- Context.instance().term() was waiting for all sockets
- Child processes weren't being fully cleaned up

These changes should prevent the hanging completely.
This commit is contained in:
Andy Lee
2025-08-08 18:29:09 -07:00
parent a6dad47280
commit a35bfb0354
4 changed files with 87 additions and 25 deletions

View File

@@ -263,14 +263,32 @@ jobs:
# Activate virtual environment
source .venv/bin/activate || source .venv/Scripts/activate
# Define diagnostic function for debugging hangs
diag() {
echo "===== DIAG BEGIN ====="
date
echo "# pstree (current shell group)"
pstree -ap $$ 2>/dev/null || true
echo "# python/pytest processes"
ps -ef | grep -E 'python|pytest|embedding|zmq|diskann' | grep -v grep || true
echo "# network listeners"
ss -ltnp 2>/dev/null || netstat -ltn 2>/dev/null || true
echo "# pytest PIDs"
pgrep -fa pytest || true
echo "===== DIAG END ====="
}
# Run all tests with timeout on Linux to prevent hanging
if [[ "$RUNNER_OS" == "Linux" ]]; then
echo "Running tests with timeout (Linux)..."
timeout --signal=INT 180 pytest tests/ -v || {
# Set trap for diagnostics
trap diag INT TERM
timeout --signal=INT 180 pytest tests/ -vv --maxfail=3 || {
EXIT_CODE=$?
if [ $EXIT_CODE -eq 124 ]; then
echo "⚠️ Tests timed out after 180 seconds - likely process cleanup issue"
echo "Check for lingering ZMQ connections or child processes"
echo "⚠️ Tests timed out after 180 seconds - dumping diagnostics..."
diag
# Try to clean up any leftover processes
pkill -TERM -P $$ || true
sleep 1
@@ -281,7 +299,7 @@ jobs:
else
# For macOS/Windows, run without GNU timeout
echo "Running tests ($RUNNER_OS)..."
pytest tests/ -v
pytest tests/ -vv --maxfail=3
fi
- name: Run sanity checks (optional)