Based on excellent diagnostic suggestions, implemented multiple fixes:
1. Diagnostics:
- Added faulthandler to dump stack traces 10s before CI timeout
- Enhanced CI script with trap handler to show processes/network on timeout
- Added diag() function to capture pstree, processes, network listeners
2. ZMQ Socket Timeouts (critical fix):
- Added RCVTIMEO=1000ms and SNDTIMEO=1000ms to all client sockets
- Added IMMEDIATE=1 to avoid connection blocking
- Reduced searcher timeout from 30s to 5s
- This prevents infinite blocking on recv/send operations
3. Context.instance() Fix (major issue):
- NEVER call term() or destroy() on Context.instance()
- This was causing blocking as it waits for ALL sockets to close
- Now only set linger=0 without terminating
4. Enhanced Process Cleanup:
- Added _reap_children fixture for aggressive session-end cleanup
- Better recursive child process termination
- Added final wait to ensure cleanup completes
The 180s timeout was happening because:
- ZMQ recv() was blocking indefinitely without timeout
- Context.instance().term() was waiting for all sockets
- Child processes weren't being fully cleaned up
These changes should prevent the hanging completely.