libzmq: Surplus of errno_assert() leading to daemon crash

A daemon is a program designed to run forever, so every error that is not fatal should be handled and the show must go on. ZMQ currently has 404 errno_assert calls: 404 ways to make a daemon crash with SIGABRT. Consider this function from tcp.cpp:

    void zmq::tune_tcp_socket (fd_t s_)
    {
        //  Disable Nagle's algorithm. We are doing data batching on 0MQ level,
        //  so using Nagle wouldn't improve throughput in anyway, but it would
        //  hurt latency.
        int nodelay = 1;
        int rc = setsockopt (s_, IPPROTO_TCP, TCP_NODELAY, (char*) &nodelay,
            sizeof (int));
    #ifdef ZMQ_HAVE_WINDOWS
        wsa_assert (rc != SOCKET_ERROR);
    #else
        errno_assert (rc == 0);
    #endif

    #ifdef ZMQ_HAVE_OPENVMS
        //  Disable delayed acknowledgements as they hurt latency significantly.
        int nodelack = 1;
        rc = setsockopt (s_, IPPROTO_TCP, TCP_NODELACK, (char*) &nodelack,
            sizeof (int));
        errno_assert (rc != SOCKET_ERROR);
    #endif
    }

When setsockopt() returns an error, the daemon crashes. There is a trivial scenario, involving no programming error at all, in which this happens: the remote side sends a TCP reset (RST) packet, which immediately invalidates the socket, and instead of reconnecting, ZMQ crashes the whole app.

I was debugging my app that coredumped at this particular function:

    Thread 1 (Thread 802007c00 (LWP 101563/firsthop-receiver)):
    #0  0x0000000801896dcc in thr_kill () from /lib/libc.so.7
    #1  0x000000080193d72b in abort () from /lib/libc.so.7
    #2  0x0000000000415ac1 in zmq::zmq_abort (errmsg_=Could not find the frame base for "zmq::zmq_abort(char const*)".
    ) at src/err.cpp:84
    #3  0x0000000000453a6e in zmq::tune_tcp_socket (s_=17) at src/tcp.cpp:60
    #4  0x0000000000454524 in zmq::tcp_connecter_t::out_event (this=0x80285a600) at src/tcp_connecter.cpp:134
    #5  0x0000000000416be6 in zmq::kqueue_t::loop (this=0x802051300) at src/kqueue.cpp:205
    #6  0x0000000000416ce5 in zmq::kqueue_t::worker_routine (arg_=0x802051300) at src/kqueue.cpp:222
    #7  0x0000000000434bd8 in thread_routine (arg_=0x802051380) at src/thread.cpp:96
    #8  0x0000000801618e14 in pthread_getprio () from /lib/libthr.so.3
    #9  0x0000000000000000 in ?? ()
    (gdb) thread 1
    [Switching to thread 1 (Thread 802007c00 (LWP 101563/firsthop-receiver))]#3  0x0000000000453a6e in zmq::tune_tcp_socket (s_=17) at src/tcp.cpp:60
    60          errno_assert (rc == 0);
    (gdb) p errstr
    $4 = 0x801b7b240 "Connection reset by peer"
    (gdb)

Sure, I can rewrite this function to ignore a failure to disable Nagle's algorithm or delayed ACKs, but 402 errno_assert() calls will remain in the code. Am I missing something?
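To make the kind of rewrite I have in mind concrete, here is a minimal POSIX-only sketch. It is not libzmq code: the names tune_tcp_socket_checked, is_recoverable_socket_error, tune_result and tune_or_classify are all hypothetical, and the list of errno values treated as recoverable is my own assumption about which failures are plain network conditions. The idea is that the tuning function reports the failure and the caller decides whether to reconnect or to treat it as a bug.

```cpp
#include <cerrno>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Hypothetical helper (not part of libzmq): true for errno values that
// describe a network condition rather than a programming error.
// The exact set is an assumption made for this sketch.
bool is_recoverable_socket_error (int err)
{
    switch (err) {
        case ECONNRESET:
        case ECONNREFUSED:
        case ECONNABORTED:
        case ETIMEDOUT:
        case EHOSTUNREACH:
        case ENETUNREACH:
        case ENETDOWN:
        case ENETRESET:
            return true;
        default:
            return false;
    }
}

// Hypothetical variant of tune_tcp_socket(): returns 0 on success, or -1
// with errno set by setsockopt(), instead of asserting.
int tune_tcp_socket_checked (int fd)
{
    int nodelay = 1;
    return setsockopt (fd, IPPROTO_TCP, TCP_NODELAY, &nodelay,
                       sizeof nodelay);
}

// What the caller should do after trying to tune the socket.
enum class tune_result { ok, reconnect, fatal };

// Hypothetical caller-side policy: network errors lead to a reconnect,
// anything else (e.g. EBADF) is still treated as a bug.
tune_result tune_or_classify (int fd)
{
    if (tune_tcp_socket_checked (fd) == 0)
        return tune_result::ok;
    return is_recoverable_socket_error (errno) ? tune_result::reconnect
                                               : tune_result::fatal;
}
```

With something like this, the connecter could close the socket and schedule a reconnect on tune_result::reconnect, and keep the assert only for tune_result::fatal, so genuine bugs would still be caught early.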

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 20 (14 by maintainers)

Most upvoted comments

Can confirm this client reliably crashes our 4.2.0 services, tested against both ROUTER and PUB sockets on a TCP endpoint (although reading the report I would not expect the ZMQ socket pattern to matter).

Please read the links and the discussion again: these are all errors that intentionally cause a SIGABRT so that they are found immediately and fixed.