libzmq: TCP reset causing dropped messages with PUSH/PULL
Issue description
The docs say that the TCP transport is reliable; however, I’m seeing dropped messages when using iptables to forcibly reset the TCP connection. Merely dropping the packets works as expected, but messages are lost when the connection is fully reset. http://api.zeromq.org/4-2:zmq-tcp
We found this issue in a production environment when a bad kernel configuration started sending TCP reset packets.
Is this expected behavior, and is there any way to make message sending completely reliable?
Environment
- libzmq version (commit hash if unreleased): zeromq 4.3.1
- OS: Linux
Minimal test code / Steps to reproduce the issue
Create a PUSH/PULL pair with one thread sending messages and one receiving them, and then forcibly interrupt the connection with TCP resets.
Run:
iptables -I OUTPUT -p TCP --dport 5333 -j REJECT --reject-with tcp-reset
... wait a second
iptables -F
iptables -X
Code and example output:
https://gist.github.com/d4l3k/d3e6d40495ff8bcc6e58515892816352
(while this code is in Go, the original discovery code was in C++)
What’s the actual result? (include assertion message & call stack if applicable)
Messages are dropped after the connection comes back.
What’s the expected result?
No messages are dropped.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 15 (7 by maintainers)
@d4l3k I was a bit in a hurry earlier - to be more specific: if the TCP connection dies for any reason after a message was sent by the application thread and it was processed by the I/O thread and given to the kernel, and the kernel cannot recover the connection and the buffer, then that message will be lost. There are no “overlay” ACK/NACK/sequence numbers in libzmq on top of TCP - it’s beyond its scope, and a bit redundant.
A protocol written using libzmq can employ seq numbers. This is typically done for PUB/SUB, which is not reliable by default. One could adapt the same monotonic seq number + asynchronous back channel for replays for other patterns, if TCP really starts breaking all the time. In general it should be the job of firewalls to avoid hostile RSTs being spammed to unsuspecting applications.
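To make the sequence-number idea above concrete, here is a minimal, transport-agnostic sketch in Python. It is not part of libzmq and not the maintainers' implementation: the class names `ReplaySender` and `GapDetector`, the `history` parameter, and the `(seq, payload)` framing are all hypothetical choices made for illustration. The assumption is that the tagged messages travel over the PUSH/PULL pair and that the list of missing sequence numbers goes back to the sender over a separate back channel.

```python
from collections import OrderedDict


class ReplaySender:
    """Sender side (hypothetical helper): tags each payload with a monotonic
    sequence number and keeps recent messages so they can be replayed."""

    def __init__(self, history=1000):
        self.next_seq = 0
        self.history = history      # how many sent messages to keep for replay
        self.sent = OrderedDict()   # seq -> payload, oldest first

    def wrap(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.sent[seq] = payload
        if len(self.sent) > self.history:
            self.sent.popitem(last=False)  # evict the oldest buffered message
        return (seq, payload)

    def replay(self, missing):
        # Only messages still in the buffer can be replayed; evicted ones are gone.
        return [(s, self.sent[s]) for s in missing if s in self.sent]


class GapDetector:
    """Receiver side (hypothetical helper): delivers payloads in order and
    reports which sequence numbers are missing."""

    def __init__(self):
        self.expected = 0
        self.pending = {}  # out-of-order messages held until the gap is filled

    def receive(self, seq, payload):
        """Return (deliverable_payloads, missing_seqs)."""
        if seq < self.expected:
            return [], []  # duplicate of something already delivered
        self.pending[seq] = payload
        delivered = []
        while self.expected in self.pending:
            delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        missing = []
        if self.pending:
            # Everything between the next expected seq and the highest seen one.
            missing = [s for s in range(self.expected, max(self.pending))
                       if s not in self.pending]
        return delivered, missing
```

In use, the sender would call `wrap()` before each send, the receiver would pass every incoming pair to `receive()`, and any non-empty `missing` list would be sent back so the sender can answer with `replay(missing)`. Note that this sketch only detects a gap once a later message arrives; if the last messages before a quiet period are lost, something like a periodic heartbeat carrying the latest sequence number would also be needed.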