libzmq: TCP reset causing dropped messages with PUSH/PULL

Issue description

The docs say that the TCP transport is reliable; however, I’m seeing dropped messages when using iptables to forcibly reset the TCP connection. Just dropping the packets works as expected, but messages are lost when the connection fully drops. http://api.zeromq.org/4-2:zmq-tcp

We found this issue in a production environment when a bad kernel configuration started sending TCP reset packets.

Is this expected behavior, and is there any way to make message sending completely reliable?

Environment

libzmq version (commit hash if unreleased): zeromq 4.3.1
OS: Linux

Minimal test code / Steps to reproduce the issue

Create a PUSH/PULL pair with one thread sending messages and one receiving them, then forcibly interrupt the connection with TCP resets.

Run:

iptables -I OUTPUT -p TCP --dport 5333 -j REJECT --reject-with tcp-reset 

... wait a second

iptables -F
iptables -X

Code and example output:

https://gist.github.com/d4l3k/d3e6d40495ff8bcc6e58515892816352

(while this code is in Go, the original discovery code was in C++)

What’s the actual result? (include assertion message & call stack if applicable)

Messages are dropped after the connection comes back.

What’s the expected result?

No messages are dropped.

About this issue

State: closed
Created 5 years ago
Comments: 15 (7 by maintainers)

Most upvoted comments

@d4l3k I was a bit in a hurry earlier - to be more specific: if the TCP connection dies for any reason after a message was sent by the application thread and it was processed by the I/O thread and given to the kernel, and the kernel cannot recover the connection and the buffer, then that message will be lost. There are no “overlay” ACK/NACK/sequence numbers in libzmq on top of TCP - it’s beyond its scope, and a bit redundant.

A protocol written using libzmq can employ seq numbers. This is typically done for PUB/SUB, which is not reliable by default. One could adapt the same monotonic seq number + asynchronous back channel for replays for other patterns, if TCP really starts breaking all the time. In general it should be the job of firewalls to avoid hostile RSTs being spammed to unsuspecting applications.

bluca on Feb 5, 2019
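
The sequence-number idea from the comment above can be sketched roughly as follows. This is only an illustration, not code from libzmq or from the linked gist: the class names `SequencedSender` and `GapDetectingReceiver`, the `history_size` parameter, and the in-memory "transport" are all hypothetical. In a real setup the `(seq, payload)` pairs would travel over the PUSH/PULL sockets, and the replay request would go over a separate back channel.

```python
from collections import OrderedDict


class SequencedSender:
    """Tags each payload with a monotonic sequence number and keeps a
    bounded history so that lost messages can be replayed on request."""

    def __init__(self, history_size=1024):  # hypothetical default
        self.next_seq = 0
        self.history = OrderedDict()  # seq -> payload, oldest first
        self.history_size = history_size

    def wrap(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = payload
        if len(self.history) > self.history_size:
            self.history.popitem(last=False)  # evict the oldest entry
        return seq, payload

    def replay(self, seqs):
        # Only messages still held in the history can be replayed.
        return [(s, self.history[s]) for s in seqs if s in self.history]


class GapDetectingReceiver:
    """Tracks the next expected sequence number and records gaps."""

    def __init__(self):
        self.expected = 0
        self.missing = set()
        self.delivered = {}  # seq -> payload

    def accept(self, seq, payload):
        """Store a message and return the seqs newly found to be missing."""
        new_missing = []
        if seq >= self.expected:
            # Everything between the expected seq and this one was skipped.
            new_missing = list(range(self.expected, seq))
            self.missing.update(new_missing)
            self.expected = seq + 1
        elif seq in self.missing:
            self.missing.discard(seq)  # a replayed message filled a gap
        else:
            return []  # duplicate, ignore
        self.delivered[seq] = payload
        return new_missing


# Simulate a connection reset that swallows messages 2 and 3.
sender = SequencedSender()
receiver = GapDetectingReceiver()
frames = [sender.wrap(f"msg-{i}".encode()) for i in range(6)]
lost_in_transit = {2, 3}

to_replay = []
for seq, payload in frames:
    if seq in lost_in_transit:
        continue  # never reaches the receiver
    to_replay.extend(receiver.accept(seq, payload))

# The "back channel": the receiver asks for the gaps, the sender replays them.
for seq, payload in sender.replay(to_replay):
    receiver.accept(seq, payload)
```

Note that a gap is only noticed once a later message arrives, so a loss at the very end of a stream would go undetected in this sketch; a periodic heartbeat carrying the latest sequence number is one way to cover that case.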