quinn: Hang during Connecting.await for incoming connections

Running my previous test case further surfaces two more issues:

  • ConnectionError::Reset on line 82. This may be a bug, but like the ApplicationClosed error in my previous example it doesn’t block progress, so I’ve ignored it;
  • A hang in Connecting.await on line 103. strace shows packets being sent and received, but this .await never returns.

I used the following to simulate an unreliable network on Linux:

tc qdisc add dev lo root netem delay 5ms 10ms 25% distribution normal loss 5% duplicate 5% reorder 40% 50%

Note that netem seems to be buggy on kernels < 4.18.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 17 (6 by maintainers)

Most upvoted comments

I’m pretty sure I see what’s happening with the hang:

  1. The client’s first packet is being duplicated, and the duplicate arrives significantly later (about 22ms in one example). In that interval the original packet creates a connection which goes through the full life cycle and is ultimately closed and forgotten.
  2. The duplicate arrives. The server has no way to distinguish it from a genuine fresh connection attempt, so it proceeds with the full handshake procedure, including yielding a new Connecting future to the application. This pattern can be identified easily by looking for traces where initial_dcid=foo appears once and icid=foo appears twice.
  3. The client endpoint no longer has state for this connection, so it rejects the server’s handshake packets with a stateless reset.
  4. The server doesn’t have a current reset token from the client, so it cannot recognize the stateless resets, and responds with its own stateless resets. This cycle continues until the stateless reset size drops below the critical threshold and no longer prompts a response.
  5. Having received no acknowledgement, the server retransmits the handshake messages. This restarts the above cycle.
  6. The server’s retransmits add up until they’re halted by anti-amplification.
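
To illustrate why the reset exchange in step 4 dies out on its own, here is a minimal Rust sketch. It is not quinn code. It assumes every stateless reset is exactly one byte shorter than the packet that triggered it, and it takes 21 bytes as the smallest valid reset (the figure from RFC 9000; the draft in use here and quinn's actual threshold may differ). The names MIN_RESET_LEN, reset_reply_len and count_reset_exchanges are made up for the example.

```rust
/// Smallest datagram that can still be a stateless reset.
/// Assumption: 21 bytes, as in RFC 9000; the real threshold may differ.
const MIN_RESET_LEN: usize = 21;

/// Length of the reset sent in reply to a packet of `trigger_len` bytes,
/// or `None` if the trigger is too small to be answered.
/// Assumption: each reply is exactly one byte shorter than its trigger.
fn reset_reply_len(trigger_len: usize) -> Option<usize> {
    let reply = trigger_len.checked_sub(1)?;
    if reply >= MIN_RESET_LEN {
        Some(reply)
    } else {
        None
    }
}

/// Count how many resets bounce between two peers that both lack
/// connection state, starting from one packet of `first_len` bytes.
fn count_reset_exchanges(first_len: usize) -> usize {
    let mut len = first_len;
    let mut sent = 0;
    while let Some(next) = reset_reply_len(len) {
        sent += 1;
        len = next;
    }
    sent
}

fn main() {
    for first_len in [100, 22, 21] {
        println!(
            "{first_len}-byte trigger -> {} resets",
            count_reset_exchanges(first_len)
        );
    }
}
```

Under these assumptions a 100-byte trigger produces 79 resets before the exchange stops, and a 21-byte trigger produces none. The point is only that a strictly shrinking reply guarantees termination; each retransmission in step 5 starts a fresh run.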

At this point the server cannot take any further action on the connection initiated by the duplicated packet. Because you’ve disabled the idle timeout, the connection is permanently hung. This is working as intended.
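
The anti-amplification halt can be sketched roughly as follows. This is an illustration, not quinn's implementation: it assumes the usual QUIC rule that a server may send at most three times the bytes it has received from an address it has not yet validated, and that only the single duplicated datagram is credited to the zombie connection. AmplificationBudget and its methods are hypothetical names.

```rust
/// Tracks how much a server may send to a not-yet-validated client address.
/// Hypothetical type for illustration only.
struct AmplificationBudget {
    received: u64,
    sent: u64,
}

impl AmplificationBudget {
    /// Assumption: the 3x limit QUIC applies before address validation.
    const FACTOR: u64 = 3;

    fn new() -> Self {
        Self { received: 0, sent: 0 }
    }

    /// Credit bytes received from the peer to the budget.
    fn on_receive(&mut self, bytes: u64) {
        self.received += bytes;
    }

    /// Returns true and records the send if it fits in the budget;
    /// returns false (and sends nothing) once the limit is reached.
    fn try_send(&mut self, bytes: u64) -> bool {
        if self.sent + bytes <= Self::FACTOR * self.received {
            self.sent += bytes;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut budget = AmplificationBudget::new();
    budget.on_receive(1200); // the duplicated first packet

    let mut datagrams = 0;
    while budget.try_send(1200) {
        datagrams += 1;
    }
    println!("sent {datagrams} datagrams before being blocked");
}
```

With nothing further credited from the client, the budget never grows again, so without an idle timeout nothing ever moves the connection out of this state.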

In summary, the idle timeout must not be disabled in environments where a client might disappear unexpectedly or packets may be duplicated, unless some other mechanism exists to clean up zombie connections. I’ll prepare a PR to update the documentation to clarify this.
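
If the idle timeout really has to stay disabled, the "other mechanism" could be as simple as application-side bookkeeping like the sketch below. Everything here is hypothetical (IdleReaper, the u64 connection ids, the 10-second limit) and it is not part of quinn; in most cases leaving the idle timeout enabled in the transport configuration is the simpler fix.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical application-side table recording when each connection
/// last made progress. Not part of quinn.
struct IdleReaper {
    timeout: Duration,
    last_activity: HashMap<u64, Instant>,
}

impl IdleReaper {
    fn new(timeout: Duration) -> Self {
        Self { timeout, last_activity: HashMap::new() }
    }

    /// Record that connection `id` made progress at `now`.
    fn touch(&mut self, id: u64, now: Instant) {
        self.last_activity.insert(id, now);
    }

    /// Remove and return (sorted) every connection that has been silent
    /// for at least `timeout`. The caller is expected to close them.
    fn reap(&mut self, now: Instant) -> Vec<u64> {
        let timeout = self.timeout;
        let mut dead = Vec::new();
        self.last_activity.retain(|id, seen| {
            let alive = now.duration_since(*seen) < timeout;
            if !alive {
                dead.push(*id);
            }
            alive
        });
        dead.sort_unstable();
        dead
    }
}

fn main() {
    let start = Instant::now();
    let mut reaper = IdleReaper::new(Duration::from_secs(10));
    reaper.touch(1, start); // zombie: never heard from again
    reaper.touch(2, start);
    reaper.touch(2, start + Duration::from_secs(8)); // still active
    let dead = reaper.reap(start + Duration::from_secs(12));
    println!("reaped: {dead:?}");
}
```

In this run connection 1 has been silent for 12 seconds and is reaped, while connection 2 was active 4 seconds ago and is kept.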

I suspect this might be due to the somewhat dubious handshake state machine in draft 24. I’m going to try to get us updated to draft 27 and then revisit.