go: x/crypto/ssh: deadlock during key renegotiation

What version of Go are you using (go version)?

1.7.x - though in this case the more relevant detail is probably the commit SHA of x/crypto/ssh: a8a4eed

What operating system and processor architecture are you using (go env)?

Linux, amd64

What did you do?

Establish a long-lived connection used for SSH tunneling, including sending arbitrary amounts of data over the wire (e.g. ~200MB sent every now and then via the same connection).
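For concreteness, a minimal sketch of the usage pattern (the addresses, credentials, payload size, and interval are illustrative, not from the real system):

```go
package main

import (
	"crypto/rand"
	"io"
	"log"
	"time"

	"golang.org/x/crypto/ssh"
)

func main() {
	config := &ssh.ClientConfig{
		User: "worker",
		Auth: []ssh.AuthMethod{ssh.Password("example-password")},
		// Required by newer versions of the package; don't do this in production.
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
	}
	client, err := ssh.Dial("tcp", "master.example.com:2222", config)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	for {
		// Push ~200MB over the same connection every so often; after roughly
		// 1GB total the library initiates a key re-exchange.
		session, err := client.NewSession()
		if err != nil {
			log.Fatal(err)
		}
		stdin, err := session.StdinPipe()
		if err != nil {
			log.Fatal(err)
		}
		if err := session.Start("cat > /dev/null"); err != nil {
			log.Fatal(err)
		}
		if _, err := io.CopyN(stdin, rand.Reader, 200<<20); err != nil {
			log.Fatal(err)
		}
		stdin.Close()
		session.Wait()
		session.Close()
		time.Sleep(10 * time.Minute)
	}
}
```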

What did you expect to see?

SSH traffic should continue unhindered over an arbitrary amount of time with arbitrary amounts of data flowing over the connection.

What did you see instead?

I’ve observed that after some time all communication over SSH will hang. The connection is still “live” in the TCP sense, but SSH-level keepalives just block waiting for a reply.

I suspect this is actually tied to the amount of data transferred rather than to elapsed time, however; the default of triggering key renegotiation after 1GB of transfer looks like the likely culprit given the stack inspection below.
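For reference, the keepalives are sent via ssh.Conn.SendRequest; the request name follows the usual OpenSSH convention and the interval here is illustrative. With wantReply set, the call blocks until the peer answers, which is exactly what never returns once the connection wedges:

```go
package main

import (
	"time"

	"golang.org/x/crypto/ssh"
)

// keepalive pings the peer periodically; once the transport wedges during a
// key re-exchange, SendRequest blocks here indefinitely.
func keepalive(conn ssh.Conn, done <-chan struct{}) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if _, _, err := conn.SendRequest("keepalive@openssh.com", true, nil); err != nil {
				return
			}
		case <-done:
			return
		}
	}
}
```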

Debugging

Context

(You can skip this if you just want the nitty-gritty.)

This package is used in a CI system in which all workers register themselves with the master via a custom-built SSH server. They open a reverse SSH tunnel so that they can be reached from a private network - the SSH server acts as a proxy to the workers. These SSH connections are arbitrarily long-lived, and a large amount of data is transferred over them as various CI resources (e.g. git repos, build artifacts) are streamed to and from the workers.

Procedure

In my case both ends (client and server) are implemented using this package. I suspect this is a server-side issue, since bouncing the server makes all clients immediately notice and reconnect. I've also looked at the stack dump of the server and (I think) found a smoking gun for the deadlock.
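This isn't from the issue itself, but for completeness, one standard way to capture such a goroutine dump from a long-running server (the debug port is arbitrary; sending SIGQUIT works just as well):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers
)

func main() {
	// In the real server this would run alongside the SSH listener.
	// Fetch full goroutine stacks with:
	//   curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```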

Code sleuthing

I observed that for every hanging client there was one goroutine stuck in a chan send for the entire time that the worker had appeared gone. The following is my theory of how it got there:

  • (*handshakeTransport).readOnePacket and (*handshakeTransport).writePacket check whether the number of bytes transferred exceeds t.config.RekeyThreshold, which defaults to 1 gigabyte (see the RekeyThreshold sketch after this list)
  • this means that after 1GB of data transfer, the crypto/ssh package will initiate a new key exchange and set t.sentInitMsg to a non-nil value
  • (*handshakeTransport).writePacket loops on t.sentInitMsg != nil using a condition variable (i.e. it waits either for that to become nil, meaning the kex has finished, or for a write error that makes it give up)
  • t.sentInitMsg only becomes nil in a call to (*handshakeTransport).readOnePacket
  • the only caller of readOnePacket is (*handshakeTransport).readLoop
  • in our stack, we observed readLoop is blocked on writing to the incoming channel, so it cannot call readOnePacket
  • interestingly, incoming is a buffered channel (capacity 16) - in this case that didn’t help, so the buffer must be full because nothing is draining it
  • who reads from t.incoming? (*handshakeTransport).readPacket does.
  • who calls readPacket? (*mux).onePacket
  • who calls onePacket? (*mux).loop - which isn’t too interesting on its own, since it just keeps calling that method as long as it succeeds. So let’s go back into onePacket and see what it does after reading a packet, as that would be why it isn’t reading the next one.
  • in our case, it was in the call to (*channel).handlePacket - so what’s that doing?
  • in our case, it’s blocked on a call to (*channel).sendMessage. things are getting interesting here: our read loop is now writing.
  • following the stack down, this ends up back in (*handshakeTransport).writePacket! That call depends on the read side making progress, since readOnePacket is what completes the key exchange - a circular wait (a boiled-down model of the cycle follows).
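To make the cycle easier to see in one place, here is a heavily simplified model. This is not the actual crypto/ssh code; the names just loosely mirror handshakeTransport so the stack traces map onto it:

```go
package model

import "sync"

type transport struct {
	mu          sync.Mutex
	cond        *sync.Cond  // signalled when the key exchange completes
	sentInitMsg []byte      // non-nil while a key exchange is in flight
	incoming    chan []byte // buffered (16), feeds packets to the mux
}

// writePacket: once a rekey has started, every writer parks here until
// readOnePacket sees the peer finish the exchange and clears sentInitMsg.
func (t *transport) writePacket(p []byte) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	for t.sentInitMsg != nil {
		t.cond.Wait() // (1) writer blocked waiting for the kex to finish
	}
	// ... encrypt and send p ...
	return nil
}

// readLoop is the only goroutine that can complete the kex, but before it can
// read the next packet it must hand the previous one to the mux.
func (t *transport) readLoop() {
	for {
		p := t.readOnePacket() // clears sentInitMsg once the kex is done
		t.incoming <- p        // (2) reader blocked: buffer full, mux not draining
	}
}

// The mux goroutine drains t.incoming, but handling a packet can itself
// require a write (in our stack, (*channel).handlePacket -> sendMessage),
// which re-enters writePacket and parks at (1). With readLoop stuck at (2),
// sentInitMsg can never become nil again: a three-way circular wait.
func (t *transport) readOnePacket() []byte {
	// ... on seeing the kex reply: t.sentInitMsg = nil; t.cond.Broadcast() ...
	return nil
}
```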
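As a stop-gap (not a fix for the deadlock itself), renegotiation can be made much less frequent by raising RekeyThreshold on both sides. RekeyThreshold is a real field on ssh.Config; the value below is arbitrary, and this only delays the problem rather than removing it:

```go
package main

import "golang.org/x/crypto/ssh"

func newServerConfig() *ssh.ServerConfig {
	return &ssh.ServerConfig{
		Config: ssh.Config{
			// Defaults to ~1 GiB; 64 GiB is an arbitrary "large" value.
			// The same field exists on ssh.ClientConfig via the embedded ssh.Config.
			RekeyThreshold: 64 << 30,
		},
	}
}
```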

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Comments: 20 (12 by maintainers)
