go: x/crypto/ssh: deadlock during key renegotiation
What version of Go are you using (go version)?
1.7.x - though in this case the more relevant bit is probably the sha of x/crypto/ssh: a8a4eed
What operating system and processor architecture are you using (go env)?
Linux, amd64
What did you do?
Establish a long-lived connection used for SSH tunneling, including sending arbitrary amounts of data over the wire (e.g. ~200MB sent every now and then via the same connection).
What did you expect to see?
SSH traffic should continue unhindered over an arbitrary amount of time with arbitrary amounts of data flowing over the connection.
What did you see instead?
I’ve observed that after some time all communication over SSH will hang. The connection is still “live” in the TCP sense, but SSH-level keepalives just block waiting for a reply.
However, I suspect this is actually related to the amount of data transferred rather than to elapsed time; the default 1GB transfer threshold for key renegotiation sounds very likely given the stack inspection below.
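For context, that threshold is configurable: the package's `Config` struct (embedded in `ClientConfig`/`ServerConfig`) exposes a `RekeyThreshold` field. A sketch of raising it as a stopgap while debugging (the value here is illustrative, and this only delays the hang rather than fixing the underlying problem):

```go
import "golang.org/x/crypto/ssh"

// Auth, host key checking, etc. elided for brevity.
cfg := &ssh.ClientConfig{}

// Rekey after 4 GiB of traffic instead of the 1 GB default.
cfg.RekeyThreshold = 4 << 30
```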
Debugging
Context
(You can skip this if you just want the nitty-gritty.)
This package is used in a CI system wherein all workers register themselves with the master via a custom-built SSH server. They’ll open a reverse SSH tunnel so that they can be used from a private network - the SSH server will act as a proxy to the workers. These SSH connections are arbitrarily long lived, and will have a large amount of data transferred over them as various CI resources (i.e. git repos, build artifacts) are streamed to and from the workers.
Procedure
In my case both ends (client and server) are implemented using this package. I suspect this is a server-side issue, as bouncing the server causes all clients to immediately notice and reconnect. I’ve also looked at the stack dump of the server and (I think) found a smoking gun for the deadlock.
Code sleuthing
I observed that for every hanging client there was one goroutine stuck in chan send for the duration that the worker had been gone. The following is my theory as to how that got there:
- `(*handshakeTransport).readOnePacket` and `(*handshakeTransport).writePacket` check whether the number of bytes transferred exceeds `t.config.RekeyThreshold`, which defaults to 1 gigabyte.
- This means that after 1GB of data transfer, the `crypto/ssh` package will initiate a new key exchange and set `t.sentInitMsg` to a non-nil value.
- `(*handshakeTransport).writePacket` has a loop on `t.sentInitMsg != nil` using a condition variable (i.e. it’s waiting for that to become `nil`, meaning the kex has finished, or for there to be a write error that will cause it to give up).
- `t.sentInitMsg` only becomes nil in a call to `(*handshakeTransport).readOnePacket`.
- The only caller of `readOnePacket` is `(*handshakeTransport).readLoop`.
- In our stack, we observed `readLoop` is blocked on writing to the `incoming` channel, so it cannot call `readOnePacket`.
- Interestingly, `incoming` is a buffered channel (length 16) - however in this case that didn’t help, so the buffer must be full, or no one can read from it.
- Who reads from `t.incoming`? `(*handshakeTransport).readPacket` does.
- Who calls `readPacket`? `(*mux).onePacket`.
- Who calls `onePacket`? `(*mux).loop` - which isn’t too interesting though, since it literally just keeps calling that method as long as it works. So let’s go back into `onePacket` and see what it’s doing after reading the packet, as that’d be why it’s not reading the next one.
- In our case, it was in the call to `(*channel).handlePacket` - so what’s that doing?
- In our case, it’s blocked on a call to `(*channel).sendMessage`. Things are getting interesting here: our read loop is now writing.
- Following the stack down, this ends up back in `(*handshakeTransport).writePacket`! Which is dependent on us reading to succeed, as `readOnePacket` is what completes the key exchange.
About this issue
- State: closed
- Created 8 years ago
- Comments: 20 (12 by maintainers)
fixed in https://github.com/golang/crypto/commit/2e74c773682f59dc50a56475f7918dd8fa6dcaf8