gorouter: Gorouter likely to lose routes when NATS VM crashes hard

By hard-crashing NATS VMs we found that gorouter (as well as other components, like CC) fails to fail over quickly to the other NATS servers in the cluster. This does not happen if only the NATS process crashes (in that case the OS on the NATS VM closes the sockets); it happens only when the whole VM crashes.

While other components are also affected, gorouter is the worst affected because this problem leads to lost traffic.

All of our gorouter timeout settings are the defaults. We’re using cf-release 220; both NATS and gorouter run on OpenStack, and the stemcell in use is 3215.

In practice we see that roughly 100-110s after the crash of the NATS VM the routers that were connected to that specific NATS VM forget the routes. After a few minutes the disconnection is finally detected and a new connection established.

We’re still investigating and confirming, but the most likely culprit is the pingInterval setting: https://github.com/cloudfoundry/gorouter/blob/64cf29ce174e04a955b883e0d270e158d67bd176/config/config.go#L190-L203. It’s true that the NATS client detects the disconnection after two failed ping attempts, but it’s not correct to say that the disconnection happens within 2*PingInterval (as the comment above that code says). The client disconnects only after 2 failed attempts, and since the crash can fall anywhere between two pings, this takes between 2*PingInterval and 3*PingInterval (see https://github.com/nats-io/nats/blob/6f8a5734602782ce3ab3874b474a1b215acc7ed8/nats.go#L2186-L2193).

To be on the safe side, we should also consider that reconnection is not immediate (it should be fast, but it is not instantaneous). My guess is that the full, worst-case reconnection time is 3*PingInterval + (NumNATSServers-1)*NATSConnectionTimeout, and that this should be lower than dropletstalethresholdinseconds/2 - startresponsedelayintervalinseconds. This is the bare minimum; we may even want to add a small extra time buffer to avoid playing with fire.

Until this is fixed, a crash of one of the NATS VMs will most likely result in lost traffic.

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 17 (7 by maintainers)

Most upvoted comments

@flawedmatrix are you sure #140 will fix the issue? Because it looks like that will only kick in if the natsClient thinks we’re not connected (https://github.com/cloudfoundry/gorouter/pull/140/files#diff-7ddfb3e035b42cd70649cc33393fe32cR91). The problem described in this ticket is that the natsClient thinks we’re still connected, even though the NATS server died.

@sharms @CAFxX

We are closing this issue because we pulled in a fix: https://github.com/cloudfoundry/gorouter/pull/140

This is already part of version 0.137.0 of routing-release and will be part of the next final release of CF Release.

Regards, Shash && Edwin