kubernetes: kubelet fails to heartbeat with API server due to stuck TCP connections

Is this a BUG REPORT or FEATURE REQUEST?: /kind bug

What happened: operator is running an HA master setup with an LB in front. kubelet attempts to update node status, but tryUpdateNodeStatus wedges. based on the goroutine dump, the wedge happens when it attempts to GET the latest state of the node from the master. operator observed 15-minute intervals between attempts to update node status when the kubelet could not contact the master; assume this is when the LB ultimately closes the connection. the impact is that the node controller then marked the node as lost, and its workload was evicted.

What you expected to happen: expected the kubelet to time out client-side. right now, no kubelet->master communication has a timeout. ideally, the kubelet->master communication would have a timeout derived from the configured node-status-update-frequency, so that no single attempt to update status wedges future attempts.
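For illustration only, a minimal sketch of what such a per-attempt timeout could look like (the function, the wiring, and the use of the current client-go API are assumptions on my part, not the kubelet's actual code):

```go
package heartbeat

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// updateNodeStatusOnce is an illustrative sketch: bound each heartbeat
// attempt by node-status-update-frequency so a wedged GET cannot block
// subsequent attempts. This is not the kubelet's actual implementation.
func updateNodeStatusOnce(client kubernetes.Interface, nodeName string, frequency time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), frequency)
	defer cancel()

	// GET the latest state of the node; with the timeout, a dead connection
	// fails fast instead of waiting ~15 minutes for the LB to drop it.
	_, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	return err
}
```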

How to reproduce it (as minimally and precisely as possible): see above.

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 11
  • Comments: 36 (25 by maintainers)

Most upvoted comments

Indeed: as far as I understand, the behaviour is not undefined, it’s just defined in Linux rather than in Go. I think the Go docs could be clearer on this. Here’s the relevant section from dup(2):

After a successful return from one of these system calls, the old and new file descriptors may be used interchangeably. They refer to the same open file description (see open(2)) and thus share file offset and file status flags; for example, if the file offset is modified by using lseek(2) on one of the descriptors, the offset is also changed for the other.

The two descriptors do not share file descriptor flags (the close-on-exec flag).

My code doesn’t modify flags after obtaining the fd, instead its only use is in a call to setsockopt(2). The docs for that call are fairly clear that it modifies properties of the socket referred to by the descriptor, not the descriptor itself:

getsockopt() and setsockopt() manipulate options for the socket referred to by the file descriptor sockfd.

I agree that the original descriptor being set to blocking mode is annoying. Go’s code is clear that this will not prevent anything from working, just that more OS threads may be required for I/O:

https://github.com/golang/go/blob/516f5ccf57560ed402cdae28a36e1dc9e81444c3/src/net/fd_unix.go#L313-L315
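For context, a minimal sketch of the pattern described above: dup the connection's descriptor via (*net.TCPConn).File() and call setsockopt(2) through it. The option shown, TCP_USER_TIMEOUT, is an assumption for illustration; the actual patch may differ.

```go
package tcpopt

import (
	"net"
	"time"

	"golang.org/x/sys/unix"
)

// setUserTimeout sketches the approach discussed above. TCP_USER_TIMEOUT is
// an assumed example option, not necessarily what the original patch sets.
func setUserTimeout(conn *net.TCPConn, d time.Duration) error {
	f, err := conn.File() // dup'ed fd; before Go 1.11 this also left conn in blocking mode
	if err != nil {
		return err
	}
	defer f.Close() // closes only the duplicate; the connection stays open

	// setsockopt acts on the shared socket, not on the descriptor itself,
	// so the original connection picks up the option as well.
	return unix.SetsockoptInt(int(f.Fd()), unix.IPPROTO_TCP, unix.TCP_USER_TIMEOUT, int(d.Milliseconds()))
}
```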

Given that a single Kubelet (or any other use of client-go) establishes a small number of long-lived connections to the apiservers, and that this will be fixed in Go 1.11, I don’t think this is a significant issue.

I am happy for this to be fixed in another way, but given we know that this works and does not require invasive changes to the apiserver to achieve, I think it is a reasonable solution. I have heard from several production users of Kubernetes that this has bitten them in the same way it bit us.

Since this issue has been re-opened, would there be any value in me re-opening my PR for this commit? Monzo has been running this patch in production since last July and it has eliminated this problem entirely, for all uses of client-go.

We’ve had three major events in the last few weeks that come down to this problem. Watches set up through an ELB node that gets replaced or scaled down cause large numbers of nodes to go not ready for 15 minutes, causing very scary cluster turbulence. (We’ve generally seen between a third and half of the nodes go not ready.) We’re currently evaluating other ways to load balance the API servers for the components we currently send through the ELB (I haven’t pored through everything, but I think that boils down to the kubelet and the proxy, and possibly flannel).

one issue at a time 😃

persistent kubelet heartbeat failure results in all workloads being evicted. kube-proxy network issues are disruptive for some workloads, but not necessarily all.

kube-proxy (and general client-go support) would need a different mechanism, since those components do not heartbeat with the api like the kubelet does. I’d recommend spawning a separate issue for kube-proxy handling of this condition.

This regressed, and was refixed in 1.14.3

See https://github.com/kubernetes/kubernetes/pull/78016

A few notes on these very valid concerns:

  • https://golang.org/pkg/net/#TCPConn.File returns a dup’ed file descriptor, which AFAIK shares all underlying structures in the kernel except for the entry in the file descriptor table, so either can be used with the same results. The program should be careful not to use them simultaneously, though, for exactly the same reasons.
  • today the returned file descriptor is set to blocking mode. This can probably be mitigated by setting it back to non-blocking mode (see the sketch after these notes). In Go 1.11 the returned fd is going to be in the same blocking/non-blocking mode as it was before the .File() call: https://github.com/golang/go/issues/24942
  • Maybe it will not help simple watchers; I am not familiar with the Informers internals, but I was under the impression that they are not only watching but also periodically resyncing state, and these resyncs would trigger outgoing data transfer which would then be detected.
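As a rough sketch of the mitigation mentioned in the second note (an assumption on my part, not code from any existing patch):

```go
package tcpopt

import (
	"net"

	"golang.org/x/sys/unix"
)

// restoreNonblocking sketches the suggested mitigation: before Go 1.11,
// (*net.TCPConn).File() switches the connection into blocking mode, so we
// flip the descriptor back to non-blocking afterwards.
func restoreNonblocking(conn *net.TCPConn) error {
	f, err := conn.File()
	if err != nil {
		return err
	}
	defer f.Close()

	// O_NONBLOCK is a file status flag shared by dup'ed descriptors, so
	// setting it on the duplicate also restores it for the original conn.
	return unix.SetNonblock(int(f.Fd()), true)
}
```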