kubernetes: (1.17) Kubelet won't reconnect to Apiserver after NIC failure (use of closed network connection)
We’ve just upgraded our production cluster to 1.17.2.
Since the update on Saturday, we’ve had this strange outage: after a NIC bond failure (which recovers shortly afterwards), kubelet has all of its connections broken and won’t try to re-establish them unless it is manually restarted.
Here is the timeline of the last time it occurred:
01:31:16: the kernel detects a failure on the bond interface. It stays down for a while and eventually recovers.
Jan 28 01:31:16 baremetal044 kernel: bond-mngmt: link status definitely down for interface eno1, disabling it
...
Jan 28 01:31:37 baremetal044 systemd-networkd[1702]: bond-mngmt: Lost carrier
Jan 28 01:31:37 baremetal044 systemd-networkd[1702]: bond-mngmt: Gained carrier
Jan 28 01:31:37 baremetal044 systemd-networkd[1702]: bond-mngmt: Configured
As expected, all watches are closed. The message is the same for all of them:
...
Jan 28 01:31:44 baremetal044 kubelet-wrapper[2039]: W0128 04:31:44.352736 2039 reflector.go:326] object-"namespace"/"default-token-fjzcz": watch of *v1.Secret ended with: very short watch: object-"namespace"/"default-token-fjzcz": Unexpected watch close - watch lasted less than a second and no items received
...
So these messages begin:
`Jan 28 01:31:44 baremetal44 kubelet-wrapper[2039]: E0128 04:31:44.361582 2039 desired_state_of_world_populator.go:320] Error processing volume "disco-arquivo" for pod "pod-bb8854ddb-xkwm9_namespace(8151bfdc-ec91-48d4-9170-383f5070933f)": error processing PVC namespace/disco-arquivo: failed to fetch PVC from API server: Get https://apiserver:443/api/v1/namespaces/namespace/persistentvolumeclaims/disco-arquivo: write tcp baremetal44.ip:42518->10.79.32.131:443: use of closed network connection`
I’m guessing that shouldn’t be a problem for a short while, but it never recovers. The event happened at 01:31 AM, and we had to manually restart kubelet around 09:00 to get things back to normal.
# journalctl --since '2020-01-28 01:31' | fgrep 'use of closed' | cut -f3 -d' ' | cut -f1 -d':' | sort | uniq -dc
9757 01
20663 02
20622 03
20651 04
20664 05
20666 06
20664 07
20661 08
16655 09
3 10
The apiservers were up and running, all other nodes were up and running, and everything else was pretty uneventful. This node was the only one affected (today) by the problem.
Is there any way to mitigate this kind of event?
Would this be a bug?
About this issue
- State: closed
- Created 4 years ago
- Reactions: 45
- Comments: 163 (85 by maintainers)
Commits related to this issue
- scripts: Restart kubelet on all nodes during restore For kubelet version 1.16+ in case of connection failure kubelet will not reconnect until service restart, so restart kubelet during restore script... — committed to scality/metalk8s by TeddyAndrieux 4 years ago
- kube-apiserver: disable http2 There are several kubernetes bugs [0,1,2] involving connection problems that seem related to the Go net/http2 library, where the stream state and connection state can ge... — committed to airshipit/promenade by philsphicas 4 years ago
- Update golang.org/x/net to 08b38378de70 Fixing #87615 requires update to x/net. $ ./hack/pin-dependency.sh golang.org/x/net 08b38378de70 $ ./hack/lint-dependencies.sh // Required by lint dependency ... — committed to povsister/kubernetes by povsister 4 years ago
- Bump Golang to 1.16 Golang 1.13 has horrible bugs that might have affected MetalLB as it's using client-go / watches, see https://github.com/kubernetes/kubernetes/issues/87615 Newer client-go version... — committed to champtar/metallb by champtar 3 years ago
- Bump Golang to 1.16 Golang 1.13 has horrible bugs that might have affected MetalLB as it's using client-go / watches, see https://github.com/kubernetes/kubernetes/issues/87615 Newer client-go version... — committed to metallb/metallb by champtar 3 years ago
- Bump Golang to 1.16 Golang 1.13 has horrible bugs that might have affected MetalLB as it's using client-go / watches, see https://github.com/kubernetes/kubernetes/issues/87615 Newer client-go version... — committed to cilium/metallb by champtar 3 years ago
- test/contrib: Bump CoreDNS version to 1.8.3 As reported in [1], Go's HTTP2 client < 1.16 had some serious bugs which could result in lost connections to kube-apiserver. Worse than this was that the c... — committed to cilium/cilium by brb 3 years ago
- test/contrib: Bump CoreDNS version to 1.8.3 [ upstream commit 398d55cd94c0e16dc19b03c53f7b5040c1dd8f13 ] As reported in [1], Go's HTTP2 client < 1.16 had some serious bugs which could result in lost... — committed to cilium/cilium by brb 3 years ago
I’ve fixed it by running this bash script every 5 minutes:
After upgrading to Kubernetes 1.16, we also started seeing the `use of closed network connection` error and kubelet not reconnecting to the apiserver, leaving nodes stuck in NotReady. We weren’t able to reproduce the issue by taking down NICs (by setting the links down/up), but we did notice this behavior only happened on clusters that were more heavily loaded. We did more digging and found that the server-side default in golang is 250 HTTP/2 streams per client, while the client-side default is 1000, so my guess is that once kubelet got an error from the apiserver for hitting the HTTP/2 stream limit, it never tried to reconnect. After setting `--http2-max-streams-per-connection=1000` we didn’t see nodes getting stuck in NotReady as much as we originally did during testing. This didn’t resolve the issue of kubelet not reconnecting, but it did help us mitigate the problem we were seeing.
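For context, the 250-stream default mentioned above is the behavior of Go's `golang.org/x/net/http2` server when `MaxConcurrentStreams` is left at zero; kube-apiserver itself exposes this only through the `--http2-max-streams-per-connection` flag. Below is a minimal, hedged sketch of raising the limit on a plain Go HTTPS server; the handler, address, and certificate paths are placeholders:

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	srv := &http.Server{Addr: ":8443", Handler: mux}

	// With MaxConcurrentStreams left at 0, golang.org/x/net/http2 falls back
	// to its own default of 250 streams per client connection.
	if err := http2.ConfigureServer(srv, &http2.Server{
		MaxConcurrentStreams: 1000,
	}); err != nil {
		log.Fatal(err)
	}

	// Certificate paths are placeholders.
	log.Fatal(srv.ListenAndServeTLS("server.crt", "server.key"))
}
```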
Hi guys. I think I have finally tracked down this issue. We had the same issue happen last night, but successfully recovered with a modified kubelet.

It’s not a Kubernetes bug; it’s about golang’s standard `net/http` package, which `client-go` is using too. I believe there is a flaw in `golang.org/x/net/http2/transport.go`.

I have already reported this to the golang project and am waiting for some discussion: https://github.com/golang/go/issues/39750

For now I have modified the code to have the `http2: perform connection health check` introduced by https://github.com/golang/net/commit/0ba52f642ac2f9371a88bfdde41f4b4e195a37c0 enabled by default. It proves to be of some help with this problem, although it responds a little slowly.

kubelet v1.17.6 logs (compiled with the self-modified `golang.org/x/net` package): it did recover from writing to dead connections, but it took a little more time than expected.
Note that `performing http2 healthCheck` is a log message I intentionally left there to prove that the `healthCheck` func is called by `readIdleTimer`.

No more `use of closed network connection` errors are reported and kubelet returns to the Ready state.

I have the same issue with a Raspberry Pi cluster on k8s 1.17.3, very often. Based on some older issues, I set the kube API server HTTP/2 stream limit to 1000 (`--http2-max-streams-per-connection=1000`); it was fine for more than two weeks, but now it has started again.
I know this is silly, but it must work as a temporary workaround:
Expand yaml
The amount of time for `sleep` and `--since` must be less than the cluster’s `pod-eviction-timeout` (5m by default).

BTW, `docker pause nginx-proxy` on a rancher worker node makes kubelet produce the same error message.

I am seeing this same behavior. Ubuntu 18.04.3 LTS clean installs, cluster built with rancher 2.3.4. I have seen this happen periodically lately and just restarting kubelet tends to fix it for me. Last night all 3 of my worker nodes exhibited this same behavior. I corrected 2 to bring my cluster up; the third is still in this state while I’m digging around.
Hi, from my experiments, these two patches: https://go-review.googlesource.com/c/net/+/243257 https://go-review.googlesource.com/c/go/+/243258
from https://github.com/golang/go/issues/40213 are also very important, if we want to be 100% sure.

The reason is:

`func (t *Transport) newClientConn` may fail: https://github.com/golang/net/blob/d523dce5a7f4b994f7ed0531dbe44cd8fd803e26/http2/transport.go#L700-L703

`func (c *addConnCall) run(t *Transport, key string, tc *tls.Conn)` will remember this error and pass it to `upgradeFn` through `func (p *clientConnPool) addConnIfNeeded`, which will return an `erringRoundTripper{err}`: https://github.com/golang/net/blob/d523dce5a7f4b994f7ed0531dbe44cd8fd803e26/http2/client_conn_pool.go#L195-L197 https://github.com/golang/net/blob/d523dce5a7f4b994f7ed0531dbe44cd8fd803e26/http2/client_conn_pool.go#L177-L179 https://github.com/golang/net/blob/d523dce5a7f4b994f7ed0531dbe44cd8fd803e26/http2/transport.go#L192-L195

`erringRoundTripper` is terrible: its `RoundTrip` method does nothing other than return the error: https://github.com/golang/net/blob/d523dce5a7f4b994f7ed0531dbe44cd8fd803e26/http2/transport.go#L2554

So this connection with `erringRoundTripper` must be removed. In `src/net/http/transport.go`, there is `isH2DialError` logic to remove it: https://github.com/golang/go/blob/5cf057ddedfbb149b71c85ec86050431dd6b2d9d/src/net/http/transport.go#L591-L596

However, it uses `pconn.alt.(http2erringRoundTripper)`, which is a different type from `erringRoundTripper`, so this assertion returns false! And `func (t *Transport) removeIdleConn(pconn *persistConn) bool` is not called.

So the next time, in http’s `Transport.roundTrip`, it will still pick up the faulty `pconn` from `Transport.idleConn`. `Transport.idleConn[x][0].alt` is actually a `http2erringRoundTripper`, so it will err out without actually doing anything.

https://github.com/golang/go/blob/5cf057ddedfbb149b71c85ec86050431dd6b2d9d/src/net/http/transport.go#L565-L569 https://github.com/golang/go/blob/5cf057ddedfbb149b71c85ec86050431dd6b2d9d/src/net/http/transport.go#L1329 https://github.com/golang/go/blob/5cf057ddedfbb149b71c85ec86050431dd6b2d9d/src/net/http/transport.go#L1008 https://github.com/golang/go/blob/5cf057ddedfbb149b71c85ec86050431dd6b2d9d/src/net/http/transport.go#L1828-L1831

So the culprit is `pconn.alt.(http2erringRoundTripper)`, which is exactly what https://go-review.googlesource.com/c/net/+/243257 and https://go-review.googlesource.com/c/go/+/243258 fix.
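To make that failing assertion concrete, here is a toy, self-contained sketch: the two struct types below are hypothetical stand-ins for x/net's `erringRoundTripper` and net/http's bundled `http2erringRoundTripper`, which have identical shapes but are distinct types, so the assertion the cleanup path relies on comes back false:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for golang.org/x/net/http2's erringRoundTripper (what an explicitly
// configured h2 transport, as in client-go, actually stores in pconn.alt).
type erringRoundTripper struct{ err error }

// Stand-in for net/http's bundled copy, http2erringRoundTripper, which the
// isH2DialError cleanup in src/net/http/transport.go asserts against.
type http2erringRoundTripper struct{ err error }

func main() {
	var alt interface{} = erringRoundTripper{err: errors.New("h2 dial failed")}

	// Same shape, different concrete type: the assertion fails, so the broken
	// connection is never removed from the idle pool.
	_, ok := alt.(http2erringRoundTripper)
	fmt.Println("assertion matched:", ok) // prints: assertion matched: false
}
```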
Is there a plan to cherry-pick/backport https://github.com/kubernetes/kubernetes/pull/95981 and https://github.com/kubernetes/kubernetes/issues/87615 to the 1.18 release branch?
We have some very good news regarding mitigation of the issue using the new golang http2 health check feature: no issues any more. By now, we have implemented the “fix” (hardcoded setting of the value in vendored `x/net` code) in Prometheus, the entirety of Kubernetes and several internal components, observing:

Furthermore, we have been able to get new insights into how to trigger the issue. I can confirm @vi7’s observation regarding live migration with some confidence (we could trace it down though), and at least with the NSX version we’re running, load balancer changes can also trigger such issues (we have a ticket with VMware to make sure they send reset packets in future). And very likely lots of other reasons dropping connections in between, like connection table overflows.

This is a very annoying and somewhat massive issue for some users of Kubernetes (depending on some kind of “brokenness” of the IaaS layer/networking, I guess). Although there are golang discussions about exposing an interface to properly set the values, do you think there is any chance of getting a PR merged upstream setting those values through reflection (still better than forking x/net like we do right now)? We’re fine with providing the code (and validating the fix; we cannot actually reproduce it, but we observe it often enough to be able to confirm whether the fix works).
If the underlying issue was “http2 using dead connections”, then restarting kubelet should fix the problem. https://github.com/kubernetes/kubernetes/pull/48670 suggested that reducing TCP_USER_TIMEOUT can mitigate the problem. I have opened https://github.com/golang/net/pull/55 to add a client-side connection health check to the http2 library, but it’s going to take more time to land.
If restarting kubelet didn’t solve the issue, then it probably has a different root cause.
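For illustration, TCP_USER_TIMEOUT is a Linux-only socket option (in milliseconds) that bounds how long written-but-unacknowledged data may sit on a connection before the kernel drops it. A hedged sketch of setting it on outgoing connections via a custom dialer; the 30-second value and endpoint are arbitrary, and this is not how kubelet itself is configured:

```go
package main

import (
	"log"
	"net"
	"net/http"
	"syscall"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	dialer := &net.Dialer{
		Timeout: 30 * time.Second,
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			err := c.Control(func(fd uintptr) {
				// Drop the connection if written data stays unacknowledged
				// for more than 30 seconds (value is in milliseconds).
				sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_USER_TIMEOUT, 30000)
			})
			if err != nil {
				return err
			}
			return sockErr
		},
	}

	client := &http.Client{
		Transport: &http.Transport{DialContext: dialer.DialContext},
	}

	resp, err := client.Get("https://example.com/") // placeholder endpoint
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("status:", resp.Status)
}
```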
Very impressive analysis. Nice work @KevinWang15 👍 If I understood things correctly, there is no further problem if you are just using the bundled http2 package in Go releases prior to 1.16, because the type assertion only fails with an explicitly configured h2 transport, as Kubernetes uses. So, for Kubernetes, it needs the fix from https://golang.org/cl/243258 on Go’s h1 stack. That hurts a little 😆
Currently we are using the official K8S 1.19.x distribution on our stack, and we have not observed any connection problem with kubelet. Although it is probably very rare that the h2 handshake fails after a successful TLS handshake due to a connection problem in production, eliminating every possible bug is still an instinct for an engineer. Let’s do the painful thing for the greater good.
Thank you @KevinWang15 for the great analysis!
I took a look at https://github.com/golang/go/issues/34978 (reopened as https://github.com/golang/go/issues/40213). A brief summary:
More details of how https://github.com/golang/go/issues/34978 happens:

- The `erringRoundTripper` returned here is wrongly cached in the idle connection pool of the HTTP/1 Transport.
- The `erringRoundTripper` is stuck there forever.
- The `erringRoundTripper` is not a real h2 client connection, so the http2 health check won’t help recover it.

Another piece of good news: Kubernetes does not use the http2 bundled in the standard net/http package, so we do not need to wait for the next Go release. We can directly use https://github.com/golang/net/commit/08b38378de702b893ee869b94b32f833e2933bd2 to fix this issue.
Proposed a fix here: https://github.com/kubernetes/kubernetes/pull/95898. It updates the dependency to the required version and enables the http2 Transport health check by default. It should help applications that use client-go to communicate with the apiserver (e.g. kubelet) get rid of the “app hangs at write tcp xxx: use of closed connection” problem.
Feel free to leave any comments.
Good news: golang/net master has support for configuring http2 transports, which now allows setting the timeouts! https://github.com/golang/net/commit/08b38378de702b893ee869b94b32f833e2933bd2
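A hedged sketch of what using that new API might look like for a client built on a standard `net/http` transport; the timeout values and target URL are illustrative only, and this is not the actual client-go wiring that landed in the Kubernetes fix above:

```go
package main

import (
	"log"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	// Start from a plain HTTP/1 transport, as client-go does.
	t1 := &http.Transport{
		TLSHandshakeTimeout: 10 * time.Second,
	}

	// ConfigureTransports (added in golang.org/x/net commit 08b38378de70)
	// upgrades t1 for h2 and returns the http2.Transport so that its
	// health-check knobs can be set.
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		log.Fatal(err)
	}
	t2.ReadIdleTimeout = 30 * time.Second // send a ping after 30s of idleness
	t2.PingTimeout = 15 * time.Second     // close the connection if no ping reply

	client := &http.Client{Transport: t1}
	resp, err := client.Get("https://example.com/") // placeholder endpoint
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("proto:", resp.Proto)
}
```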
Just another case:
We are seeing this in Kube 1.16.8 clusters as well. Rebooting the VM can be used to bring the node back to a good state (I suspect a kubelet restart would have worked as well).
In our setup, kubelet talks to a local haproxy instance over localhost, which acts as a TCP load balancer to the multiple backend master instances. We are going to investigate if adding

to our load balancer instances helps alleviate the need for the explicit reboot and leads to a full recovery. Example of repeated logs:

Will post an update if that solves our specific problem, in case it helps anyone here in the interim.
Curious if there’s a config parameter to set an absolute upper bound on a watch time? I found --streaming-idle-connection-timeout but nothing specific for watches.
Same here (Ubuntu 18.04, Kubernetes 1.17.3).
@cloud-66 I know; the point here is that a kubelet restart fixes the issue, so this is not fixed in 1.18.18 and you need 1.19+.
Yes… you are correct in saying that I intentionally “introduced a bug” by disturbing the internal state, because it’s very difficult to reproduce “use of closed network connection” on demand (it may require high load, etc.), so I thought as a last resort, I could disturb the internal state to mimic a broken TCP connection… only to verify if the HealthCheck patch could help automatically recover kubelet from the “use of closed network connection” broken state.
I was also reading https://github.com/golang/go/issues/39337 just now. It seems that with https://go-review.googlesource.com/c/net/+/240337, even without the periodic connection health check, broken TCP connections can be `MarkDead`ed pretty quickly, and kubelet can recover. I only just learned about this `http2: close Transport connection on write errors` patch; it seems it is only in the later K8S versions (`1.19.x`/`1.20.x`). We were using K8S 1.18.8, so we suffered from it. I think this patch really helps.

Your comments are really insightful, thank you very much for your work and information on this issue! Really appreciate it 👍 @povsister
I have actually now succeeded in reproducing this reliably:
In my test cluster with 10 nodes, 3 nodes hit this bug when I did this on the first try. So it is still intermittent but seems easy enough to reproduce in clusters with more nodes.
A colleague also pointed out that it seems like all of the logged errors are coming from watchers, so that might be a hint to resolving this.
@zerkms, how long does it take to recover? If it’s < 45s, then it’s expected. Otherwise we have more to debug.
Temporary workaround for those running K8S on VMware vSphere: disable DRS for the K8S VMs. That prevents vSphere from moving VMs among hypervisors, eliminating the network disconnections that cause trouble for the kubelets.
We’re running Prometheus federation for each of our Kubernetes clusters. Prometheus 2.19.0 introduced http2 (they forgot to mention this in the changelog and it was well hidden in a commit message body; I had to git bisect, deploy and wait some hours for each run…) and we observed about a dozen incidents per day with federation stuck. I first patched out http2 support again (and the issue was gone), and then set the read timeout directly in golang/net/x/http2. Since then, we have not had a single federation-down incident.
I’m currently preparing to roll out a patched Kubernetes release on some clusters, so we should have data in a few days. We’ll definitely share our outcomes as soon as we have proper data.
– Jens Erat <jens.erat@daimler.com>, Daimler TSS GmbH
I figured it out: the deciding factor is the golang version. If I use golang 1.13.15 (which is the default golang version for K8S 1.18) to compile kubelet, then kubelet will be stuck at “use of closed network connection” forever, never recovering.
If I use 1.14.15 however, kubelet will recover automatically within 2 minutes.
I haven’t figured out the root cause yet; it may be due to some `~/go/src/net/` code change.

It’s also possible to reproduce without hacking kubelet: we can hack the APIServer instead, but it produces a slightly different error message.
The symptoms are 100% the same: no TCP connection is made to the APIServer, it is stuck forever, and it has to be restarted.
hack: `/usr/local/go/src/net/http/server.go`
@caesarxuchao thanks… this helps me a lot
@caesarxuchao Can you elaborate a little more? I’m confused what #96770, titled “Use k8s.gcr.io/e2e-test-images instead of gcr.io/kubernetes-e2e-test-images for e2e test images”, has to do with this bug.
Did you mean #96778?
Is there a plan to cherry-pick #95981 to 1.17 release branch?
I had this issue, but I’ve now “fixed” it in my small hobby cluster using the workaround in a comment above.
I wrote a small Ansible playbook to deploy the workaround on nodes as a systemd unit and timer; it might save others with a similar setup some time.
I have the same issue in v1.17.3.
What I found is that k8s versions using a specific `golang.org/x/net` version are in trouble, and this package seems to have been fixed: https://go-review.googlesource.com/c/net/+/198040

Version with this problem (v1.16.5 ~ latest release):
golang.org/x/net v0.0.0-20191004110552-13f9640d40b9

Fixed version (master branch):
golang.org/x/net v0.0.0-20200707034311-ab3426394381

Will updating the `golang.org/x/net` package fix this issue? Is there a release planned for maintained k8s versions (v1.16, v1.17, v1.18…) to fix this?
Are there any other workarounds besides the kubelet restart cron job? We have had a cron job in place for a week and it has not stopped the problem from happening.
@warmchang I think this applies to apiextensions apiservers and the sample apiserver: https://github.com/kubernetes/kubernetes/blob/ae1103726f9aea1f9bbad1b215edfa47e0747dce/staging/src/k8s.io/apiserver/pkg/server/options/recommended.go#L62
A test with curl without setting `--http2-max-streams-per-connection` shows this in our apiserver logs (using v1.16):

I0603 10:18:08.038531 1 flags.go:33] FLAG: --http2-max-streams-per-connection="0"

And a curl request shows this in the response:

* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!

When I use `--http2-max-streams-per-connection=1000`, the curl request then shows:

* Connection state changed (MAX_CONCURRENT_STREAMS == 1000)!
I’m seeing an identical issue on 1.16.8: every so often (sometimes once every couple of days, sometimes every couple of weeks) the node becomes NotReady with reason “Kubelet stopped posting node status”, and “use of closed network connection” fills the logs.