kubernetes: (1.17) Kubelet won't reconnect to Apiserver after NIC failure (use of closed network connection)

We've just upgraded our production cluster to 1.17.2.

Since the update on Saturday, we've had this strange outage: after a NIC bond failure (which recovers shortly after), Kubelet has all of its connections broken and won't try to re-establish them unless it is manually restarted.

Here is the timeline of the last time it occurred:

01:31:16: The kernel detects a failure on the bond interface. This goes on for a while; eventually the link recovers.

Jan 28 01:31:16 baremetal044 kernel: bond-mngmt: link status definitely down for interface eno1, disabling it
...
Jan 28 01:31:37 baremetal044  systemd-networkd[1702]: bond-mngmt: Lost carrier
Jan 28 01:31:37 baremetal044  systemd-networkd[1702]: bond-mngmt: Gained carrier
Jan 28 01:31:37 baremetal044  systemd-networkd[1702]: bond-mngmt: Configured

As expected, all watches are closed. The message is the same for all of them:

...
Jan 28 01:31:44 baremetal044 kubelet-wrapper[2039]: W0128 04:31:44.352736    2039 reflector.go:326] object-"namespace"/"default-token-fjzcz": watch of *v1.Secret ended with: very short watch: object-"namespace"/"default-token-fjzcz": Unexpected watch close - watch lasted less than a second and no items received
...

So these messages begin:

`Jan 28 01:31:44 baremetal44 kubelet-wrapper[2039]: E0128 04:31:44.361582 2039 desired_state_of_world_populator.go:320] Error processing volume "disco-arquivo" for pod "pod-bb8854ddb-xkwm9_namespace(8151bfdc-ec91-48d4-9170-383f5070933f)": error processing PVC namespace/disco-arquivo: failed to fetch PVC from API server: Get https://apiserver:443/api/v1/namespaces/namespace/persistentvolumeclaims/disco-arquivo: write tcp baremetal44.ip:42518->10.79.32.131:443: use of closed network connection`

I'm guessing that shouldn't be a problem for a while, but it never recovers. The event happened at 01:31 AM, and we had to manually restart Kubelet around 09:00 to get things back to normal.

# journalctl --since '2020-01-28 01:31'   | fgrep 'use of closed' | cut -f3 -d' ' | cut -f1 -d':' | sort | uniq -dc
   9757 01
  20663 02
  20622 03
  20651 04
  20664 05
  20666 06
  20664 07
  20661 08
  16655 09
      3 10

The apiservers were up and running, all other nodes were up and running, and everything else was pretty uneventful. This node was the only one affected (today) by this problem.

Is there any way to mitigate this kind of event?

Would this be a bug?

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 45
  • Comments: 163 (85 by maintainers)

Most upvoted comments

I’ve fixed it by running this bash script every 5 minutes:

#!/bin/bash
# Check only the most recent kubelet log line for the known failure message.
output=$(journalctl -u kubelet -n 1 | grep "use of closed network connection")
if [[ $? != 0 ]]; then
  # grep exited non-zero: the error string was not found in the last log line.
  echo "Error not found in logs"
elif [[ $output ]]; then
  # The error string was found: restart kubelet to force it to reconnect.
  echo "Restart kubelet"
  systemctl restart kubelet
fi

After upgrading to Kubernetes 1.16, we also started seeing the use of closed network connection error and kubelet not reconnecting to the apiserver, leaving nodes stuck in NotReady. We weren’t able to reproduce the issue by taking down NICs (by setting the links down/up), but we did notice this behavior only happened on clusters that were more heavily loaded.

We did more digging and found that the server-side default in golang is 250 http2 streams per connection, while the client-side default is 1000, so my guess is that once kubelet got an error from the apiserver for hitting the http2 stream limit, it never tried to reconnect. After setting --http2-max-streams-per-connection=1000, we didn't see nodes getting stuck in NotReady as much as we originally found during testing. This didn't resolve the issue of kubelet not reconnecting, but it did help us mitigate the issue we were seeing.
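
For context, the 250-stream server-side default comes from the x/net/http2 package itself; below is a minimal Go sketch (port and certificate paths are placeholders) of a server raising the per-connection cap, which is the same knob the kube-apiserver exposes as --http2-max-streams-per-connection:

package main

import (
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	srv := &http.Server{
		Addr:    ":8443",
		Handler: http.NewServeMux(),
	}

	// With MaxConcurrentStreams left at 0, x/net/http2 falls back to its own
	// default of 250 concurrent streams per connection.
	if err := http2.ConfigureServer(srv, &http2.Server{
		MaxConcurrentStreams: 1000,
	}); err != nil {
		panic(err)
	}

	// Certificate paths are placeholders.
	if err := srv.ListenAndServeTLS("server.crt", "server.key"); err != nil {
		panic(err)
	}
}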

Hi guys. I think I have finally tracked down this issue. We had the same issue happen last night, but successfully recovered with a modified kubelet.

It's not a Kubernetes bug; it's about golang's standard net/http package, which client-go uses too. I believe there is a flaw in golang.org/x/net/http2/transport.go.

I have already reported this to the Go project and am waiting for some discussion: https://github.com/golang/go/issues/39750

For now I have modified the code to enable by default the http2 connection health check introduced by https://github.com/golang/net/commit/0ba52f642ac2f9371a88bfdde41f4b4e195a37c0. It proves to be of some help with this problem, but the response is a little slow.
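
For reference, the health check introduced by that commit is driven by two new fields on http2.Transport. A minimal sketch of a client enabling it explicitly (the timeout values are illustrative; the modified kubelet above instead hard-codes them inside the vendored package):

package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// newPingedClient returns an HTTP/2 client whose transport sends a PING frame
// after the connection has been idle for ReadIdleTimeout; if no reply arrives
// within PingTimeout, the dead connection is closed so the next request dials
// a fresh one.
func newPingedClient() *http.Client {
	t := &http2.Transport{
		ReadIdleTimeout: 30 * time.Second, // illustrative values
		PingTimeout:     15 * time.Second,
	}
	return &http.Client{Transport: t}
}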

kubelet v1.17.6 logs (compiled with a self-modified golang.org/x/net package):

It did recover from the problem of writing to dead connections, but it took a little more time than expected.

Note that performing http2 healthCheck is a log message that I intentionally left there to prove that the healthCheck func is being called by the readIdleTimer.

Jun 23 03:14:45 vm10.company.com kubelet[22255]: E0623 03:14:45.912484   22255 kubelet_node_status.go:402] Error updating node status, will retry: error getting node "vm10.company.com": Get "https://vm10.company.com:8443/api/v1/nodes/vm10.company.com?timeout=10s": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:45 vm10.company.com kubelet[22255]: E0623 03:14:45.912604   22255 kubelet_node_status.go:402] Error updating node status, will retry: error getting node "vm10.company.com": Get "https://vm10.company.com:8443/api/v1/nodes/vm10.company.com?timeout=10s": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:45 vm10.company.com kubelet[22255]: E0623 03:14:45.912741   22255 kubelet_node_status.go:402] Error updating node status, will retry: error getting node "vm10.company.com": Get "https://vm10.company.com:8443/api/v1/nodes/vm10.company.com?timeout=10s": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:46 vm10.company.com kubelet[22255]: E0623 03:14:46.367046   22255 controller.go:135] failed to ensure node lease exists, will retry in 400ms, error: Get "https://vm10.company.com:8443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/vm10.company.com?timeout=10s": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:48 vm10.company.com kubelet[22255]: E0623 03:14:47.737579   22255 controller.go:135] failed to ensure node lease exists, will retry in 800ms, error: Get "https://vm10.company.com:8443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/vm10.company.com?timeout=10s": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:49 vm10.company.com kubelet[22255]: E0623 03:14:49.113920   22255 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:458: Failed to list *v1.Node: Get "https://vm10.company.com:8443/api/v1/nodes?fieldSelector=metadata.name%3Dvm10.company.com&limit=500&resourceVersion=0": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:49 vm10.company.com kubelet[22255]: E0623 03:14:48.744770   22255 reflector.go:153] object-"kube-system"/"flannel-token-zvfwn": Failed to list *v1.Secret: Get "https://vm10.company.com:8443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Dflannel-token-zvfwn&limit=500&resourceVersion=0": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:49 vm10.company.com kubelet[22255]: E0623 03:14:49.599631   22255 reflector.go:153] object-"kube-system"/"coredns": Failed to list *v1.ConfigMap: Get "https://vm10.company.com:8443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dcoredns&limit=500&resourceVersion=0": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:49 vm10.company.com kubelet[22255]: E0623 03:14:49.599992   22255 controller.go:135] failed to ensure node lease exists, will retry in 1.6s, error: Get "https://vm10.company.com:8443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/vm10.company.com?timeout=10s": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:49 vm10.company.com kubelet[22255]: E0623 03:14:49.600182   22255 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:449: Failed to list *v1.Service: Get "https://vm10.company.com:8443/api/v1/services?limit=500&resourceVersion=0": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:49 vm10.company.com kubelet[22255]: E0623 03:14:49.600323   22255 reflector.go:153] object-"kube-system"/"kube-flannel-cfg": Failed to list *v1.ConfigMap: Get "https://vm10.company.com:8443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dkube-flannel-cfg&limit=500&resourceVersion=0": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:49 vm10.company.com kubelet[22255]: E0623 03:14:49.600463   22255 reflector.go:153] object-"core"/"registrypullsecret": Failed to list *v1.Secret: Get "https://vm10.company.com:8443/api/v1/namespaces/core/secrets?fieldSelector=metadata.name%3Dregistrypullsecret&limit=500&resourceVersion=0": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:14:49 vm10.company.com kubelet[22255]: E0623 03:14:49.369097   22255 reflector.go:153] object-"kube-system"/"registrypullsecret": Failed to list *v1.Secret: Get "https://vm10.company.com:8443/api/v1/namespaces/kube-system/secrets?fieldSelector=metadata.name%3Dregistrypullsecret&limit=500&resourceVersion=0": write tcp 16.155.199.4:39668->16.155.199.4:8443: use of closed network connection
Jun 23 03:25:39 vm10.company.com kubelet[22255]: E0623 03:25:39.543880   22255 desired_state_of_world_populator.go:320] Error processing volume "deployment-log-dir" for pod "fluentd-h76lr_core(e95c9200-3a0c-4fea-bd7f-99ac1cc6ae7a)": error processing PVC core/itom-vol-claim: failed to fetch PVC from API server: Get "https://vm10.company.com:8443/api/v1/namespaces/core/persistentvolumeclaims/itom-vol-claim": read tcp 16.155.199.4:41512->16.155.199.4:8443: use of closed network connection
Jun 23 03:25:39 vm10.company.com kubelet[22255]: E0623 03:25:39.666303   22255 kubelet_node_status.go:402] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"MemoryPressure\"},{\"type\":\"DiskPressure\"},{\"type\":\"PIDPressure\"},{\"type\":\"Ready\"}],\"conditions\":[{\"lastHeartbeatTime\":\"2020-06-22T19:25:29Z\",\"type\":\"MemoryPressure\"},{\"lastHeartbeatTime\":\"2020-06-22T19:25:29Z\",\"type\":\"DiskPressure\"},{\"lastHeartbeatTime\":\"2020-06-22T19:25:29Z\",\"type\":\"PIDPressure\"},{\"lastHeartbeatTime\":\"2020-06-22T19:25:29Z\",\"type\":\"Ready\"}]}}" for node "vm10.company.com": Patch "https://vm10.company.com:8443/api/v1/nodes/vm10.company.com/status?timeout=10s": read tcp 16.155.199.4:41512->16.155.199.4:8443: use of closed network connection
Jun 23 03:25:49 vm10.company.com kubelet[22255]: E0623 03:25:49.553078   22255 kubelet_node_status.go:402] Error updating node status, will retry: error getting node "vm10.company.com": Get "https://vm10.company.com:8443/api/v1/nodes/vm10.company.com?timeout=10s": read tcp 16.155.199.4:41718->16.155.199.4:8443: use of closed network connection
Jun 23 03:25:49 vm10.company.com kubelet[22255]: E0623 03:25:49.560723   22255 desired_state_of_world_populator.go:320] Error processing volume "log-location" for pod "fluentd-h76lr_core(e95c9200-3a0c-4fea-bd7f-99ac1cc6ae7a)": error processing PVC core/itom-logging-vol: failed to fetch PVC from API server: Get "https://vm10.company.com:8443/api/v1/namespaces/core/persistentvolumeclaims/itom-logging-vol": read tcp 16.155.199.4:41718->16.155.199.4:8443: use of closed network connection
Jun 23 03:27:29 vm10.company.com kubelet[22255]: I0623 03:27:29.961600   22255 log.go:181] performing http2 healthCheck
Jun 23 03:31:32 vm10.company.com kubelet[22255]: I0623 03:31:31.829860   22255 log.go:181] performing http2 healthCheck
Jun 23 03:31:44 vm10.company.com kubelet[22255]: I0623 03:31:44.570224   22255 log.go:181] performing http2 healthCheck
Jun 23 03:32:13 vm10.company.com kubelet[22255]: I0623 03:32:12.961728   22255 log.go:181] performing http2 healthCheck
Jun 23 03:33:16 vm10.company.com kubelet[22255]: I0623 03:33:15.441808   22255 log.go:181] performing http2 healthCheck
Jun 23 03:33:28 vm10.company.com kubelet[22255]: I0623 03:33:28.233121   22255 log.go:181] performing http2 healthCheck

No more use of closed network connection is reported, and kubelet returns to the Ready state.

I have the same issue very often with a Raspberry Pi cluster on k8s 1.17.3. Based on some older issues, I have set the kube API server HTTP/2 stream limit to 1000 (--http2-max-streams-per-connection=1000). It was fine for more than 2 weeks, but after that it has now started again.

I know this is silly, but it should work as a temporary workaround:


apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kubelet-face-slapper
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kubelet-face-slapper
  template:
    metadata:
      labels:
        app: kubelet-face-slapper
    spec:
      # this toleration is to have the daemonset runnable on master nodes
      # remove it if your masters can't run pods	
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/controlplane
        operator: Equal
        value: "true"
      - effect: NoExecute
        key: node-role.kubernetes.io/etcd
        operator: Equal
        value: "true"
      containers:
      - command:
        - /bin/sh
        - -c
        - while true; do sleep 40; docker logs kubelet --since 1m 2>&1 | grep -q "use
          of closed network connection" && (docker restart kubelet ; echo "kubelet
          has been restarted due to connection error") || echo "kubelet connection
          is ok" ;done
        image: docker:stable
        name: kubelet-face-slapper
        volumeMounts:
        - mountPath: /var/run/docker.sock
          name: docker-sock
      volumes:
      - hostPath:
          path: /var/run/docker.sock
          type: File
        name: docker-sock

(This is Rancher-specific, but it can easily be adapted to other distributions by using a privileged container and journalctl/systemctl.)

The sleep interval and the --since window must be less than the cluster's pod-eviction-timeout (5m by default).

BTW, docker pause nginx-proxy on a Rancher worker node makes kubelet produce the same error message.

I am seeing this same behavior. Ubuntu 18.04.3 LTS clean installs, cluster built with Rancher 2.3.4. I have seen this happen periodically lately, and just restarting kubelet tends to fix it for me. Last night all 3 of my worker nodes exhibited this same behavior. I corrected two of them to bring my cluster up; the third is still in this state while I'm digging around.

Hi, from my experiments, these two patches: https://go-review.googlesource.com/c/net/+/243257 https://go-review.googlesource.com/c/go/+/243258

from https://github.com/golang/go/issues/40213 are also very important, if we want to be 100% sure.

The reason is: func (t *Transport) newClientConn may fail: https://github.com/golang/net/blob/d523dce5a7f4b994f7ed0531dbe44cd8fd803e26/http2/transport.go#L700-L703

func (c *addConnCall) run(t *Transport, key string, tc *tls.Conn) will remember this error and pass it to upgradeFn through func (p *clientConnPool) addConnIfNeeded, which will return an erringRoundTripper{err} https://github.com/golang/net/blob/d523dce5a7f4b994f7ed0531dbe44cd8fd803e26/http2/client_conn_pool.go#L195-L197 https://github.com/golang/net/blob/d523dce5a7f4b994f7ed0531dbe44cd8fd803e26/http2/client_conn_pool.go#L177-L179 https://github.com/golang/net/blob/d523dce5a7f4b994f7ed0531dbe44cd8fd803e26/http2/transport.go#L192-L195

erringRoundTripper is terrible, its RoundTrip method will do nothing other than return the error: https://github.com/golang/net/blob/d523dce5a7f4b994f7ed0531dbe44cd8fd803e26/http2/transport.go#L2554

So, this connection with erringRoundTripper must be removed.

In src/net/http/transport.go, there is isH2DialError logic to remove it: https://github.com/golang/go/blob/5cf057ddedfbb149b71c85ec86050431dd6b2d9d/src/net/http/transport.go#L591-L596

However, it uses pconn.alt.(http2erringRoundTripper), which is different from erringRoundTripper, and this assertion returns false! And func (t *Transport) removeIdleConn(pconn *persistConn) bool is not called.

So the next time, in http’s Transport.roundTrip, it will still pick up the faulty pconn from Transport.idleConn. Transport.idleConn[x][0].alt is actually a http2erringRoundTripper, so it will err out without actually doing anything.

https://github.com/golang/go/blob/5cf057ddedfbb149b71c85ec86050431dd6b2d9d/src/net/http/transport.go#L565-L569 https://github.com/golang/go/blob/5cf057ddedfbb149b71c85ec86050431dd6b2d9d/src/net/http/transport.go#L1329 https://github.com/golang/go/blob/5cf057ddedfbb149b71c85ec86050431dd6b2d9d/src/net/http/transport.go#L1008 https://github.com/golang/go/blob/5cf057ddedfbb149b71c85ec86050431dd6b2d9d/src/net/http/transport.go#L1828-L1831

So, the culprit is pconn.alt.(http2erringRoundTripper), which is exactly what https://go-review.googlesource.com/c/net/+/243257 https://go-review.googlesource.com/c/go/+/243258 fixes.

We have some very good news regarding mitigation of the issue using the new golang http2 health check feature: no issues any more. By now, we have implemented the "fix" (hard-coding the setting in the vendored x/net code) in Prometheus, all of Kubernetes and several internal components, observing:

  • no Prometheus federation issues any more
  • kubelet sometimes still reports single “use of closed connection” events but recovers within seconds (we set a ~30 seconds http2 health check window)
  • sometimes we had issues with kubectl watches – also gone if using patched kubectl
  • we’re running an extended E2E testsuite to verify our integration regularly, and observed sporadic test timeouts and flakiness. Guess what? Gone now.

Furthermore, we have gained new insights into how to trigger the issue. I can confirm @vi7's observation regarding live migration with some confidence (we could trace it down, though), and at least with the NSX version we're running, load balancer changes can also trigger such issues (we have a ticket with VMware to make sure they send reset packets in the future). Very likely there are lots of other causes that drop connections in between as well, like connection table overflows.

This is a very annoying and somewhat massive issue for some users of Kubernetes (depending on some kind of "brokenness" of the IaaS layer/networking, I guess). Although there are golang discussions about exposing an interface to properly set these values, do you think there is any chance of getting a PR merged upstream that sets them through reflection (still better than forking x/net, which is what we do right now)? We're fine with providing the code (and validating the fix; we cannot actually reproduce it, but we observe it often enough to be able to confirm whether the fix works).

If the underlying issue was "http2 using dead connections", then restarting kubelet should fix the problem. https://github.com/kubernetes/kubernetes/pull/48670 suggested that reducing TCP_USER_TIMEOUT can mitigate the problem. I have opened https://github.com/golang/net/pull/55 to add a client-side connection health check to the http2 library, but it's going to take more time to land.

If restarting kubelet didn’t solve the issue, then probably it’s a different root cause.
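
The TCP_USER_TIMEOUT mitigation mentioned above can also be applied at the socket level. Here is a minimal, Linux-only sketch (not the actual change from that PR; the function name and timeout values are illustrative) of a dialer that sets the option so that writes on a dead connection fail after roughly the configured timeout instead of the much longer kernel default:

package main

import (
	"net"
	"syscall"
	"time"

	"golang.org/x/sys/unix"
)

// newDialerWithUserTimeout returns a net.Dialer that sets TCP_USER_TIMEOUT on
// outgoing sockets (Linux only). A transport can use its DialContext method.
func newDialerWithUserTimeout(timeout time.Duration) *net.Dialer {
	return &net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP,
					unix.TCP_USER_TIMEOUT, int(timeout.Milliseconds()))
			})
			if err != nil {
				return err
			}
			return sockErr
		},
	}
}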

Very impressive analysis. Nice work @KevinWang15 👍 If I understood things correctly, there is no further problem if you are just using the bundled http2 package in a pre-1.16 Go release, because the type assertion only fails with an explicitly configured h2 transport, which is what Kubernetes does. So, for Kubernetes, it needs the fix from https://golang.org/cl/243258 on Go's h1 stack. That hurts a little 😆

Currently we are using K8s 1.19.x from the official distribution on our stack, and we have not observed any connection problem with kubelet. Though it is probably very, very rare in production that the h2 handshake fails after a successful TLS handshake due to a connection problem, eliminating every possible bug is still an engineer's instinct. Let's do the painful things for good.

Thank you @KevinWang15 for the great analysis!

I took a look at https://github.com/golang/go/issues/34978 (reopened as https://github.com/golang/go/issues/40213). A brief summary:

  • Triggering condition: a connection write error happens after the TLS handshake, but during the h2 handshake.
  • Symptoms compared to this issue (#87615)
    • similarities:
      • from client-go’s perspective, it sees persistent connection failures
      • the client recovers if it’s restarted
    • differences:
      • the log message probably won’t contain repeated “use of closed network connection”, it will be whatever error that happens during the h2 handshake.
      • [updated] starting in k8s 1.18, client-go will replace "use of closed network connection" with a 500 internal error due to this change. Hence, even if the root cause is the h2 connection pool caching a broken connection, client logs might not contain repeated "use of closed network connection".
  • It took a series of patches to fix. Notably,

More details of how https://github.com/golang/go/issues/34978 happens:

  • The erringRoundTripper returned here is wrongly cached in the idle connection pool of the HTTP/1 Transport.
  • The h1 transport thinks this is an h2 client connection, so it doesn't set an idle timer for it, and the erringRoundTripper is stuck there forever.
  • erringRoundTripper is not a real h2 client connection, so the http2 health check won’t help recover it.

More good news: Kubernetes does not use the bundled http2 in the standard net/http package, so we do not need to wait for the next Go release. We can directly use https://github.com/golang/net/commit/08b38378de702b893ee869b94b32f833e2933bd2 to fix this issue.

I proposed a fix here: https://github.com/kubernetes/kubernetes/pull/95898. It updates the dependency to the required version and enables the http2 Transport health check by default. It should help applications that use client-go to communicate with the apiserver (e.g. kubelet) get rid of the "app hangs at write tcp xxx: use of closed connection" problem.

Feel free to leave any comments.

Good news: golang/net master has support for configuring http2 transports, which now allows setting the timeouts! https://github.com/golang/net/commit/08b38378de702b893ee869b94b32f833e2933bd2
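
For illustration, a minimal sketch of how that new API can be wired up (the timeout values are illustrative, and the function name is mine, not from client-go):

package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// enableH2HealthCheck upgrades an existing net/http Transport to HTTP/2 via the
// new ConfigureTransports API and turns on the ping-based health check.
func enableH2HealthCheck(t1 *http.Transport) error {
	t2, err := http2.ConfigureTransports(t1) // returns the underlying *http2.Transport
	if err != nil {
		return err
	}
	t2.ReadIdleTimeout = 30 * time.Second // send a PING if no frames arrive for 30s
	t2.PingTimeout = 15 * time.Second     // close the connection if the PING gets no reply
	return nil
}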

Just another case:

We are seeing this in Kube 1.16.8 clusters as well. Rebooting the VM can be used to bring the node back to a good state (I suspect a kubelet restart would have worked as well).

In our setup, kubelet talks to a local haproxy instance over localhost, which acts as a TCP load balancer for the multiple backend master instances. We are going to investigate whether adding

option clitcpka    # enables keep-alive only on client side
option srvtcpka    # enables keep-alive only on server side

to our load balancer instances helps alleviate the need for the explicit reboot and can lead to a full recovery. Example of the repeated logs:

Apr  8 00:04:25 kube-bnkjtdvd03sqjar31uhg-cgliksp01-cgliksp-00001442 kubelet.service[6175]: E0408 00:04:25.472682    6175 reflector.go:123] object-"ibm-observe"/"sysdig-agent": Failed to list *v1.ConfigMap: Get https://172.20.0.1:2040/api/v1/namespaces/ibm-observe/configmaps?fieldSelector=metadata.name%3Dsysdig-agent&limit=500&resourceVersion=0: write tcp 172.20.0.1:22501->172.20.0.1:2040: use of closed network connection
Apr  8 00:04:25 kube-bnkjtdvd03sqjar31uhg-cgliksp01-cgliksp-00001442 kubelet.service[6175]: E0408 00:04:25.472886    6175 reflector.go:123] object-"default"/"default-token-gvbk5": Failed to list *v1.Secret: Get https://172.20.0.1:2040/api/v1/namespaces/default/secrets?fieldSelector=metadata.name%3Ddefault-token-gvbk5&limit=500&resourceVersion=0: write tcp 172.20.0.1:22501->172.20.0.1:2040: use of closed network connection

I will post an update if that solves our specific problem, in case it helps anyone here in the interim.

Curious if there’s a config parameter to set an absolute upper bound on a watch time? I found --streaming-idle-connection-timeout but nothing specific for watches.

Same here (Ubuntu 18.04, Kubernetes 1.17.3).

@cloud-66 I know; the point here is that a kubelet restart fixes the issue, so this is not fixed in 1.18.18, you need 1.19+.

Yes… you are correct in saying that I intentionally “introduced a bug” by disturbing the internal state, because it’s very difficult to reproduce “use of closed network connection” on demand (it may require high load, etc.), so I thought as a last resort, I could disturb the internal state to mimic a broken TCP connection… only to verify if the HealthCheck patch could help automatically recover kubelet from the “use of closed network connection” broken state.

I was also reading https://github.com/golang/go/issues/39337 just now; it seems that with https://go-review.googlesource.com/c/net/+/240337, even without the periodic connection health check, broken TCP connections can be MarkDead-ed pretty quickly, and kubelet can recover. I only just learned about this "http2: close Transport connection on write errors" patch; it seems it is only in later versions of K8s 1.19.x / 1.20.x. We were using K8s 1.18.8, so we suffered from it. I think this patch really helps.

Your comments are really insightful, thank you very much for your work and information on this issue! Really appreciate it 👍 @povsister

I have actually now succeeded in reproducing this reliably:

  1. Delete the RBAC rules for the node certs so that nodes get unauthorized errors. Leave things like this for a few minutes.
  2. Bring back the RBAC rules
  3. Wait several minutes
  4. Go through every node in the cluster and look for ones that are continuously logging “use of closed connection” errors

In my test cluster with 10 nodes, 3 nodes hit this bug when I did this on the first try. So it is still intermittent but seems easy enough to reproduce in clusters with more nodes.

A colleague also pointed out that it seems like all of the logged errors are coming from watchers, so that might be a hint to resolving this.

@zerkms, how long does it take to recover? If it’s < 45s, then it’s expected. Otherwise we have more to debug.

Temporary workaround for those running K8s on VMware vSphere: disable DRS for the K8s VMs. That will prevent vSphere from moving VMs among hypervisors, thus eliminating the network disconnections that are causing trouble for the kubelets.

@JensErat Do you have any concrete proof that enabling the healthCheck helps resolve this issue?

We're running Prometheus federation for each of our Kubernetes clusters. Prometheus 2.19.0 introduced http2 (they forgot to mention this in the changelog, though, and had it well hidden in a commit message body; I had to git bisect, deploy and wait some hours for each run…), and we observed about a dozen incidents per day with federation stuck. I first patched out http2 support again (and the issue was gone), and then set the read timeout directly in golang.org/x/net/http2. Since then, we have not had a single federation-down incident.

I’m currently preparing to roll out a patched Kubernetes release on some clusters, so we should have data in a few days. We’ll definitely share our outcomes as soon as we have proper data.


The node can enter "NotReady"; then, after 2 minutes, it recovers automatically… (but it remains a mystery why it sometimes does NOT recover automatically).

I figured it out: the deciding factor is the Go version. If I use Go 1.13.15 (which is the default Go version for K8s 1.18) to compile kubelet, then kubelet will be stuck at "use of closed network connection" forever, never recovering.

If I use Go 1.14.15, however, kubelet recovers automatically within 2 minutes.

I haven't figured out the root cause yet; it may be due to some change in the Go net package code (~/go/src/net/).


It's also possible to reproduce this without hacking kubelet; we can hack the apiserver instead, but it produces a slightly different error message:

E0316 07:20:39.114512 7272 kubelet_node_status.go:402] Error updating node status, will retry: error getting node “…”: Get https://…:6443/api/v1/nodes/…?timeout=10s: write tcp …:35322->…:6443: write: broken pipe

The symptoms are 100% the same: no TCP connection is made to the apiserver, kubelet is stuck forever and has to be restarted.

Hack /usr/local/go/src/net/http/server.go:

func (srv *Server) Serve(l net.Listener) error {
...
		rw, e := l.Accept()

+		// Hack: shortly after accepting each connection, force-close it from the
+		// server side if a marker file exists, to simulate a broken TCP connection.
+		// fileExists is a small helper added alongside this hack (see below).
+		go func() {
+			time.Sleep(5 * time.Second)

+			if fileExists("/tmp/hack_break_connections") {
+				fmt.Printf("force closing connections from server side\n")
+				rw.Close()
+			}
+		}()
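
For completeness, fileExists is not a standard library function; a minimal version of the helper assumed by the hack could be:

// fileExists reports whether the given path exists (needs "os" in server.go's imports).
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}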

@caesarxuchao thanks… this helps me a lot.

@caesarxuchao Can you elaborate a little more? I’m confused what #96770, titled “Use k8s.gcr.io/e2e-test-images instead of gcr.io/kubernetes-e2e-test-images for e2e test images”, has to do with this bug.

Did you mean #96778?

Is there a plan to cherry-pick #95981 to 1.17 release branch?

I had this issue, but I've now "fixed" it in my small hobby cluster using the workaround in a comment above.

I wrote a small Ansible playbook to deploy the workaround on the nodes as a systemd unit and timer; it might save others with a similar setup some time.

I have the same issue in v1.17.3.

What I found is that k8s versions using a specific golang.org/x/net version are affected, and this package seems to have been fixed: https://go-review.googlesource.com/c/net/+/198040

Version with this problem (v1.16.5 ~ latest release): golang.org/x/net v0.0.0-20191004110552-13f9640d40b9

Fixed version (master branch): golang.org/x/net v0.0.0-20200707034311-ab3426394381

Will updating the golang.org/x/net package fix this issue?

Is there a release planned for the maintained k8s versions (v1.16, v1.17, v1.18…) to fix this?

Are there any other workarounds besides the kubelet restart cron job? We have had a cron job in place for a week and it has not stopped the problem from happening.

@warmchang I think this applies to apiextensions apiservers and the sample apiserver: https://github.com/kubernetes/kubernetes/blob/ae1103726f9aea1f9bbad1b215edfa47e0747dce/staging/src/k8s.io/apiserver/pkg/server/options/recommended.go#L62

A curl test without setting --http2-max-streams-per-connection shows this in our apiserver logs (using v1.16): I0603 10:18:08.038531 1 flags.go:33] FLAG: --http2-max-streams-per-connection="0"

And a curl request shows this in the response: * Connection state changed (MAX_CONCURRENT_STREAMS == 250)!

When I use --http2-max-streams-per-connection=1000, the curl request then shows: * Connection state changed (MAX_CONCURRENT_STREAMS == 1000)!

I'm seeing an identical issue on 1.16.8. Every so often (sometimes once every couple of days, sometimes every couple of weeks) the node becomes NotReady with reason "Kubelet stopped posting node status", and "use of closed network connection" fills the logs.