kubernetes: Kube-proxy remains running even if there are errors

What happened: This is the third time I have faced a problem with k8s networking, and every time it came down to kube-proxy and the same single error/bug: kube-proxy remains up and running even when it cannot properly bind to a port.

Each time, finding the real problem was a nightmare: it took hours, and the first time it took days.

This causes many different problems, including loss of connectivity between pod and pod, pod and service, and external clients and NodePorts.

What you expected to happen: kube-proxy pods that cannot bind to an address should fail/crash.

How to reproduce it (as minimally and precisely as possible): kill the running kube-proxy and occupy the port that kube-proxy was using, so that when it restarts it cannot bind to the port it needs (see the sketch below).
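
For reference, a minimal sketch in Go (assuming kube-proxy's default metrics address, 127.0.0.1:10249) of a process that holds the port so a restarted kube-proxy cannot bind it:

package main

import (
	"fmt"
	"net"
)

func main() {
	// Hold kube-proxy's default metrics address so a restarted kube-proxy
	// cannot bind it and keeps logging "address already in use".
	ln, err := net.Listen("tcp", "127.0.0.1:10249")
	if err != nil {
		fmt.Println("could not bind 127.0.0.1:10249:", err)
		return
	}
	defer ln.Close()
	fmt.Println("holding 127.0.0.1:10249; now restart kube-proxy and watch its logs")
	select {} // block forever
}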

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.16.3
  • Cloud provider or hardware configuration: Bare metal: 1 master and 4 workers
  • OS (e.g: cat /etc/os-release): Ubuntu 18.04
  • Install tools: kubeadm
  • Network plugin and version (if this is a network-related bug): Calico
  • Others: Kube-proxy logs:
E1120 14:19:04.010132       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:09.010451       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:14.011746       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:18.049268       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:19:19.012147       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:24.012493       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:29.012862       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:34.013161       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:39.013597       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:44.013998       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:49.014372       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:54.014788       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:59.015056       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:20:04.015477       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:20:09.015844       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:20:14.016254       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:20:18.049791       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:20:19.016619       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:20:24.016969       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:20:29.017418       1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use

And another pod:

E1120 13:53:53.641275       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 13:54:53.641638       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 13:55:53.642044       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 13:56:53.642460       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 13:57:53.642823       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 13:58:53.643152       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 13:59:53.643600       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:00:53.644044       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:01:53.644497       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:02:53.644932       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:03:53.645320       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:04:53.645790       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:05:53.646226       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:06:53.646601       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:07:53.646937       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:08:53.647353       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:09:53.647855       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:10:53.648316       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:11:53.648768       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:12:53.649032       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:13:53.649422       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:14:53.649882       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:15:53.650321       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:16:53.650778       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:17:53.651217       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:18:53.651590       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:19:53.651942       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:20:53.652286       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:21:53.652707       1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use

kubectl get pods -n kube-system:

NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-564b6667d7-r2jh7   1/1     Running   0          3h58m
calico-node-5798s                          1/1     Running   0          3h42m
calico-node-6bzrh                          1/1     Running   0          3h40m
calico-node-csqvz                          1/1     Running   0          3h40m
calico-node-tdgc4                          1/1     Running   0          3h40m
calico-node-zwtcn                          1/1     Running   0          3h58m
coredns-5644d7b6d9-llrfv                   1/1     Running   0          3h59m
coredns-5644d7b6d9-zd6t8                   1/1     Running   0          3h59m
etcd-serflex-giant-7                       1/1     Running   0          3h58m
kube-apiserver-serflex-giant-7             1/1     Running   0          3h58m
kube-controller-manager-serflex-giant-7    1/1     Running   0          3h58m
kube-proxy-4m7j8                           1/1     Running   0          3h40m
kube-proxy-dpvlg                           1/1     Running   0          3h40m
kube-proxy-fmbrs                           1/1     Running   0          3h59m
kube-proxy-fwj8m                           1/1     Running   0          3h40m
kube-proxy-gld6v                           1/1     Running   0          3h42m
kube-scheduler-serflex-giant-7             1/1     Running   0          3h58m
metrics-server-788c5864d6-5f7vn            1/1     Running   0          3h24m
tiller-deploy-77855d9dcf-ft84b             1/1     Running   0          3h13m

/sig network

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

First, you are very welcome. I want to see people successful in their use of Kubernetes.

I’ll leave this open for now, but there must have been some extra circumstances that we don’t know. If someone else can repro this, we can re-examine it.

On Sun, Mar 22, 2020, 1:35 PM avarf notifications@github.com wrote:

Thank you for your time and effort in trying to repro this. Unfortunately, I don’t have any more logs because this happened months ago, and the problem was solved as soon as I killed the process that was using the port and restarted kube-proxy.

I faced this at least twice, and both times I can confirm that all the nodes were up, running, and ready, and all the pods were also up and ready, but traffic was not going through. That, plus the fact that freeing the port and restarting kube-proxy solved the issue, is why I assumed it is related to kube-proxy.

@avarf I confirm that kube-proxy does not restart when the metrics port cannot be bound. The problem with “fixing” that is that it converts a “soft” fail into a “hard” fail, and there are very likely people who are surviving the soft fail today without really noticing.

That said, if you have healthz and metrics on the same port, it might cause the proxy to get restarted.

None of that should cause data-plane outages. I just tested that manually, and iptables seems just fine.
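
One quick way to see whether only these control endpoints are affected (as opposed to the data plane) is to probe them from the node. A minimal sketch, assuming the default healthz (10256) and metrics (10249) ports:

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	// Default kube-proxy control endpoints: healthz on 10256, metrics on 10249.
	for _, url := range []string{
		"http://127.0.0.1:10256/healthz",
		"http://127.0.0.1:10249/metrics",
	} {
		resp, err := client.Get(url)
		if err != nil {
			fmt.Printf("%s: unreachable: %v\n", url, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("%s: HTTP %d\n", url, resp.StatusCode)
	}
}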

I would be fine with a new config option that converts a metrics failure-to-bind into a fatal error, but it has to be opt-in. I’ll open a new bug for that. Other than that, I’m afraid I don’t know what I can do with this report. 😦
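
As a sketch of that opt-in idea (illustrative only, not kube-proxy’s actual code; the flag name and retry interval are assumptions):

package main

import (
	"flag"
	"log"
	"net"
	"time"
)

func main() {
	// Hypothetical opt-in flag: when set, a failure to bind the metrics port
	// is fatal instead of being retried forever.
	hardFail := flag.Bool("bind-address-hard-fail", false,
		"treat failure to bind the metrics port as a fatal error")
	flag.Parse()

	var ln net.Listener
	for {
		var err error
		ln, err = net.Listen("tcp", "127.0.0.1:10249")
		if err == nil {
			break
		}
		if *hardFail {
			// Opt-in: exit non-zero so the kubelet restarts the pod and the
			// failure becomes visible.
			log.Fatalf("starting metrics server failed: %v", err)
		}
		// Default behavior: log and retry, which is what the logs above show.
		log.Printf("starting metrics server failed: %v", err)
		time.Sleep(5 * time.Second)
	}
	log.Printf("metrics listener bound on %s", ln.Addr())
	// ... hand ln to an HTTP server serving /metrics ...
}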