kubernetes: Kube-proxy remains running even if there are errors
What happened: This is the third time I have run into a Kubernetes networking problem, and every one of them came down to kube-proxy and one single error/bug: kube-proxy remains up and running even when it cannot bind to a port properly.
Each time, finding the root cause was a nightmare; it took hours, and the first time it took days.
This causes many different problems, including loss of connectivity between pod and pod, pod and service, and external clients and NodePorts.
What you expected to happen: kube-proxy pods that cannot bind to an address should fail/crash.
How to reproduce it (as minimally and precisely as possible): kill the running kube-proxy, then occupy the port it uses so that when it restarts it cannot bind to the port it needs (see the sketch below).
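A minimal sketch of those steps, assuming a kubeadm cluster with the default kube-proxy metrics port 10249, the default pod label k8s-app=kube-proxy, and node names that match the hostname:

# Occupy kube-proxy's metrics port on the node (any process bound to 127.0.0.1:10249 works)
python3 -m http.server 10249 --bind 127.0.0.1 &

# Delete the kube-proxy pod on this node so the DaemonSet recreates it
kubectl -n kube-system delete pod -l k8s-app=kube-proxy --field-selector spec.nodeName=$(hostname)

# The new pod comes back as Running, but its logs show the bind errors below
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=5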
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): 1.16.3
- Cloud provider or hardware configuration: bare metal, 1 master and 4 workers
- OS (e.g. cat /etc/os-release): Ubuntu 18.04
- Install tools: kubeadm
- Network plugin and version (if this is a network-related bug): Calico
- Others: Kube-proxy logs:
E1120 14:19:04.010132 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:09.010451 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:14.011746 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:18.049268 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:19:19.012147 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:24.012493 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:29.012862 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:34.013161 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:39.013597 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:44.013998 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:49.014372 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:54.014788 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:19:59.015056 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:20:04.015477 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:20:09.015844 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:20:14.016254 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:20:18.049791 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:20:19.016619 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:20:24.016969 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
E1120 14:20:29.017418 1 server.go:564] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
Logs from another kube-proxy pod:
E1120 13:53:53.641275 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 13:54:53.641638 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 13:55:53.642044 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 13:56:53.642460 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 13:57:53.642823 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 13:58:53.643152 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 13:59:53.643600 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:00:53.644044 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:01:53.644497 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:02:53.644932 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:03:53.645320 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:04:53.645790 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:05:53.646226 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:06:53.646601 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:07:53.646937 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:08:53.647353 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:09:53.647855 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:10:53.648316 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:11:53.648768 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:12:53.649032 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:13:53.649422 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:14:53.649882 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:15:53.650321 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:16:53.650778 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:17:53.651217 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:18:53.651590 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:19:53.651942 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:20:53.652286 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
E1120 14:21:53.652707 1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
kubectl get pods -n kube-system:
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-564b6667d7-r2jh7 1/1 Running 0 3h58m
calico-node-5798s 1/1 Running 0 3h42m
calico-node-6bzrh 1/1 Running 0 3h40m
calico-node-csqvz 1/1 Running 0 3h40m
calico-node-tdgc4 1/1 Running 0 3h40m
calico-node-zwtcn 1/1 Running 0 3h58m
coredns-5644d7b6d9-llrfv 1/1 Running 0 3h59m
coredns-5644d7b6d9-zd6t8 1/1 Running 0 3h59m
etcd-serflex-giant-7 1/1 Running 0 3h58m
kube-apiserver-serflex-giant-7 1/1 Running 0 3h58m
kube-controller-manager-serflex-giant-7 1/1 Running 0 3h58m
kube-proxy-4m7j8 1/1 Running 0 3h40m
kube-proxy-dpvlg 1/1 Running 0 3h40m
kube-proxy-fmbrs 1/1 Running 0 3h59m
kube-proxy-fwj8m 1/1 Running 0 3h40m
kube-proxy-gld6v 1/1 Running 0 3h42m
kube-scheduler-serflex-giant-7 1/1 Running 0 3h58m
metrics-server-788c5864d6-5f7vn 1/1 Running 0 3h24m
tiller-deploy-77855d9dcf-ft84b 1/1 Running 0 3h13m
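Note that all the kube-proxy pods above are reported as 1/1 Running even while the errors keep being logged. A quick way to see the mismatch on an affected node (10256 is the default healthz port, 10249 the default metrics port):

# Show which process actually owns the kube-proxy ports on this node
sudo ss -tlnp | grep -E ':10249|:10256'

# The healthz endpoint may still answer, served by whichever process holds the port,
# so it does not reflect the state of the kube-proxy pod that is marked Running
curl -s http://127.0.0.1:10256/healthz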
/sig network
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 15 (7 by maintainers)
First, you are very welcome. I want to see people successful in their usages of Kubernetes.
I’ll leave this open for now, but there must have been some extra circumstances that we don’t know. If someone else can repro this, we can re-examine it.
@avarf I confirm that kube-proxy does not restart when the metrics port cannot be bound. The problem with “fixing” that is that it converts a “soft” fail into a “hard” fail, and there are very likely people who are surviving the soft fail today without really noticing.
That said, if you have healthz and metrics on the same port, it might cause the proxy to get restarted.
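For reference, one way to check whether healthz and metrics share a port, assuming the kubeadm-managed kube-proxy ConfigMap (the defaults are metricsBindAddress 127.0.0.1:10249 and healthzBindAddress 0.0.0.0:10256):

# Print the bind addresses from the KubeProxyConfiguration
kubectl -n kube-system get configmap kube-proxy -o yaml | grep -E 'metricsBindAddress|healthzBindAddress'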
None of that should cause data-plane outages. I just tested that manually, and iptables seems just fine.
I would be fine to see a new config that converts metrics failure-to-bind into a fatal, but it has to be opt-in. I’ll open a new bug for that. Other than that, I’m afraid I don’t know what I can do with this report. 😦
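For anyone debugging a similar situation, two quick checks that match the reasoning above, assuming kubeadm defaults: whether the kube-proxy DaemonSet defines a liveness probe at all, and whether the iptables data plane is still being programmed despite the failed listeners:

# An empty result means there is no probe that could ever restart kube-proxy
kubectl -n kube-system get daemonset kube-proxy -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'

# On a node: service chains still present means the data plane is still being synced
sudo iptables-save | grep -c 'KUBE-SERVICES'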