kubernetes: Kube-proxy hangs on api-server errors

What happened: After a couple of api-server outages (two are apparent in the logs, each lasting at least several minutes), kube-proxy stopped receiving Endpoints updates and stopped updating iptables. Here are the kube-proxy logs, which show errors and then nothing for 4 days (log ends on 4/8/2019): kubehunglogs.txt. Here are the iptables rules and endpoints that were out of whack:
iptablesandendpoints.txt

We’ve had this happen twice, so we’re reporting it now even though we’re still working on a repro. The CoreDNS service being out of date is what really kills us, since DNS lookups just start hanging.

What you expected to happen: kube-proxy should either keep retrying or fail outright so it can be restarted.

How to reproduce it (as minimally and precisely as possible): Take an api-server down for several minutes? We may try this today. Note that our api-server is an AKS instance, so it is actually an FQDN rather than a service in the cluster.

Anything else we need to know?: Sadly the pod was restarted, but we’re curious whether the healthz endpoint may help here. One strategy might be to check the lastUpdated time reported by healthz and have a liveness probe kill kube-proxy if it has been hours since the last update.
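
For what it’s worth, here is a minimal sketch of that probe idea in Go: fetch kube-proxy’s healthz, read its lastUpdated timestamp, and exit non-zero if it looks stale. The port below is kube-proxy’s usual healthz port, but the JSON field name and timestamp layout are assumptions that may need adjusting for a given version.

```go
// Hypothetical exec-style liveness check for kube-proxy: query healthz,
// parse lastUpdated, and fail the probe if kube-proxy has gone quiet.
// Assumptions: healthz on :10256, a JSON body with a "lastUpdated" field,
// and the timestamp layout below -- all may need adjusting.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"strings"
	"time"
)

const (
	healthzURL = "http://127.0.0.1:10256/healthz"
	maxStale   = 2 * time.Hour // how long we tolerate no proxier updates
	timeLayout = "2006-01-02 15:04:05.999999999 -0700 MST"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(healthzURL)
	if err != nil {
		fail("healthz unreachable: %v", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fail("healthz returned %d", resp.StatusCode)
	}

	var body struct {
		LastUpdated string `json:"lastUpdated"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		fail("cannot decode healthz body: %v", err)
	}

	// Strip Go's monotonic-clock suffix ("m=+...") if the server included it.
	raw := body.LastUpdated
	if i := strings.Index(raw, " m="); i >= 0 {
		raw = raw[:i]
	}
	last, err := time.Parse(timeLayout, raw)
	if err != nil {
		fail("cannot parse lastUpdated %q: %v", raw, err)
	}

	if age := time.Since(last); age > maxStale {
		fail("kube-proxy appears stuck: no update for %s", age)
	}
	fmt.Println("kube-proxy healthy")
}

func fail(format string, args ...interface{}) {
	fmt.Fprintf(os.Stderr, format+"\n", args...)
	os.Exit(1)
}
```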

Environment:

  • Kubernetes version (use kubectl version): 1.12.5
  • Cloud provider or hardware configuration: Azure
  • OS (e.g: cat /etc/os-release): Ubuntu 16.04.5 LTS
  • Kernel (e.g. uname -a): 4.15.0-1041-azure

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 66 (26 by maintainers)

Most upvoted comments

I think this may be resolved by https://github.com/kubernetes/kubernetes/pull/95981, which is in 1.20, and backported to 1.19.

OK. For backlog’s sake I am going to close this, but if it pops back up, please please please let us know.

@squeed, did you folks see hangs in the kube-proxy watch as well?

No, we saw something else weird: a missed transition in the informers. During API server + etcd disruption, we saw UpdateEndpoints(A -> B), then UpdateEndpoints(C -> C). That shouldn’t be possible, and kube-proxy therefore ignored the change.
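
Not kube-proxy itself, but a minimal client-go sketch of how one could watch for that kind of impossible transition: remember the last resourceVersion observed per Endpoints object and flag any update whose old object doesn’t match what was last seen. The informer wiring is standard client-go; the mismatch bookkeeping is the illustrative part.

```go
// Sketch: detect informer updates whose "old" object does not match the
// state we last observed for that key (the A->B then C->C pattern above).
package main

import (
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
	inf := factory.Core().V1().Endpoints().Informer()

	// Events are delivered serially to a single handler, so a plain map works.
	lastSeen := map[string]string{} // "namespace/name" -> last resourceVersion observed

	inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if ep, ok := obj.(*corev1.Endpoints); ok {
				lastSeen[ep.Namespace+"/"+ep.Name] = ep.ResourceVersion
			}
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldEp, ok1 := oldObj.(*corev1.Endpoints)
			newEp, ok2 := newObj.(*corev1.Endpoints)
			if !ok1 || !ok2 {
				return
			}
			key := newEp.Namespace + "/" + newEp.Name
			// If the "old" side of this update is not what we last observed,
			// a transition was missed.
			if prev, seen := lastSeen[key]; seen && prev != oldEp.ResourceVersion {
				log.Printf("missed transition for %s: last saw rv=%s, update claims old rv=%s new rv=%s",
					key, prev, oldEp.ResourceVersion, newEp.ResourceVersion)
			}
			lastSeen[key] = newEp.ResourceVersion
		},
		DeleteFunc: func(obj interface{}) {
			if ep, ok := obj.(*corev1.Endpoints); ok {
				delete(lastSeen, ep.Namespace+"/"+ep.Name)
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // run until killed
}
```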

+1, we encountered this issue 2 days ago; the solution was to restart the kube-proxy pods. Is there any way to detect which AKS versions (or K8s versions) are affected by this?

#65012 is certainly interesting, as our api-server is out of cluster. I was looking to see whether I could add a last-endpoints-update time to the healthz endpoint here: https://github.com/kubernetes/kubernetes/blob/f1693efe3713f065d33a5f0d31df0bec0a966bc0/pkg/proxy/iptables/proxier.go

If so, I could at least have a liveness probe that would restart kube-proxy if it got stuck.
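
Not the actual proxier.go change, just a sketch of the shape of it: stamp a time whenever an endpoints change is applied, and have a healthz-style handler go unhealthy once that stamp is too old, so a kubelet liveness probe can restart a stuck kube-proxy. All names and the port here are hypothetical.

```go
// Sketch of exposing a last-endpoints-update time for a liveness probe.
// Not kube-proxy's real healthz server; names and port are hypothetical.
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync"
	"time"
)

type staleTracker struct {
	mu                  sync.Mutex
	lastEndpointsUpdate time.Time
	maxStale            time.Duration
}

// markEndpointsUpdate would be called wherever the proxier applies an
// endpoints change (e.g. after a successful iptables sync).
func (t *staleTracker) markEndpointsUpdate() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastEndpointsUpdate = time.Now()
}

// ServeHTTP answers 200 while updates are recent and 503 once they stop,
// which is what a liveness probe needs in order to trigger a restart.
func (t *staleTracker) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	t.mu.Lock()
	last := t.lastEndpointsUpdate
	t.mu.Unlock()

	if age := time.Since(last); age > t.maxStale {
		http.Error(w, fmt.Sprintf("no endpoints update for %s", age), http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintf(w, `{"lastEndpointsUpdate": %q}`, last.Format(time.RFC3339))
}

func main() {
	t := &staleTracker{lastEndpointsUpdate: time.Now(), maxStale: 2 * time.Hour}
	http.Handle("/healthz", t)
	// In real kube-proxy this would hang off the existing healthz server;
	// a throwaway port is used here purely for illustration.
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```

One caveat with this approach: a quiet cluster can legitimately go hours without an endpoints change, so the staleness threshold (and whether no-op syncs should also refresh the stamp) would need care.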

We’re still trying to repro the hang by tearing the api-server down repeatedly overnight. May have more info tomorrow.