kubernetes: Timeouts when draining a node while using external-traffic: OnlyLocal annotation on loadbalancer
/kind bug
What happened:
While trying to achieve zero downtime, the current practice seems to be adding a preStop sleep hook (see https://github.com/kubernetes/ingress/issues/322 & https://github.com/kubernetes/kubernetes/issues/43576) to prevent pods from being terminated before their endpoints have been removed. This works well except when it is combined with a LoadBalancer service carrying the service.beta.kubernetes.io/external-traffic: OnlyLocal annotation.
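For reference, the preStop workaround looks roughly like this; a minimal sketch, assuming an illustrative backend Deployment (the names, image, and 15-second sleep are placeholders, not values from this report):

```yaml
apiVersion: apps/v1beta1   # Deployment API group available on the v1.6 cluster in this report
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: backend
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: backend
        image: nginx   # illustrative image
        lifecycle:
          preStop:
            exec:
              # Delay SIGTERM so endpoint removal has time to
              # propagate before the pod stops serving.
              command: ["sleep", "15"]
```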
What you expected to happen:
Doing a rolling update of the ingress or draining a node should not cause timeouts.
How to reproduce it (as minimally and precisely as possible):
A minimal setup would be something like an ingress-controller and a backend deployment, both with at least 2 replicas and a preStop hook executing a short sleep, PodDisruptionBudgets with minAvailable: 1, and finally a LoadBalancer service.
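A sketch of the supporting objects, assuming the ingress-controller pods carry an app: ingress-controller label (all names are illustrative):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: ingress-controller
spec:
  minAvailable: 1            # keep at least one replica up during the drain
  selector:
    matchLabels:
      app: ingress-controller
---
apiVersion: v1
kind: Service
metadata:
  name: ingress-controller
spec:
  type: LoadBalancer
  selector:
    app: ingress-controller
  ports:
  - port: 80
    targetPort: 80
```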
Execute some form of stress test while you drain a node that hosts an ingress-controller; there should not be any disruptions.
Now add the OnlyLocal annotation and repeat the test: timeouts occur.
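The annotated variant that triggers the timeouts; only the metadata changes relative to the Service sketched above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-controller
  annotations:
    # Restricts kube-proxy to routing external traffic only to pods on
    # the local node; the cloud LB then health-checks each node for the
    # presence of local endpoints.
    service.beta.kubernetes.io/external-traffic: OnlyLocal
spec:
  type: LoadBalancer
  selector:
    app: ingress-controller
  ports:
  - port: 80
    targetPort: 80
```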
Environment:
- Kubernetes version (use `kubectl version`): v1.6.4
- Cloud provider or hardware configuration: Azure
- OS (e.g. from /etc/os-release): CoreOS 1465.6.0
- Kernel (e.g. `uname -a`): 4.12.7-coreos
We are seeing the same issue. Based on our debugging, it looks like kube-proxy removes the routing rules for the terminating pod before the load balancer's health checks have had time to mark the node unhealthy, so traffic the LB keeps sending to the node is blackholed in the meantime.
A nice fix for us would be the ability to configure the delay between TERM and kube-proxy removing the routing rules. After TERM, the LB would then have enough time to realise the node is not healthy before kube-proxy starts blackholing traffic. @thockin I saw you were on a number of discussions around non-disruptive rollouts. I was wondering what your opinion is on this.
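For concreteness, such a knob could be imagined as a Service annotation; this is purely hypothetical, no such annotation exists in Kubernetes:

```yaml
# Purely hypothetical illustration of the requested knob -- NOT a real field.
apiVersion: v1
kind: Service
metadata:
  name: ingress-controller
  annotations:
    service.beta.kubernetes.io/external-traffic: OnlyLocal
    # Imagined setting: keep the node's routing rules for 15s after the
    # pod receives TERM, so the LB health checks can fail first.
    service.beta.kubernetes.io/endpoint-removal-delay: "15s"
spec:
  type: LoadBalancer
  selector:
    app: ingress-controller
  ports:
  - port: 80
```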
Still a valid issue. /remove-lifecycle rotten
Any news on this?