kubernetes: kubectl cordon causes downtime of ingress(nginx)
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened: In order to migrate to a new version of Kubernetes on GKE, I cordoned (`kubectl cordon <node>`) all nodes in the old node pool (1.8.9) so that I could then drain them and have the pods rescheduled onto the new node pool (1.10.2). As soon as I cordoned all the nodes in the old node pool, the ingress (nginx) went down. The nginx controller pods were still running fine on the cordoned nodes, but the cordon caused the nodes to be removed from the target pool of the ingress load balancer. Below are the documents that recommend this same method, which I followed, as best practice for a zero-downtime upgrade:
1. https://cloud.google.com/kubernetes-engine/docs/tutorials/migrating-node-pool
2. https://cloudplatform.googleblog.com/2018/06/Kubernetes-best-practices-upgrading-your-clusters-with-zero-downtime.html
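For reference, this is roughly the procedure I followed from those guides (a minimal sketch; `old-pool` is a placeholder for the old node pool's name, and the drain flags may differ between kubectl versions):

```
# Nodes in a GKE node pool carry the cloud.google.com/gke-nodepool label.
kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool

# Step 1: mark every node in the old pool as unschedulable.
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl cordon "$node"
done

# Step 2: drain the nodes one at a time so the pods get rescheduled onto the new pool.
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
done
```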
What you expected to happen:
I expected `kubectl cordon <node>` not to cause the node(s) to be removed from the target pool of the load balancer, because the nginx controller pods were running absolutely fine on these nodes.
How to reproduce it (as minimally and precisely as possible):
1. Cordon all the nodes in an old node pool that runs the nginx controller.
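A minimal sketch of the repro (again with a placeholder pool name, and assuming the controller runs in the `ingress-nginx` namespace):

```
# Cordon every node in the pool that hosts the nginx controller pods.
kubectl cordon $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool -o name)

# The controller pods are still Running on those nodes...
kubectl get pods -n ingress-nginx -o wide

# ...but the nodes are removed from the load balancer's target pool and its
# health checks start failing for them.
kubectl get nodes
```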
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): 1.10.2
- Cloud provider or hardware configuration: GKE
- OS (e.g. from /etc/os-release): Container OS
- Kernel (e.g. `uname -a`): 4.14.22+
- Install tools:
- Others: Cordoning causes the nodes to be removed from the target pool of the ingress load balancer. The health checks then fail for those nodes, which should not be the case, because the nginx controller pods are running fine on them.
About this issue
- State: closed
- Created 6 years ago
- Reactions: 33
- Comments: 46 (24 by maintainers)
Commits related to this issue
- Add note for LB behaviour for cordoned nodes. (#18784) See also https://github.com/kubernetes/kubernetes/issues/65013 This is a reasonably common pitfall: `kubectl cordon <all nodes>` will also drop all LB t... — committed to kubernetes/website by MMeent 4 years ago
This just caused a major outage for us too.
I agree with the previous posts, while this may be working as the code intended, it does not follow the patterns and documentation k8s provides. When I cordon a node it should not impact the traffic going to the node or any pods on it. I should be able to gradually remove traffic by draining nodes/pods and respecting PDBs.
Disagree. Unschedulable means “no new workloads”. It has nothing to do with existing workloads, that is why we have drain (but we should have something in kube that communicates nodes are draining).
I have opened #90823 to remove this check - we have an alternate approach now and the old logic is wrong (the comment in the code even says “this is for masters!” and that’s wrong now too as part of the kep).
This just surprised us also. I understand the technical reasons behind it, but the command name `cordon` and associated documentation are misleading. To me it isn’t intuitive that the command `cordon` does not remove compute (pods) but does remove some networking. I would expect cordon to keep new pods from getting scheduled to the node and keep it from getting added to new services/ELBs/etc. The `drain` command would then cause pods to be removed and also take the node out of networking paths. We almost need a new state for a node of `offline` that is clearly defined.

This is a bug with service load balancers - not something we should document. We defined a label in https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/2019-07-16-node-role-label-use.md#service-load-balancer that cluster admins can use - Unschedulable has nothing to do with SLB inclusion. The label is `node.kubernetes.io/exclude-from-external-load-balancers`, which is beta and on by default in 1.19.

My main issue is with the docs, and also the GKE docs. They should explain this pitfall. After I became aware of the issue I understand why it works as it does. A simple workaround is to taint the nodes you are going to rotate first.
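To make the two workarounds mentioned above concrete, a rough sketch (node name and taint key are placeholders; on newer releases the presence of the exclusion label is what the service controller checks, but verify the semantics for your version):

```
# Option A: exclude the node from external load balancers without cordoning it
# (the label discussed above; beta and on by default in 1.19).
kubectl label node gke-old-pool-node-1 node.kubernetes.io/exclude-from-external-load-balancers=true

# Option B: keep new pods off the node while leaving load balancer membership alone.
# Existing pods, including the nginx controller, keep running and the node stays Ready.
kubectl taint nodes gke-old-pool-node-1 pool-rotation=true:NoSchedule

# Remove the taint again when the rotation is done.
kubectl taint nodes gke-old-pool-node-1 pool-rotation:NoSchedule-
```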
This doesn’t address the network issues you’re seeing, but for those trying to “roll their own cordon” using taints: using a NoExecute taint to “drain” the node won’t be orderly and won’t respect any PDBs. You want to replicate the logic from `kubectl drain`, which tries to do evictions of every pod in a loop. (I realize this is a burdensome idea, but it is also the safest idea.)
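Along the same lines, a sketch of leaning on `kubectl drain` itself rather than a NoExecute taint, since drain respects PDBs (node name is a placeholder; flag names vary a bit across kubectl versions):

```
# Evicts pods via the eviction API, so PodDisruptionBudgets are honored.
# DaemonSet pods are skipped; emptyDir data on the node is discarded.
kubectl drain gke-old-pool-node-1 --ignore-daemonsets --delete-local-data

# Make the node schedulable again if you are not deleting it.
kubectl uncordon gke-old-pool-node-1
```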
We wanted to replace all of the Kubernetes cluster nodes with different node instance types, so we started off by cordoning all of the nodes, but it surprisingly caused an outage for us. As @joekohlsdorf pointed out, the only documentation around cordoning is “Mark node as unschedulable”, which I took to mean that new pods cannot be scheduled on those nodes. It was surprising to find out that not only does it not allow new pods to be scheduled on those nodes, but it also causes the service controller to remove unschedulable nodes from load balancing pools. I understand that it’s working as intended, but if that was also documented as part of the cordon operation, I would have been able to avoid an outage.

I’ve posted a question regarding this issue on SO some weeks before it was created:
https://stackoverflow.com/questions/50488507/why-does-http-load-balancer-forwarding-rules-exclude-cordoned-gke-nodes/50515954
The suggested action was to use node taints (`NoSchedule`) instead of cordon; that way nodes will still be marked as `Ready`.

Is there a plan for cordon behavior to improve on GCP? We have relied on this behavior for years on our AWS clusters and it was very surprising to find out.
Besides running a test like @kevinkim9264’s above, how would I determine what the behavior is on AWS? I haven’t found anything in the node administration docs or the `aws-load-balancer` Service that would imply that `cordon` would shift traffic away from a `Ready` Pod. In fact, the doc for Manual Node Administration explicitly states, emphasis mine:

If I hadn’t come across this Issue due to reports from GKE, I suppose it would only have been a matter of time until we had a high-impact production outage. There’s a disconnect between the Kubernetes API and the LoadBalancer as to whether Pods running on a `cordon`’d node are Ready. From all the docs I’ve seen, I would expect that either a) `cordon` would evict Pods over to a `Ready` Node before setting the instance as `OutOfService`, or b) the LoadBalancer would not equate `Ready,SchedulingDisabled` with `OutOfService`.
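For anyone wanting to check this on their own AWS cluster, a rough sketch (the node name and load balancer name are placeholders, and the AWS CLI call assumes a classic ELB):

```
# After cordoning, the node still reports Ready, just with scheduling disabled.
kubectl cordon ip-10-0-1-23.ec2.internal
kubectl get nodes
# NAME                        STATUS                     ...
# ip-10-0-1-23.ec2.internal   Ready,SchedulingDisabled   ...

# Separately, check whether the cloud load balancer still has the instance in service.
aws elb describe-instance-health --load-balancer-name my-service-elb
```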
It looks like this happens only on AWS and GCP but not on Azure? Can anyone confirm this?
============ I just tested in both AWS and Azure, and while I see logs like
in AWS Kubernetes, I do not see any logs like that in Azure. Is it intended? If so, isn’t the behavior different for each cloud provider?
i.e., cordoning a node in AWS Kubernetes prevents the load balancer from routing traffic to the affected node, but cordoning a node in Azure Kubernetes will still let the load balancer route traffic to the affected node. Is that intentional?
Just chiming in; we as a GKE customer have today been bitten by this issue in production. The documentation of `kubectl cordon` (in the context of GKE) is not extensive enough for us to have been aware of this issue. This should be addressed. Luckily the outage was not severe, because we were using the node-by-node approach (and accepted the container churn this would cause). We explicitly took this approach because our previous nodepool upgrade was a total outage also caused by `kubectl cordon`, in a different way.

(For those interested in our total outage, not related to this issue): `kubectl cordon` also results in the `kube-dns-autoscaler` re-calculating the number of schedulable nodes, which scaled the number of `kube-dns` pods in our cluster back down (in our case: from 110 to 2), resulting in a major internal DNS outage for us. Be warned that `kubectl cordon` for a nodepool upgrade can have a lot of unwanted & unexpected side effects.

@thockin IIUC, we don’t want new pods to land on the tainted/cordoned nodes. In that case, we can place a “NoSchedule” taint on the nodes. We don’t need any tolerations on nginx. This will prevent new instances of nginx from landing on the tainted nodes, but the existing ones will keep running there.
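A short sketch of that taint-based approach (placeholder node name, and assuming the controller runs in the `ingress-nginx` namespace):

```
# Keep new pods off the node without marking it unschedulable.
kubectl taint nodes gke-old-pool-node-1 pool-rotation=true:NoSchedule

# The existing nginx controller pods on the node are untouched...
kubectl get pods -n ingress-nginx -o wide --field-selector spec.nodeName=gke-old-pool-node-1

# ...and the node still shows plain Ready (no SchedulingDisabled), so the
# service controller keeps it in the load balancer pool.
kubectl get node gke-old-pool-node-1
```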