kubernetes: kubectl cordon causes downtime of ingress(nginx)

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: In order to migrate to a new version of Kubernetes on GKE, I cordoned (kubectl cordon <node>) all nodes in the old node pool (1.8.9), intending to drain them so that the pods would be rescheduled onto the new node pool (1.10.2). As soon as I cordoned all the nodes in the old node pool, it caused downtime of the ingress (nginx). The nginx controller pods were running fine on the cordoned node pool, but the cordon caused the nodes to be removed from the target pool of the ingress load balancer. The documents below recommend the same method I followed as best practice for a zero-downtime upgrade.

1) https://cloud.google.com/kubernetes-engine/docs/tutorials/migrating-node-pool
2) https://cloudplatform.googleblog.com/2018/06/Kubernetes-best-practices-upgrading-your-clusters-with-zero-downtime.html

What you expected to happen:

I expected “kubectl cordon <node>” not to remove the node(s) from the target pool of the load balancer, because the nginx controller pods were running absolutely fine on these nodes.

How to reproduce it (as minimally and precisely as possible):

1) Cordon all the nodes in an old node pool that runs the nginx controller.
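For illustration, a minimal sketch of that step, assuming the old pool carries the standard GKE label `cloud.google.com/gke-nodepool` and is named `old-pool` (the pool name is a placeholder):

```bash
# Cordon every node in the old node pool; the label value is a placeholder.
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl cordon "$node"
done
```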

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.10.2
  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release): Container OS
  • Kernel (e.g. uname -a): 4.14.22+
  • Install tools:
  • Others: Cordoning removes the nodes from the target pool of the ingress load balancer. The health checks then fail on those nodes, which should not be the case, because the nginx controller pods are running fine on them.

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 33
  • Comments: 46 (24 by maintainers)

Most upvoted comments

This just caused a major outage for us too.

I agree with the previous posts: while this may be working as the code intended, it does not follow the patterns and documentation Kubernetes provides. When I cordon a node, it should not impact the traffic going to the node or any pods on it. I should be able to gradually remove traffic by draining nodes/pods while respecting PDBs.

> The presumption is that a node that is unschedulable is going away, or needs some other repair.

Disagree. Unschedulable means “no new workloads”. It has nothing to do with existing workloads, that is why we have drain (but we should have something in kube that communicates nodes are draining).

I have opened #90823 to remove this check. We have an alternate approach now and the old logic is wrong (the comment in the code even says “this is for masters!”, which is also no longer accurate as part of the KEP).

This just surprised us also. I understand the technical reasons behind it, but the command name cordon and the associated documentation are misleading. To me it isn’t intuitive that cordon does not remove compute (pods) but does remove some networking. I would expect cordon to keep new pods from getting scheduled to the node and keep it from getting added to new services/ELBs/etc. The drain command would then cause pods to be removed and also take the node out of networking paths. We almost need a new, clearly defined “offline” state for a node.

This is a bug with service load balancers - not something we should document. We defined a label in https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/2019-07-16-node-role-label-use.md#service-load-balancer that cluster admins can use - Unschedulable has nothing to do with SLB inclusion.

That label is node.kubernetes.io/exclude-from-external-load-balancers, which is beta and on by default in 1.19.
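A quick sketch of applying that label ahead of maintenance (the node name is a placeholder):

```bash
# Exclude this node from external service load balancers before touching it.
kubectl label node gke-old-pool-node-1 node.kubernetes.io/exclude-from-external-load-balancers=true

# Remove the label again once the node is back in normal service.
kubectl label node gke-old-pool-node-1 node.kubernetes.io/exclude-from-external-load-balancers-
```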

My main issue is with the docs, and also the GKE docs. They should explain this pitfall. After I became aware of the issue, I understood why it works the way it does.

A simple workaround is to taint the nodes you are going to rotate first.
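A sketch of that workaround; the taint key/value is made up and the node name is a placeholder:

```bash
# NoSchedule keeps new pods off the node but leaves existing pods untouched,
# so the node stays Ready and remains in the load balancer target pool.
kubectl taint nodes gke-old-pool-node-1 pool-rotation=true:NoSchedule
```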

This doesn’t address the network issues you’re seeing, but for those trying to “roll their own cordon” using taints:

Using a NoExecute taint to “drain” the node won’t be orderly and won’t respect any PDBs. You want to replicate the logic from kubectl drain, which tries to evict every pod in a loop. (I realize this is a burdensome idea, but it is also the safest one.)
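For reference, that eviction-based flow is what kubectl drain already does; a minimal sketch with a placeholder node name (flag names vary slightly between kubectl versions):

```bash
# Evicts pods one at a time via the Eviction API, which respects PodDisruptionBudgets.
kubectl drain gke-old-pool-node-1 --ignore-daemonsets --timeout=10m
```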

We wanted to replace all of the Kubernetes cluster nodes with different instance types, so we started by cordoning all of the nodes, but this surprisingly caused an outage for us.

As @joekohlsdorf pointed out, the only documentation around cordoning is “Mark node as unschedulable”, which I took to mean that new pods cannot be scheduled on those nodes. It was surprising to find out that, not only does it prevent new pods from being scheduled on those nodes, but it also causes the service controller to remove unschedulable nodes from load-balancing pools. I understand that it’s working as intended, but if that were documented as part of the cordon operation, I would have been able to avoid an outage.

I’ve posted a question regarding this issue in SO some weeks before it was created.

https://stackoverflow.com/questions/50488507/why-does-http-load-balancer-forwarding-rules-exclude-cordoned-gke-nodes/50515954

The suggested action was to use node taints (NoSchedule) instead of cordon; that way the nodes will still be marked as Ready.

Is there a plan for cordon behavior to improve on GCP? We have relied on this behavior for years on our AWS clusters and it was very surprising to find out.

Besides running a test like @kevinkim9264’s above, how would I determine what the behavior is on AWS? I haven’t found anything in the node administration docs or the aws-load-balancer Service docs that would imply that cordon would shift traffic away from a Ready Pod. In fact, the doc for Manual Node Administration explicitly states, emphasis mine:

> Marking a node as unschedulable prevents new pods from being scheduled to that node, but does not affect any existing pods on the node. This is useful as a preparatory step before a node reboot, etc. For example, to mark a node unschedulable, run this command: `kubectl cordon $NODENAME`

If I hadn’t come across this issue due to reports from GKE, I suppose it would only have been a matter of time until we had a high-impact production outage. There’s a disconnect between the Kubernetes API and the LoadBalancer as to whether Pods running on a cordoned node are Ready. From all the docs I’ve seen, I would expect that either a) cordon would evict Pods over to a Ready node before setting the instance as OutOfService, or b) the LoadBalancer would not equate Ready,SchedulingDisabled with OutOfService.

It looks like this happens only to AWS and GCP but not on Azure? Can anyone confirm this?

I just tested in both AWS and Azure, and while I see logs like

`aws_loadbalancer.go:1361] Instances removed from load-balancer`

in AWS Kubernetes, I do not see any logs like that in Azure. Is this intended? If so, isn’t the behavior different for each cloud provider?

i.e., cordoning a node in AWS Kubernetes prevents the load balancer from routing traffic to the affected node, but cordoning a node in Azure Kubernetes will still let the load balancer route traffic to it. Is that intentional?
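For anyone trying to repeat this comparison, a hedged sketch of where such log lines might show up, assuming you can read the controller-manager logs at all (component names, labels, and access vary by provider, and managed offerings often hide these logs):

```bash
# The service controller runs in kube-controller-manager (or cloud-controller-manager
# when an external cloud provider is used); the label selector below assumes a
# kubeadm-style control plane and is only an illustration.
kubectl -n kube-system logs -l component=kube-controller-manager --tail=-1 \
  | grep -i "removed from load-balancer"
```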

Just chiming in: we, as a GKE customer, have today been bitten by this issue in production. The documentation of kubectl cordon (in the context of GKE) is not extensive enough for us to have been aware of this issue. This should be addressed.

Luckily the outage was not severe, because we were using the node-by-node approach (and accepted the container churn this would cause). We explicitly took this approach because our previous node pool upgrade was a total outage, also caused by kubectl cordon, in a different way.

(For those interested in our total outage, which is not related to this issue): kubectl cordon also caused the kube-dns-autoscaler to recalculate the number of schedulable nodes and thus scale down the number of kube-dns pods in our cluster (in our case from 110 back to 2), resulting in a major internal DNS outage for us.

Be warned that kubectl cordon for a nodepool upgrade can have a lot of unwanted & unexpected side effects.
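If you want to check whether your cluster is exposed to the same side effect, a hedged sketch using the default GKE resource names (they may differ in other installs):

```bash
# kube-dns-autoscaler sizes kube-dns proportionally to the number of schedulable
# nodes/cores, so cordoning many nodes at once can scale kube-dns down sharply.
kubectl -n kube-system get configmap kube-dns-autoscaler -o yaml
kubectl -n kube-system get deployment kube-dns
```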

As a short-term workaround, you can taint the nodes before cordoning one-by-one. That will stop new work from arriving, but will leave old work there. If we agree on an appropriately named taint, we can have nginx tolerate it by default. Am I missing any reason why this would not work?

@thockin IIUC, we don’t want new pods to land on the tainted/cordoned nodes. In that case, we can place “NoSchedule” taint on the nodes. We don’t need any tolerations on nginx. This will prevent new instances of nginx from landing on the tainted nodes, but the existing ones will keep running there.
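A small sketch of that behaviour, with a placeholder node name and a made-up taint key: after applying a NoSchedule taint, the existing ingress controller pods should still be listed on the node.

```bash
kubectl taint nodes gke-old-pool-node-1 migration=in-progress:NoSchedule

# Existing pods, including the nginx ingress controller, keep running on the node.
kubectl get pods --all-namespaces -o wide \
  --field-selector spec.nodeName=gke-old-pool-node-1
```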