kubernetes: Race condition between setting node taints and scheduling
What happened:
A pod was scheduled on a newly created node whose Ready condition was still false.

What you expected to happen:
The pod should not be scheduled until the node's Ready condition is true.

How to reproduce it (as minimally and precisely as possible):
1. Run a cluster with Cluster Autoscaler enabled.
2. Add a pod that forces the cluster to scale up.
3. Watch the Node and Pod objects while the new node is being added.

Anything else we need to know?:
The scheduler no longer uses node conditions; it uses taints instead (such as `node.kubernetes.io/not-ready`), but those taints are not added when the kubelet registers the node for the first time: https://github.com/kubernetes/kubernetes/blob/2b96a6074243cf39293fc294b20d3b8c97d3daca/pkg/kubelet/kubelet_node_status.go#L211 The taints are added some time later, but by then the pod may already have been scheduled. A sketch for observing this window is below.
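To make the window concrete, here is a minimal client-go sketch (my own illustration, not part of the original report; the kubeconfig path is an assumption) that watches Node objects and logs any node whose Ready condition is false while the `node.kubernetes.io/not-ready` taint is still absent:

```go
package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	w, err := client.CoreV1().Nodes().Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for event := range w.ResultChan() {
		node, ok := event.Object.(*v1.Node)
		if !ok {
			continue
		}
		ready := false
		for _, c := range node.Status.Conditions {
			if c.Type == v1.NodeReady && c.Status == v1.ConditionTrue {
				ready = true
			}
		}
		tainted := false
		for _, t := range node.Spec.Taints {
			if t.Key == "node.kubernetes.io/not-ready" {
				tainted = true
			}
		}
		if !ready && !tainted {
			// The race window: the scheduler ignores the Ready condition,
			// and no taint keeps pods off the node yet.
			fmt.Printf("node %s: Ready=false and no not-ready taint\n", node.Name)
		}
	}
}
```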
Environment:
- Kubernetes version (use `kubectl version`): 1.12.3-gke.1
- Cloud provider or hardware configuration: GKE
/kind bug
We had a meeting at Google to brainstorm possible solutions. @liggitt suggested adding logic to API server admission that taints new nodes with "not ready" when TaintNodesByCondition is enabled. This seems to be the best solution: it covers both new clusters and existing clusters upgraded to 1.12+, and it works with version skew and older kubelets. A sketch of that logic follows.
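A simplified sketch of what such admission logic might look like (the package and function names here are my own; this is not the plugin that was eventually merged):

```go
// Sketch of an admission mutation for Node CREATE requests: ensure the
// not-ready:NoSchedule taint is present, so new nodes stay unschedulable
// until the node lifecycle controller observes Ready=true and removes it.
package nodetaint

import (
	v1 "k8s.io/api/core/v1"
)

const notReadyTaintKey = "node.kubernetes.io/not-ready"

// admitNode mutates a Node that is being created.
func admitNode(node *v1.Node) {
	for _, t := range node.Spec.Taints {
		if t.Key == notReadyTaintKey && t.Effect == v1.TaintEffectNoSchedule {
			return // taint already present; nothing to do
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, v1.Taint{
		Key:    notReadyTaintKey,
		Effect: v1.TaintEffectNoSchedule,
	})
}
```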
@krzysztof-jastrzebski
The simplest way to solve your problem is to start the kubelet with `--register-with-taints=node.kubernetes.io/not-ready:NoSchedule`.
We could consider adding `node.kubernetes.io/not-ready:NoSchedule` as a default value for the `--register-with-taints` option; see the sketch below for the Node object this produces.
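As a sketch (the node name is hypothetical), this is roughly the Node object the kubelet would create at registration with that flag set, so there is no window in which the node exists without the taint:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// With --register-with-taints=node.kubernetes.io/not-ready:NoSchedule,
	// the kubelet includes the taint in the Node object it creates at
	// registration, so the scheduler never sees the node without it.
	node := &v1.Node{
		ObjectMeta: metav1.ObjectMeta{Name: "example-node"}, // hypothetical name
		Spec: v1.NodeSpec{
			Taints: []v1.Taint{{
				Key:    "node.kubernetes.io/not-ready",
				Effect: v1.TaintEffectNoSchedule,
			}},
		},
	}
	fmt.Printf("node %s registers with taints %+v\n", node.Name, node.Spec.Taints)
}
```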
@szuecs, it's better to disable `TaintNodesByCondition` in your cluster; kube-scheduler will then check the node's not-ready condition.

We should not do that; it's better to let the kubelet update taints for conditions, to avoid this kind of race condition. `TaintNodesByCondition` is required to tolerate a NotReady status introduced by the network.
@szuecs we'll try to fix it ASAP, but we can't give an ETA due to the holidays.
Yes, once it's ready in the master branch, we will also get it backported to each release, so it will be available in the next minor version.