kubernetes: Pods on a Node are deleted even if the NoExecute taint that triggered the deletion is removed
What happened:
A Pod was scheduled for deletion because a taint with effect `NoExecute` was observed on the Node the Pod was assigned to. After that taint is removed from the Node, the scheduled Pod deletion is not canceled.
After scheduling the pod deletion, `NoExecuteTaintManager` only cancels it if all `NoExecute` taints are removed from the node, including any taints that the user relies on for their own use cases and that the pod tolerates indefinitely (example below).
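To make the decision path concrete, here is a minimal, self-contained Go sketch of the behavior described above. It models this issue's description, not the actual `taint_manager.go` source; the names `toleration`, `minTolerationSeconds`, and `handleNodeUpdate` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// toleration keeps only the fields that matter here: which NoExecute taint it
// matches and for how long (nil seconds means the pod tolerates it forever).
type toleration struct {
	taintKey string
	seconds  *int64
}

// minTolerationSeconds returns the smallest bounded tolerationSeconds among
// tolerations matching the node's current NoExecute taints, or -1 if every
// matching toleration is unbounded.
func minTolerationSeconds(nodeTaints []string, tols []toleration) int64 {
	min := int64(-1)
	for _, taint := range nodeTaints {
		for _, t := range tols {
			if t.taintKey != taint || t.seconds == nil {
				continue
			}
			if min < 0 || *t.seconds < min {
				min = *t.seconds
			}
		}
	}
	return min
}

// handleNodeUpdate models the reported decision: a queued deletion is only
// canceled when the node has NO NoExecute taints left at all.
func handleNodeUpdate(nodeTaints []string, tols []toleration, queued *time.Timer) {
	if len(nodeTaints) == 0 {
		if queued != nil {
			queued.Stop() // only this path cancels the scheduled deletion
		}
		return
	}
	min := minTolerationSeconds(nodeTaints, tols)
	if min < 0 {
		// Pod tolerates every remaining taint forever: return early WITHOUT
		// touching the already-queued deletion -- this is the reported bug.
		return
	}
	// (Re)schedule deletion after min seconds; elided in this sketch.
	fmt.Printf("pod would be (re)scheduled for deletion in %ds\n", min)
}

func main() {
	ts := int64(300)
	tols := []toleration{
		{taintKey: "OnlyForMyUsage"},         // tolerated forever
		{taintKey: "NotReady", seconds: &ts}, // tolerated for 300s
	}
	deletion := time.AfterFunc(300*time.Second, func() { fmt.Println("pod deleted") })
	// The NotReady taint has been removed; only OnlyForMyUsage remains:
	handleNodeUpdate([]string{"OnlyForMyUsage"}, tols, deletion)
	// The 300s deletion timer is still running even though NotReady is gone.
	fmt.Println("deletion still pending:", deletion.Stop())
}
```

Running the sketch prints that the deletion is still pending, mirroring the eviction observed in the cluster.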
What you expected to happen:
After the taint that triggered the scheduled pod deletion is removed, `NoExecuteTaintManager` cancels the scheduled deletion.
How to reproduce it (as minimally and precisely as possible):
- A node `MyNode` is tainted with `OnlyForMyUsage:NoExecute`, so any workloads already assigned to it that don't tolerate this taint are evicted.
- A pod `MyPod` is created with two tolerations: `OnlyForMyUsage:NoExecute` with unspecified `tolerationSeconds`, and `NotReady:NoExecute` with `tolerationSeconds: 300` (see the sketch after this list). `MyPod` is assigned to `MyNode`.
- At some point after the previous steps, `MyNode` is tainted with `NotReady:NoExecute`. `NoExecuteTaintManager` gets an update event for `MyNode` and observes two `NoExecute` taints. It calculates the minimum time (in seconds) that `MyPod` tolerates across the two taints, which at this point is 300 seconds, and marks `MyPod` for deletion in ~300 seconds. This happens in memory, in a timed queue.
- At some point after the previous steps, and before the 300-second timer fires, the `NotReady:NoExecute` taint is removed from `MyNode`. `NoExecuteTaintManager` gets an update event for `MyNode` and observes only one `NoExecute` taint, `OnlyForMyUsage:NoExecute`. It calculates the minimum time that `MyPod` tolerates for this taint, which is infinity, and returns without canceling the previously scheduled deletion. It completely ignores the fact that the taint that triggered `MyPod`'s deletion is no longer present.
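For reference, here is a hedged sketch of the repro's tolerations expressed as `k8s.io/api/core/v1` types. The taint keys follow the issue's shorthand; in a real cluster the built-in not-ready taint key is `node.kubernetes.io/not-ready`, which is what the second toleration uses.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	notReadySeconds := int64(300)
	tolerations := []corev1.Toleration{
		{
			// Custom taint the pod tolerates forever (TolerationSeconds unset).
			Key:      "OnlyForMyUsage",
			Operator: corev1.TolerationOpExists,
			Effect:   corev1.TaintEffectNoExecute,
		},
		{
			// Built-in not-ready taint, tolerated for 300 seconds.
			Key:               "node.kubernetes.io/not-ready",
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &notReadySeconds,
		},
	}
	fmt.Printf("%+v\n", tolerations)
}
```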
Anything else we need to know?:
- Worked with @bobveznat to figure this out after all pods on a node were evicted for no apparent reason.
Environment:
- Kubernetes version (use `kubectl version`): 1.17.3 (but looking at the code, this seems to affect all of 1.17.x, 1.18.x, and master)
- Cloud provider or hardware configuration: N/A
- OS (e.g. `cat /etc/os-release`): N/A
- Kernel (e.g. `uname -a`): 5.x
- Install tools: N/A
- Network plugin and version (if this is a network-related bug): N/A
- Others: N/A
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 21 (20 by maintainers)
Looks like we lost some momentum here and I’d like to avoid this falling into the abyss of bugs that never get fixed. When this bug is triggered the outcome is pretty terrible and once it does happen it is very difficult to dig into and understand what actually happened. Did we ever get this routed to the right owner?
API machinery owns the mechanics of the controller manager (controller loop setup and management), but the individual controllers are SIG-specific
I think node lifecycle is jointly owned by sig-node and sig-cloud-provider