kubernetes: Pods on a Node are deleted even if the NoExecute taint that triggered the deletion is removed
What happened:
A Pod was scheduled for deletion because a taint with effect NoExecute was observed on the Node the Pod was assigned to; after that taint was removed from the Node, the Pod's scheduled deletion was not canceled.
After scheduling a pod's deletion, the NoExecuteTaintManager only cancels that deletion if all NoExecute taints are removed from the node, including any taints that the user relies on for their own use cases (example below).
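To make the described behavior concrete, here is a minimal, hedged Go sketch (a simplified stand-in, not the actual NoExecuteTaintManager source): on every node update the eviction delay is recomputed as the minimum tolerationSeconds across the currently observed NoExecute taints, and when the remaining taints are tolerated forever, the recomputation simply yields "forever" and the already-queued eviction is left alone.

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-ins for the corev1 taint/toleration types.
type taint struct{ key string }

type toleration struct {
	key     string
	seconds *int64 // nil means "tolerated forever"
}

// minTolerationTime approximates the calculation described above: the
// smallest tolerationSeconds among the tolerations that match the currently
// observed NoExecute taints. A negative result means the observed taints are
// tolerated indefinitely.
func minTolerationTime(taints []taint, tols []toleration) time.Duration {
	min := time.Duration(-1)
	for _, t := range taints {
		for _, tol := range tols {
			if tol.key != t.key || tol.seconds == nil {
				continue
			}
			d := time.Duration(*tol.seconds) * time.Second
			if min < 0 || d < min {
				min = d
			}
		}
	}
	return min
}

func main() {
	s300 := int64(300)
	tols := []toleration{
		{key: "OnlyForMyUsage", seconds: nil}, // tolerated forever
		{key: "NotReady", seconds: &s300},     // tolerated for 300s
	}

	// Both taints observed: the eviction is queued for ~300s from now.
	fmt.Println(minTolerationTime([]taint{{"OnlyForMyUsage"}, {"NotReady"}}, tols)) // 5m0s

	// NotReady removed: the recomputed value is "forever", and the manager
	// returns early instead of canceling the eviction already in the queue.
	fmt.Println(minTolerationTime([]taint{{"OnlyForMyUsage"}}, tols)) // -1ns
}
```

Nothing in this recomputation ties the queued eviction back to the specific taint (here NotReady) whose tolerationSeconds originally bounded the delay, which is why removing that taint goes unnoticed.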
What you expected to happen:
After the taint that triggered the scheduled pod deletion is removed, the NoExecuteTaintManager cancels the scheduled deletion.
How to reproduce it (as minimally and precisely as possible):
- A node MyNode is tainted with OnlyForMyUsage:NoExecute, so any workloads already assigned to it that don't tolerate this taint are evicted.
- A pod MyPod is created with two tolerations: OnlyForMyUsage:NoExecute with unspecified tolerationSeconds, and NotReady:NoExecute with tolerationSeconds: 300 (see the sketch after this list). MyPod is assigned to MyNode.
- At some point after the previous steps, MyNode is tainted as NotReady:NoExecute. NoExecuteTaintManager gets an update event for MyNode and observes two NoExecute taints. It calculates the minimum time (in seconds) that MyPod tolerates across the two taints, which at this point is 300 seconds, and marks MyPod for deletion in ~300 seconds. This happens in memory, in a timed queue.
- At some point after the previous steps, and before the 300-second timer fires, the NotReady:NoExecute taint is removed from MyNode. NoExecuteTaintManager gets an update event for MyNode and observes only one NoExecute taint, OnlyForMyUsage:NoExecute. It calculates the minimum time (in seconds) that MyPod tolerates for this taint, which is infinity, and returns without canceling the previous deletion. It completely ignores the fact that the taint that triggered MyPod's deletion is no longer observed.
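For concreteness, here is a hedged sketch of the two tolerations from the reproduction above, written with the corev1 Go types. The keys OnlyForMyUsage and NotReady are the placeholder names used in this issue, and the Exists operator is an assumption; in a real cluster the not-ready condition is normally the node.kubernetes.io/not-ready taint.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// tolerationsForMyPod sketches the two tolerations described in the
// reproduction steps: one with no tolerationSeconds (tolerated forever)
// and one bounded to 300 seconds.
func tolerationsForMyPod() []corev1.Toleration {
	seconds := int64(300)
	return []corev1.Toleration{
		{
			// No TolerationSeconds: MyPod tolerates this taint indefinitely.
			Key:      "OnlyForMyUsage",
			Operator: corev1.TolerationOpExists,
			Effect:   corev1.TaintEffectNoExecute,
		},
		{
			// TolerationSeconds set: once this taint appears on the node,
			// an eviction is queued ~300s in the future.
			Key:               "NotReady",
			Operator:          corev1.TolerationOpExists,
			Effect:            corev1.TaintEffectNoExecute,
			TolerationSeconds: &seconds,
		},
	}
}

func main() {
	for _, t := range tolerationsForMyPod() {
		if t.TolerationSeconds != nil {
			fmt.Printf("%s:%s tolerated for %ds\n", t.Key, t.Effect, *t.TolerationSeconds)
		} else {
			fmt.Printf("%s:%s tolerated forever\n", t.Key, t.Effect)
		}
	}
}
```

With these tolerations, the taint manager's 300-second timer is driven entirely by the NotReady toleration; the OnlyForMyUsage toleration never bounds the delay.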
Anything else we need to know?:
- Worked with @bobveznat to figure this out after all pods on a node got evicted for no apparent reason.
Environment:
- Kubernetes version (use kubectl version): 1.17.3 (but looking at the code, this seems to affect all of 1.17.x, 1.18.x, and master)
- Cloud provider or hardware configuration: N/A
- OS (e.g. cat /etc/os-release): N/A
- Kernel (e.g. uname -a): 5.x
- Install tools: N/A
- Network plugin and version (if this is a network-related bug): N/A
- Others: N/A
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 21 (20 by maintainers)
Looks like we lost some momentum here and I'd like to avoid this falling into the abyss of bugs that never get fixed. When this bug is triggered, the outcome is pretty terrible, and once it does happen, it is very difficult to dig into and understand what actually happened. Did we ever get this routed to the right owner?
API machinery owns the mechanics of the controller manager (controller loop setup and management), but the individual controllers are SIG-specific.
I think node lifecycle is jointly owned by sig-node and sig-cloud-provider.