kubernetes: Pods on a Node are deleted even if the NoExecute taint that triggered the deletion is removed

What happened:

A Pod was scheduled for deletion because a taint with effect NoExecute was observed on the Node the Pod was assigned to. After that taint was removed from the Node, the Pod’s scheduled deletion was not canceled.

After scheduling a pod for deletion, the NoExecuteTaintManager only cancels that deletion if all NoExecute taints are removed from the node, including any taints that the user deliberately relies on for their own use cases (example below).
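
To make that concrete, here is a minimal, self-contained Go sketch of the decision path as I understand it. The type and function names (Toleration, minTolerationTime, processPodOnNode, cancelScheduledDeletion) are illustrative stand-ins for the real taint_manager.go logic rather than the actual upstream code, and the taint names match the reproduction steps below.

```go
package main

import (
	"fmt"
	"time"
)

// Toleration mirrors the fields of corev1.Toleration that matter for this sketch.
type Toleration struct {
	Key               string
	TolerationSeconds *int64 // nil means "tolerate this taint forever"
}

// minTolerationTime is an illustrative stand-in for the taint manager's
// minimum-toleration calculation: it returns -1 when every matching
// toleration is unbounded.
func minTolerationTime(tolerations []Toleration) time.Duration {
	minSeconds := int64(-1)
	for _, t := range tolerations {
		if t.TolerationSeconds == nil {
			continue // an unbounded toleration does not constrain the minimum
		}
		if minSeconds < 0 || *t.TolerationSeconds < minSeconds {
			minSeconds = *t.TolerationSeconds
		}
	}
	if minSeconds < 0 {
		return -1 // all matching tolerations are unbounded
	}
	return time.Duration(minSeconds) * time.Second
}

// processPodOnNode sketches the problematic branch: when the remaining taints
// are tolerated forever, it returns early instead of canceling a deletion that
// an earlier, now-removed taint scheduled.
func processPodOnNode(tolerations []Toleration, cancelScheduledDeletion func()) {
	if minTolerationTime(tolerations) < 0 {
		// Reported bug: a deletion already sitting in the timed queue for a
		// taint that has since been removed is left there, not canceled.
		fmt.Println("tolerated forever: returning without canceling the scheduled deletion")
		return
	}
	// ... otherwise (re)schedule deletion based on the new minimum toleration time ...
	_ = cancelScheduledDeletion
}

func main() {
	forever := Toleration{Key: "OnlyForMyUsage"} // no tolerationSeconds
	fiveMinutes := int64(300)
	notReady := Toleration{Key: "NotReady", TolerationSeconds: &fiveMinutes}

	// Both taints present: minimum toleration is 300s, so deletion is scheduled.
	fmt.Println(minTolerationTime([]Toleration{forever, notReady})) // 5m0s

	// NotReady removed: only the unbounded toleration remains, and the early
	// return above leaves the 300s deletion in place.
	processPodOnNode([]Toleration{forever}, func() { fmt.Println("canceled") })
}
```

Running this prints 5m0s for the first step and then takes the “tolerated forever” branch for the second, which mirrors why the scheduled deletion survives the taint removal.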

What you expected to happen:

After the taint that triggered the scheduled pod deletion is removed, the NoExecuteTaintManager cancels the scheduled deletion.

How to reproduce it (as minimally and precisely as possible):

  • A node MyNode is tainted with OnlyForMyUsage:NoExecute, so any workloads already assigned to it that don’t tolerate this taint are evicted.
  • A pod MyPod is created with two tolerations: OnlyForMyUsage:NoExecute with tolerationSeconds unspecified (tolerate forever) and NotReady:NoExecute with tolerationSeconds: 300 (see the sketch after this list).
  • MyPod is assigned to MyNode.
  • Some time after the previous steps, MyNode is tainted with NotReady:NoExecute.
  • The NoExecuteTaintManager gets an update event for MyNode and observes two NoExecute taints. It calculates the minimum time (in seconds) that MyPod tolerates across the two taints, which at this point is 300 seconds, and marks MyPod for deletion in ~300 seconds. This happens in memory, in a timed queue.
  • Some time later, and before the 300-second timer fires, the NotReady:NoExecute taint is removed from MyNode.
  • The NoExecuteTaintManager gets an update event for MyNode and observes only one NoExecute taint, OnlyForMyUsage:NoExecute. It calculates the minimum time that MyPod tolerates for this taint, which is infinite, and returns without canceling the previously scheduled deletion. It completely ignores the fact that the taint that triggered MyPod’s deletion is no longer present.
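
For reference, here is a minimal sketch of MyPod’s tolerations using the k8s.io/api/core/v1 types. The pod name, image, and the NotReady key are placeholders following the shorthand above (the built-in taint key is node.kubernetes.io/not-ready); this is illustrative, not the exact manifest we ran.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// myPod builds a Pod with the two tolerations from the reproduction steps:
// an unbounded toleration for the custom taint and a 300-second toleration
// for the NotReady taint.
func myPod() *corev1.Pod {
	notReadySeconds := int64(300)
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "my-pod"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{
				{Name: "app", Image: "registry.example.com/app:latest"}, // placeholder image
			},
			Tolerations: []corev1.Toleration{
				{
					// TolerationSeconds left nil: tolerate this taint forever.
					Key:      "OnlyForMyUsage",
					Operator: corev1.TolerationOpExists,
					Effect:   corev1.TaintEffectNoExecute,
				},
				{
					// Tolerate the NotReady taint for at most 300 seconds.
					Key:               "NotReady",
					Operator:          corev1.TolerationOpExists,
					Effect:            corev1.TaintEffectNoExecute,
					TolerationSeconds: &notReadySeconds,
				},
			},
		},
	}
}

func main() { _ = myPod() }
```

Note that TolerationSeconds is a *int64, so leaving it nil (rather than setting it to 0) is what makes the first toleration unbounded.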

Anything else we need to know?:

  • Worked with @bobveznat to figure this out after all pods on a node were evicted for no apparent reason.

Environment:

  • Kubernetes version (use kubectl version): 1.17.3 (but looking at the code, this seems to affect all of 1.17.x, 1.18.x, and master).
  • Cloud provider or hardware configuration: N/A
  • OS (e.g: cat /etc/os-release): N/A
  • Kernel (e.g. uname -a): 5.x
  • Install tools: N/A
  • Network plugin and version (if this is a network-related bug): N/A
  • Others: N/A

cc @gmarek @bowei @k82cn (owners of this code)

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 21 (20 by maintainers)

Most upvoted comments

Looks like we lost some momentum here, and I’d like to avoid this falling into the abyss of bugs that never get fixed. When this bug is triggered, the outcome is pretty terrible, and once it does happen it is very difficult to dig into and understand what actually happened. Did we ever get this routed to the right owner?

API machinery owns the mechanics of the controller manager (controller loop setup and management), but the individual controllers are SIG-specific.

I think node lifecycle is jointly owned by sig-node and sig-cloud-provider.