kubernetes: pod stuck on the nominated node cannot release resources in a timely manner

What would you like to be added: Cap the number of retries on the nominated node so that other nodes in the cluster can be chosen for preemption.

Why is this needed:

After a preemption, a nominated node can be set on the preemptor pod. However, once the deletion REST call has been sent for the victims, https://github.com/kubernetes/kubernetes/blob/9a6e35a16a92feac757bf0621a09a2661f617617/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go#L601-L603 the resources are not actually released yet; that work still has to be done by the kubelet and the container runtime (Docker, for example).
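For context, a minimal sketch of what that deletion call amounts to, assuming a client-go clientset; the helper name below is illustrative, not the actual scheduler code:

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteVictim is illustrative only: the scheduler's victim deletion boils
// down to an API call like this one. It returns as soon as the pod object is
// marked for deletion; the kubelet and the container runtime tear the pod
// down later, asynchronously.
func deleteVictim(ctx context.Context, cs kubernetes.Interface, namespace, name string) error {
	// This only sets DeletionTimestamp on the pod; its resources stay
	// occupied on the node until the containers are actually stopped.
	return cs.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{})
}
```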

This is out of the scheduler's control, and an abnormal condition (Docker being down, for example) can cause the resources to never be released. When that happens, the scheduler should be able to preempt pods on other nodes in the cluster to release resources for the high-priority pod, but that is not the case today, because the pod is not eligible to preempt resources on another node: https://github.com/kubernetes/kubernetes/blob/9a6e35a16a92feac757bf0621a09a2661f617617/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go#L211-L213
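Roughly, the check behind that link behaves like the sketch below (a simplified paraphrase with names and details abridged, not the exact upstream code): as soon as any lower-priority pod on the nominated node is terminating, the preemptor is declared ineligible to preempt anywhere else, with no bound on how long that state can last.

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	corev1helpers "k8s.io/component-helpers/scheduling/corev1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// podEligibleToPreemptOthers is a simplified paraphrase of the eligibility
// check referenced above.
func podEligibleToPreemptOthers(pod *v1.Pod, nodeInfos framework.NodeInfoLister) bool {
	nomNodeName := pod.Status.NominatedNodeName
	if len(nomNodeName) == 0 {
		return true
	}
	nodeInfo, err := nodeInfos.Get(nomNodeName)
	if err != nil || nodeInfo == nil {
		return true
	}
	podPriority := corev1helpers.PodPriority(pod)
	for _, p := range nodeInfo.Pods {
		if p.Pod.DeletionTimestamp != nil && corev1helpers.PodPriority(p.Pod) < podPriority {
			// A lower-priority pod on the nominated node is still terminating,
			// so the preemptor is blocked from preempting on any other node,
			// no matter how long that victim has been stuck terminating.
			return false
		}
	}
	return true
}
```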

This leads to an infinite loop of scheduling attempts and preemption checks, even when there are other healthy nodes that could schedule the pod after releasing enough resources.


About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 21 (21 by maintainers)

Most upvoted comments

I think we’re talking about two things here:

  1. When to dishonor a Pod’s nominatedNodeName field?
  2. The “over-relaxed” logic that decides whether a nominated node should continue to be honored, as @chendave mentioned: https://github.com/kubernetes/kubernetes/blob/9a6e35a16a92feac757bf0621a09a2661f617617/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go#L211-L213

I understand that adding a retry limit may resolve problem (1), at the cost of maintaining a stateful counter, which I’m not quite fond of. Instead of solving (1), I believe we should brainstorm on (2) so as to clear up the issue thoroughly. Here are some rough ideas to resolve (or mitigate) it:

  1. Once the scheduler preempts a Pod, set a special .status.conditions[*] entry to tag it as a victim of scheduler preemption. This not only helps with later scheduling, but also helps end users distinguish scheduler preemption from kubelet eviction. This condition can help us identify whether those low-priority terminating Pods are victims or not.
  2. With that special condition in place, if the Pod is a victim, we can define a fixed timeout serving as a “tolerationGracePeriod” and check whether victim.DeletionTimestamp.Before(preemptor.CreationTimestamp.Add(tolerationDuration)). If yes, it means we have waited long enough, so disregard the terminating victims; otherwise, still give the victims some time to finish their teardown work (see the sketch after this list). BTW: preemptor.CreationTimestamp is a draft idea; we may use lastScheduledTimestamp alternatively.
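A rough sketch of how ideas (1) and (2) could fit together, purely for illustration: the condition type, the helper names, and the tolerationDuration parameter are hypothetical, not existing Kubernetes APIs.

```go
package sketch

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// PodPreemptedByScheduler is a hypothetical condition type for idea (1):
// stamped on a victim when the scheduler preempts it, so it can be
// distinguished from kubelet eviction.
const PodPreemptedByScheduler v1.PodConditionType = "PreemptedByScheduler"

// isSchedulerPreemptionVictim reports whether the pod carries the condition
// from idea (1).
func isSchedulerPreemptionVictim(pod *v1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == PodPreemptedByScheduler && c.Status == v1.ConditionTrue {
			return true
		}
	}
	return false
}

// waitedLongEnough mirrors the comparison from idea (2): if the victim's
// DeletionTimestamp falls before preemptor.CreationTimestamp plus the
// tolerationDuration, the grace period is considered exhausted and the
// terminating victim can be disregarded.
func waitedLongEnough(victim, preemptor *v1.Pod, tolerationDuration time.Duration) bool {
	if victim.DeletionTimestamp == nil || !isSchedulerPreemptionVictim(victim) {
		return false
	}
	deadline := preemptor.CreationTimestamp.Add(tolerationDuration)
	return victim.DeletionTimestamp.Time.Before(deadline)
}
```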

One more thing to think about: a Pod that requires more Pods to be preempted should have more retries.

What is your proposal: a number of retries, or perhaps a time limit? Should we make it configurable in the preemption plugin args?
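If it did become configurable, one purely hypothetical shape for the plugin args (neither of the two new fields exists in the real DefaultPreemptionArgs today):

```go
package sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// DefaultPreemptionArgs is a hypothetical extension of the DefaultPreemption
// plugin args; the two new fields illustrate the retry-cap and time-limit
// options discussed above.
type DefaultPreemptionArgs struct {
	metav1.TypeMeta `json:",inline"`

	// Existing fields (MinCandidateNodesPercentage, MinCandidateNodesAbsolute)
	// omitted for brevity.

	// NominatedNodeRetryLimit would cap how many scheduling attempts keep
	// honoring a stale nominatedNodeName before preemption on other nodes is
	// considered. Hypothetical field.
	NominatedNodeRetryLimit int32 `json:"nominatedNodeRetryLimit,omitempty"`

	// NominatedNodeTimeoutSeconds would be the time-limit alternative: how
	// long to keep waiting for victims on the nominated node to terminate.
	// Hypothetical field.
	NominatedNodeTimeoutSeconds int64 `json:"nominatedNodeTimeoutSeconds,omitempty"`
}
```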