kubernetes: pod stuck on the nominated node when resources are not released in time
What would you like to be added: Cap the number of retries on the nominated node so that other nodes in the cluster can be chosen for preemption.
Why is this needed:
After preemption, a nominated node can be set for the preemptor pod. However, after the deletion API call for the victims has been sent,
https://github.com/kubernetes/kubernetes/blob/9a6e35a16a92feac757bf0621a09a2661f617617/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go#L601-L603
the resources are not actually freed yet; that work has to be done by the kubelet and the container runtime (Docker, for example).
This is out of the scheduler's control, and something abnormal (for example, Docker being down) can prevent the resources from ever being released. When that happens, the scheduler should be able to preempt pods on other nodes in the cluster to release resources for the high-priority pod, but that is not the case today because the pod is not eligible to preempt resources from another node (sketched below): https://github.com/kubernetes/kubernetes/blob/9a6e35a16a92feac757bf0621a09a2661f617617/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go#L211-L213
This leads to an infinite loop of scheduling and preemption checks, even when there are other healthy nodes that could schedule the pod after enough resources are released.
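To illustrate the behavior, here is a simplified Go sketch of this kind of eligibility check (this is not the actual PodEligibleToPreemptOthers implementation; podPriority and podsOnNominatedNode are illustrative assumptions):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
)

// podPriority is a small illustrative helper, not a scheduler API.
func podPriority(p *v1.Pod) int32 {
	if p.Spec.Priority != nil {
		return *p.Spec.Priority
	}
	return 0
}

// podEligibleToPreemptOthers sketches the check described above: while any
// lower-priority pod on the nominated node is still terminating, the preemptor
// is treated as ineligible to preempt on other nodes. If such a victim never
// finishes terminating (e.g. the container runtime is down), this keeps
// returning false and the scheduler keeps looping on the same nominated node.
func podEligibleToPreemptOthers(preemptor *v1.Pod, podsOnNominatedNode []*v1.Pod) bool {
	if preemptor.Status.NominatedNodeName == "" {
		return true // nothing nominated yet; preemption may be attempted anywhere
	}
	for _, p := range podsOnNominatedNode {
		if p.DeletionTimestamp != nil && podPriority(p) < podPriority(preemptor) {
			// A victim is still being torn down, so keep waiting on this node
			// instead of considering preemption on other nodes.
			return false
		}
	}
	return true
}
```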
Filters that have been evaluated and that may be impacted by the preemption:

- PodTopologySpread: the preemptor pod can still be scheduled on the nominatedNode even though the resources are not released, since pods that are being deleted on the fly are skipped (see the sketch after this list): https://github.com/kubernetes/kubernetes/blob/9a6e35a16a92feac757bf0621a09a2661f617617/pkg/scheduler/framework/plugins/podtopologyspread/common.go#L91-L93
- NodeResourcesFit: the preemptor pod will be stuck on the nominated node and the scheduler will loop forever!
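As a concrete example of the first bullet, here is a minimal sketch of the "skip pods that are being deleted" behavior (illustrative only; countPodsMatching is not the actual function in podtopologyspread/common.go):

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// countPodsMatching sketches why the spread constraints can already pass on the
// nominated node before the victims' resources are actually freed: pods that
// have a DeletionTimestamp are simply not counted.
func countPodsMatching(podsOnNode []*v1.Pod, selector labels.Selector) int {
	count := 0
	for _, p := range podsOnNode {
		if p.DeletionTimestamp != nil {
			continue // terminating victims are ignored by the spread calculation
		}
		if selector.Matches(labels.Set(p.Labels)) {
			count++
		}
	}
	return count
}
```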
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 21 (21 by maintainers)
I think we're talking about two things here:
I understand that adding a cap on retries may resolve problem (1), at the cost of maintaining a stateful counter, which I'm not quite sure I'm fond of. Instead of solving (1), I believe we should brainstorm on (2) so as to clear the issue thoroughly. Here are some rough ideas to resolve (or mitigate) it: check whether victim.DeletionTimestamp.Before(preemptor.CreationTimestamp.Add(tolerationDuration)). If yes, it means we have waited long enough, so disregard the terminating victims; otherwise, still give the victims some time to finish their teardown work. BTW: preemptor.CreationTimestamp is a draft idea; we may use lastScheduledTimestamp alternatively.

One more thing to think about: a Pod that requires more Pods to be preempted should have more retries.
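A minimal Go sketch of that draft check, assuming a configurable tolerationDuration (the duration, the use of CreationTimestamp, and the function name are all assumptions rather than an agreed-on API):

```go
package sketch

import (
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// victimCanBeDisregarded mirrors the draft expression above: once the victim's
// DeletionTimestamp falls before preemptor.CreationTimestamp plus the
// toleration window, we have waited long enough, so the terminating victim is
// no longer treated as "about to release resources" and the scheduler may
// consider other nodes.
func victimCanBeDisregarded(victim, preemptor *v1.Pod, tolerationDuration time.Duration) bool {
	if victim.DeletionTimestamp == nil {
		return false // not terminating; handle it as a normal victim
	}
	deadline := metav1.NewTime(preemptor.CreationTimestamp.Add(tolerationDuration))
	return victim.DeletionTimestamp.Before(&deadline)
}
```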
What is your proposal? A number of retries, or perhaps a time limit? Should we make it configurable in the preemption plugin args?