kubernetes: Unexpected scheduling of replacement Pod upon Pod deletion
Background
As you may know, the Kubernetes scheduler internally handles “unassigned” Pods (empty .spec.nodeName) and “assigned” Pods (non-empty .spec.nodeName) through two separate event-handler paths that run concurrently (sketched below):
- assigned Pods are handled in the internal cache: https://github.com/kubernetes/kubernetes/blob/15c3f1b119ff4b4073c86df05c819a9672b80d68/pkg/scheduler/eventhandlers.go#L348-L349
- unassigned Pods are handled in the internal scheduling queue(s): https://github.com/kubernetes/kubernetes/blob/15c3f1b119ff4b4073c86df05c819a9672b80d68/pkg/scheduler/eventhandlers.go#L373-L374
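Roughly, the wiring looks like the sketch below. This is a simplified illustration of the filtered-event-handler pattern, not the actual eventhandlers.go code; the handler bodies are placeholders standing in for the scheduler's real cache/queue methods. The key point is that each filtered handler sees its own stream of events, so ordering between the two paths is not coordinated.

```go
package schedulerexample

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// assignedPod reports whether the Pod has already been bound to a node.
func assignedPod(p *v1.Pod) bool {
	return len(p.Spec.NodeName) != 0
}

// registerPodHandlers wires two filtered handlers onto the shared Pod informer:
// one for assigned Pods (maintained in the scheduler cache) and one for
// unassigned Pods (fed into the scheduling queue).
func registerPodHandlers(client kubernetes.Interface) {
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	// Assigned Pods -> scheduler cache (placeholder handlers).
	// (The real code also unwraps cache.DeletedFinalStateUnknown tombstones.)
	podInformer.AddEventHandler(cache.FilteringResourceEventHandler{
		FilterFunc: func(obj interface{}) bool {
			pod, ok := obj.(*v1.Pod)
			return ok && assignedPod(pod)
		},
		Handler: cache.ResourceEventHandlerFuncs{
			AddFunc:    func(obj interface{}) { fmt.Println("add assigned Pod to cache") },
			DeleteFunc: func(obj interface{}) { fmt.Println("delete assigned Pod from cache") },
		},
	})

	// Unassigned Pods -> scheduling queue (placeholder handlers).
	podInformer.AddEventHandler(cache.FilteringResourceEventHandler{
		FilterFunc: func(obj interface{}) bool {
			pod, ok := obj.(*v1.Pod)
			return ok && !assignedPod(pod)
		},
		Handler: cache.ResourceEventHandlerFuncs{
			AddFunc:    func(obj interface{}) { fmt.Println("add unassigned Pod to scheduling queue") },
			DeleteFunc: func(obj interface{}) { fmt.Println("remove unassigned Pod from scheduling queue") },
		},
	})
}
```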
It works most of the time, except for the case where the scheduling decision for a newly added Pod strictly depends on a previously deleted Pod. For example, if you delete a Pod owned by a Deployment, (1) the Pod is deleted, and then (2) a replacement Pod is created to backfill it. On the timeline, (1) happens prior to (2); in the scheduler, however, (2) may be processed while (1) is still in progress. In that case, the decision for the replacement Pod is made based on the presence of the terminating Pod, which is not the desired result.
This scenario can easily be reproduced using pod topology spread constraints or pod anti-affinity; #86037 is one such report.
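For concreteness, a constraint along these lines is enough to trigger it (a made-up example using the corev1 API types; the app=foo label and the hostname topology key are assumptions): the replacement Pod's placement depends on how the existing app=foo Pods are spread, so a still-terminating Pod that is counted toward the skew leads to a wrong decision.

```go
package schedulerexample

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// spreadConstraint is a hypothetical topology spread constraint: Pods labeled
// app=foo must be spread across nodes with a maximum skew of 1. If the
// replacement Pod is scheduled while its predecessor is still terminating,
// the terminating Pod is counted toward the skew and can push the new Pod
// onto a different node, or make it temporarily unschedulable.
var spreadConstraint = v1.TopologySpreadConstraint{
	MaxSkew:           1,
	TopologyKey:       "kubernetes.io/hostname",
	WhenUnsatisfiable: v1.DoNotSchedule,
	LabelSelector: &metav1.LabelSelector{
		MatchLabels: map[string]string{"app": "foo"},
	},
}
```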
Is it an Issue?
Before jumping to possible solutions, we should first agree on whether this is actually an issue.
- We should first verify, from the client-go/API machinery perspective, whether informers will always deliver the replacement Pod's add event after the deleted Pod's delete event. If that ordering is not guaranteed, there is no need to continue the discussion.
- From the scheduler's point of view, can we simply say it is not an issue, since we handle Pod events individually rather than making one event wait for a preceding event to finish? (Needless to say, it is hard to figure out the implicit dependency between Pods.)
- From the user's perspective, it is an issue, as the scheduling decision for the replacement Pod does not make sense.
Workaround
As mentioned in https://github.com/kubernetes/kubernetes/issues/86037#issuecomment-563095248, we can use kubectl delete pod <pod name> --force=true --grace-period=0 to alleviate the issue.
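The same workaround expressed with client-go looks roughly like this (a sketch; the clientset, namespace, and pod name are whatever applies in your environment). A grace period of 0 removes the Pod object immediately, so the replacement Pod's scheduling no longer races with a slowly terminating predecessor.

```go
package schedulerexample

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePod deletes a Pod with grace period 0, mirroring
// `kubectl delete pod <pod name> --force=true --grace-period=0`.
func forceDeletePod(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	gracePeriod := int64(0)
	return client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &gracePeriod,
	})
}
```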
Possible Solutions
(This part will be updated now and then…)
- Hold the scheduling cycle (using a proper lock) if a terminating Pod that the Pod being scheduled depends on is still being processed in the caching goroutine.
- Use a single workqueue to enqueue both assigned and unassigned Pods, then process each item in either the cache or the scheduling queue in arrival order (a rough sketch is given below). However, this may slow down scheduling throughput.
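The second idea could look roughly like the sketch below. It is only an illustration of the ordering argument (a plain channel stands in for whatever queue implementation would actually be used; the handler funcs are placeholders): because both kinds of events go through one FIFO, the delete of the old Pod reaches the cache before the add of the replacement Pod reaches the scheduling queue, at the cost of serializing the two paths.

```go
package schedulerexample

import v1 "k8s.io/api/core/v1"

// podEvent is a hypothetical unified event for both assigned and
// unassigned Pods.
type podEvent struct {
	pod     *v1.Pod
	deleted bool
}

// dispatcher serializes all Pod events through one FIFO so that the delete
// of a terminating Pod is applied to the cache before the add of its
// replacement is handed to the scheduling queue.
type dispatcher struct {
	events chan podEvent
	// applyToCache and addToSchedulingQueue are placeholders for the
	// scheduler's real cache and queue operations.
	applyToCache         func(podEvent)
	addToSchedulingQueue func(podEvent)
}

// run processes events strictly in arrival order; ordering between the two
// code paths is preserved at the price of serializing them.
func (d *dispatcher) run() {
	for ev := range d.events {
		if ev.pod.Spec.NodeName != "" {
			d.applyToCache(ev) // assigned Pod: update the cache
		} else {
			d.addToSchedulingQueue(ev) // unassigned Pod: enqueue for scheduling
		}
	}
}
```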
/sig scheduling
/priority important-longterm
/assign
I think there may be a case like this: Say there is a deployment: