kubernetes: Unexpected scheduling of replacement Pod upon Pod deletion

Background

As you may know, internally the Kubernetes scheduler handles “unassigned” Pods (empty .spec.nodeName, in the scheduling queue) and “assigned” Pods (non-empty .spec.nodeName, in the cache) concurrently.

This works most of the time, except when the scheduling decision for a newly added Pod strictly depends on a previously deleted Pod. For example, if you delete a Pod owned by a Deployment, (1) the old Pod is deleted, and then (2) a backfilling Pod is created. On the timeline, (1) happens prior to (2); in the scheduler, however, (2) may be processed while (1) is still being handled. In that case, the decision for the backfilling Pod is based on the presence of the terminating Pod, which is not the desired result.

Using pod topology spread constraints or pod anti-affinity can easily reproduce this scenario; #86037 is one data point.
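For illustration only (not part of the original report), here is a minimal Go sketch, built with the k8s.io/api types, of a pod template with a hard topology spread constraint that can expose this race: while a deleted replica is still terminating on a node, it may still be counted toward that node’s skew, so the backfilling Pod can be pushed elsewhere or left Pending. The labels and image are hypothetical.

```go
package repro

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical pod template with a hard topology spread constraint. If the
// terminating Pod is still counted when the backfilling Pod is scheduled,
// the skew calculation is wrong and the replacement may not land on the
// node that is actually being freed.
func spreadPodTemplate() corev1.PodTemplateSpec {
	labels := map[string]string{"app": "demo"} // hypothetical label
	return corev1.PodTemplateSpec{
		ObjectMeta: metav1.ObjectMeta{Labels: labels},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "app", Image: "nginx"}},
			TopologySpreadConstraints: []corev1.TopologySpreadConstraint{{
				MaxSkew:           1,
				TopologyKey:       "kubernetes.io/hostname",
				WhenUnsatisfiable: corev1.DoNotSchedule,
				LabelSelector:     &metav1.LabelSelector{MatchLabels: labels},
			}},
		},
	}
}
```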

Is it an Issue?

Before jumping to possible solutions, we should brainstorm and first settle whether it’s an issue at all.

  • We should first verify, from the client-go/API perspective, whether informers will always deliver the new Pod’s event after the deleted Pod’s event (see the informer sketch after this list). If that ordering is uncertain, we don’t need to continue the discussion.
  • From the scheduler’s point of view, can we simply say it’s not an issue, since we handle Pod events individually rather than holding one event until a preceding event has finished being processed? (Needless to say, it’s hard to figure out the implicit dependency between Pods.)
  • From the user’s perspective, it is an issue, as the scheduling decision for the replacement Pod doesn’t make sense.
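Regarding the first bullet, the following client-go sketch (hypothetical package and function names) simply logs the order in which Pod Add and Delete events are delivered by a shared informer; it can observe the ordering, but not enforce it.

```go
package watchorder

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// observePodEvents logs when Pod Add and Delete events arrive, so one can
// check whether the old Pod's Delete is always seen before the new Pod's Add.
func observePodEvents(client kubernetes.Interface, stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Printf("ADD    %s/%s at %s\n", pod.Namespace, pod.Name, time.Now().Format(time.RFC3339Nano))
		},
		DeleteFunc: func(obj interface{}) {
			pod, ok := obj.(*corev1.Pod)
			if !ok {
				// Deletes can arrive as DeletedFinalStateUnknown tombstones.
				tombstone := obj.(cache.DeletedFinalStateUnknown)
				pod = tombstone.Obj.(*corev1.Pod)
			}
			fmt.Printf("DELETE %s/%s at %s\n", pod.Namespace, pod.Name, time.Now().Format(time.RFC3339Nano))
		},
	})

	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
}
```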

Workaround

As mentioned in https://github.com/kubernetes/kubernetes/issues/86037#issuecomment-563095248, we can use kubectl delete pod <pod name> --force=true --grace-period=0 to alleviate the issue.
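For completeness, a rough client-go equivalent of that kubectl command (package and function names are hypothetical): deleting with GracePeriodSeconds set to 0 removes the Pod object without waiting for graceful termination.

```go
package forcedelete

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePod deletes a Pod with a zero grace period, so the terminating
// Pod does not linger in the API server and confuse the next scheduling cycle.
func forceDeletePod(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	zero := int64(0)
	return client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &zero,
	})
}
```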

Possible Solutions

(This part will be updated now and then…)

  • Hold the scheduling cycle (using a proper lock) if a dependent terminated Pod is still being processed in the caching goroutine.
  • Use a workqueue to enqueue both assigned and unassigned Pods, then process each item in either the cache or the scheduling queue (a rough sketch follows this list). However, this may slow down the throughput.
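A rough, hypothetical sketch of the second idea, not how the scheduler is actually implemented: a single work queue serializes events for assigned and unassigned Pods, so a deleted Pod is applied to the cache before its backfilling Pod is handed to the scheduling queue. The cache and scheduling-queue types here are placeholders.

```go
package serialize

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/util/workqueue"
)

// podEvent carries a Pod plus whether the event was a deletion.
type podEvent struct {
	pod     *corev1.Pod
	deleted bool
}

type fakeCache struct{}           // stands in for the scheduler cache
type fakeSchedulingQueue struct{} // stands in for the scheduling queue

func (c *fakeCache) apply(ev podEvent)               {}
func (q *fakeSchedulingQueue) enqueue(p *corev1.Pod) {}

// run drains one queue serially, so the Delete of the old Pod (enqueued
// first) is fully applied to the cache before the backfilling Pod is
// considered for scheduling.
func run(queue workqueue.Interface, cache *fakeCache, schedQ *fakeSchedulingQueue) {
	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		ev := item.(podEvent)
		if ev.pod.Spec.NodeName != "" || ev.deleted {
			// Assigned or deleted Pods update the cache first...
			cache.apply(ev)
		} else {
			// ...and only then are unassigned Pods handed to the scheduling queue.
			schedQ.enqueue(ev.pod)
		}
		queue.Done(item)
	}
}
```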

/sig scheduling
/priority important-longterm
/assign

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 28 (28 by maintainers)

Most upvoted comments

I think there may be a case like this: Say there is a deployment:

  1. Pods in the deployment use a host-path volume.
  2. A user wants to guarantee that one host-path on a node can only be used by one pod, so he uses pod anti-affinity to achieve this.
  3. The pod’s terminating behavior is to clean up the host-path.
  4. Once a pod gets stuck terminating (for any reason), we may schedule a new pod to this node, as we exclude the terminating one.
  5. The new pod starts running and writes something to the host-path.
  6. The previous pod eventually terminates successfully and cleans up what the new pod has written.
  7. In the end, things are messed up.
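A hypothetical pod template matching that scenario (the names, paths, and the preStop cleanup standing in for the “terminating behavior” in step 3 are illustrative only): if a replacement Pod is scheduled onto the node while this Pod is still terminating, the preStop cleanup can delete data the new Pod has already written.

```go
package hostpathrace

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hostPathPodTemplate builds a pod template with a hostPath volume, a hard
// anti-affinity rule (at most one such Pod per node), and a preStop hook
// that wipes the hostPath on termination.
func hostPathPodTemplate() corev1.PodTemplateSpec {
	labels := map[string]string{"app": "hostpath-demo"} // hypothetical label
	return corev1.PodTemplateSpec{
		ObjectMeta: metav1.ObjectMeta{Labels: labels},
		Spec: corev1.PodSpec{
			Volumes: []corev1.Volume{{
				Name: "data",
				VolumeSource: corev1.VolumeSource{
					HostPath: &corev1.HostPathVolumeSource{Path: "/var/lib/demo"}, // hypothetical path
				},
			}},
			Containers: []corev1.Container{{
				Name:         "app",
				Image:        "busybox",
				VolumeMounts: []corev1.VolumeMount{{Name: "data", MountPath: "/data"}},
				Lifecycle: &corev1.Lifecycle{
					// The "terminating behavior": clean up the host-path on the way out.
					PreStop: &corev1.LifecycleHandler{
						Exec: &corev1.ExecAction{Command: []string{"sh", "-c", "rm -rf /data/*"}},
					},
				},
			}},
			Affinity: &corev1.Affinity{
				PodAntiAffinity: &corev1.PodAntiAffinity{
					// Hard rule: only one "app=hostpath-demo" Pod per node.
					RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
						LabelSelector: &metav1.LabelSelector{MatchLabels: labels},
						TopologyKey:   "kubernetes.io/hostname",
					}},
				},
			},
		},
	}
}
```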