kubernetes: race condition detected during scheduling with preemption

What happened: a race condition was detected during scheduling with preemption

What you expected to happen: the high-priority preemptor should be scheduled successfully.

How to reproduce it (as minimally and precisely as possible): In the case of preemption, with the assumptions below,

  • node1 has a replication controller with a couple of pods running, and is considered the candidate node for preemption,


  • node2 also has some pods running on it, but it will not be considered a candidate node due to either priority or topology constraints,

the scheduling sequence will very likely hit a race condition like the one below:

step1: the preemptor pod (pod6) arrives and no fit node is found for it
step2: an error is recorded and pod6 is added into the unschedulableQ
step3: preemption happens on node1 and pod1 is deleted
step4: pod1 is added to the activeQ
step5: the NominatedNodeName field of pod6 is patched
step6: the next scheduling cycle starts and pod1 is the only pod in the activeQ
step7: the pod in the activeQ (pod1) is scheduled and the candidate node is node1
step8: the OnUpdate event of the preemptor pod (pod6) is captured and pod6 is moved from the unschedulableQ to the activeQ
step9: pod1 is scheduled on node1
step10: the next scheduling cycle starts and pod6 is the only pod in the activeQ
step11: preemption happens again and pod2 is removed this time
step12: the error is recorded again, the pod is moved to the podBackoffQ, and the nominated node is still node1
step13: the pod condition is not changed and the nominated node is not changed, so the pod is not updated; pod6 stays in the podBackoffQ
step14: pod6 needs to wait for the backoff time to expire before being moved from the podBackoffQ to the activeQ
step15: pod2 is scheduled on node1

looping forever from here !!! along with pod deletion and preemption on node1. … …

The root cause is that the OnUpdate event in step8 is not guaranteed to finish before step7; the two are handled asynchronously.
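To make the asynchrony concrete, here is a minimal, self-contained Go sketch (not the real scheduler code; the queue type and pod names are made up) of how the event-handler goroutine and the scheduling loop can interleave either way:

```go
// Minimal sketch of the race between step7 and step8: the handler that moves
// pod6 from the unschedulableQ to the activeQ runs in a separate goroutine
// from the scheduling loop, so there is no ordering guarantee between them.
package main

import (
	"fmt"
	"sync"
	"time"
)

type queue struct {
	mu   sync.Mutex
	pods []string
}

func (q *queue) push(p string) { q.mu.Lock(); q.pods = append(q.pods, p); q.mu.Unlock() }

func (q *queue) pop() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.pods) == 0 {
		return "", false
	}
	p := q.pods[0]
	q.pods = q.pods[1:]
	return p, true
}

func main() {
	activeQ := &queue{}
	activeQ.push("pod1") // step4: pod1 lands in the activeQ first

	// step8: the OnUpdate for the preemptor (pod6) arrives asynchronously and
	// moves it into the activeQ -- maybe before, maybe after the next cycle pops.
	go func() {
		time.Sleep(time.Millisecond) // arbitrary delay; ordering is not guaranteed
		activeQ.push("pod6")
	}()

	// step6/7: the next scheduling cycle pops whatever is at the head of activeQ.
	if p, ok := activeQ.pop(); ok {
		fmt.Println("scheduling", p) // usually pod1, i.e. pod6 loses the race
	}
}
```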

PR https://github.com/kubernetes/kubernetes/pull/93179 happens to partially address this issue by setting the nominated node to nil in step13; with that change the pod is moved from the podBackoffQ to the activeQ, which makes it possible to break the infinite loop, since the high-priority preemptor gets a chance to be added into the activeQ before the next scheduling cycle starts.
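As far as I understand it, the idea in that PR roughly corresponds to something like the sketch below. The function name is hypothetical and this is not the PR's actual diff, just an illustration of clearing the nominated node through a status patch:

```go
package sketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// clearNominatedNode is a hypothetical sketch, not the PR's actual diff:
// clear status.nominatedNodeName so the scheduling queue sees the pod as
// updated and moves it out of the podBackoffQ.
func clearNominatedNode(ctx context.Context, cs kubernetes.Interface, pod *v1.Pod) error {
	patch := []byte(`{"status":{"nominatedNodeName":""}}`)
	_, err := cs.CoreV1().Pods(pod.Namespace).Patch(
		ctx, pod.Name, types.StrategicMergePatchType, patch,
		metav1.PatchOptions{}, "status")
	return err
}
```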

The final solution, per my understanding, is that we need to add a check here, maybe holding for a while until we find that the preemptor has been added into the activeQ (a rough sketch of that idea follows the link below):

https://github.com/kubernetes/kubernetes/blob/e79e352d36258abc5e5659289ec0fb13634bcbe7/pkg/scheduler/scheduler.go#L336-L343
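A very rough sketch of that "hold for a while" idea, assuming a hypothetical activeQHas helper (there is no such scheduler API today):

```go
package sketch

import "time"

// waitForPreemptorInActiveQ waits, with a bound, until the preemptor shows up
// in the activeQ before letting the next scheduling cycle start. activeQHas
// is a hypothetical helper standing in for a queue lookup.
func waitForPreemptorInActiveQ(preemptor string, activeQHas func(string) bool, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if activeQHas(preemptor) {
			return true
		}
		time.Sleep(10 * time.Millisecond) // arbitrary poll interval
	}
	return false // give up; fall back to the normal backoff/flush behaviour
}
```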

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Comments: 53 (49 by maintainers)

Most upvoted comments

Has there been any traction on this behind the scenes in the past 3 years?

This issue was filed many years ago, and it was an issue with the default scheduler at that time. There has been a lot of optimization / refactoring around preemption since then, so this issue probably does not stand at the moment.

I will find time to check this issue again, maybe run some labs, and close it if this is not an issue anymore.

@ermirry As to your issue, I am not clear on how your scheduler is implemented or which manifests you are using; we need more evidence.

In the example, we see that spreading constraints are matching pods that don’t themselves have the same spreading constraint. Is this something we actually should support? Especially when different priorities are involved.

I’m more inclined to document this as a limitation, rather than trying to check for other pods’ rules internally in the scheduler plugin.

But we can still fix the “enter queue”->“issue deletions” order.
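A sketch of what that ordering fix could look like, with hypothetical requeue/deleteVictim helpers standing in for the real queue and client calls:

```go
package sketch

import v1 "k8s.io/api/core/v1"

// preemptInOrder illustrates the ordering fix mentioned above: put the
// preemptor back into the scheduling queue *before* issuing the victim
// deletions, so the delete/add events that follow cannot arrive while the
// preemptor is still outside the queue.
func preemptInOrder(preemptor *v1.Pod, victims []*v1.Pod,
	requeue func(*v1.Pod), deleteVictim func(*v1.Pod) error) error {

	// 1. "enter queue": re-add the preemptor first.
	requeue(preemptor)

	// 2. "issue deletions": only then delete the victims.
	for _, victim := range victims {
		if err := deleteVictim(victim); err != nil {
			return err
		}
	}
	return nil
}
```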

step4: pod1 is added to the activeQ
step5: the NominatedNodeName field of pod6 is patched
step6: the next scheduling cycle starts and pod1 is the only pod in the activeQ
step7: the pod in the activeQ (pod1) is scheduled and the candidate node is node1
step8: the OnUpdate event of the preemptor pod (pod6) is captured and pod6 is moved from the unschedulableQ to the activeQ
step9: pod1 is scheduled on node1

Emm… I think pod1 won’t be scheduled successfully, since the nominated pod is added to the node’s snapshot at https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/core/generic_scheduler.go#L469, so the resources are already taken by pod6 in the node’s snapshot.

oops, sorry, I am talking about the same thing ahg-g already mentioned.

Another idea, but also for more than one preempted pod: maybe the preemptor pod enters a new scheduling cycle (after the first pod gets deleted) and still gets rejected because it still doesn’t fit. But while the pod is being retried, the other pod deletion events come in.

I designed the case so that the preemptor will fit after the victim pod is removed.

For one pod, I can only think of the case where the pod delete event comes before we put the pod back in the unschedulable queue.

please let me do more investigation, probably on my next day. 😃

step4: pod1 is added to activeQ

pod1 was deleted, so it will not be put back into the active queue. The replication controller will receive the delete event and create a new pod, pod1.1, to replace it.

It is true, though, that an add event for pod1.1 could be received and the pod added to the active queue while the preemptor pod is still in backoff, but that is not as immediate as saying “pod1 is added to activeQ”.

step6: the next scheduling cycle starts and pod1 is the only pod in the activeQ
step7: the pod in the activeQ (pod1) is scheduled and the candidate node is node1

That shouldn’t happen; nominated pods are taken into account when evaluating filters: https://github.com/kubernetes/kubernetes/blob/8d74486a6a1a8fb2246bd89faf1746393135a463/pkg/scheduler/core/generic_scheduler.go#L469
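A simplified illustration of that point (toy types, not the actual generic_scheduler.go logic): the nominated pods’ requests are counted against the node before the filter check, so the incoming pod sees those resources as already taken.

```go
package sketch

// nodeSnapshot is a toy stand-in for the scheduler's per-node snapshot data.
type nodeSnapshot struct {
	allocatableMilliCPU int64
	requestedMilliCPU   int64
}

// fitsWithNominated checks whether a pod fits on a node once the requests of
// higher-priority nominated (but not yet bound) pods are added to the node's
// usage, which is why pod1 would already see the resources as taken by pod6.
func fitsWithNominated(podMilliCPU int64, node nodeSnapshot, nominatedMilliCPU []int64) bool {
	requested := node.requestedMilliCPU
	for _, req := range nominatedMilliCPU {
		requested += req // count nominated pods as if they were running
	}
	return requested+podMilliCPU <= node.allocatableMilliCPU
}
```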