kubernetes: Totally avoid Pod starvation (HOL blocking) or clarify the user expectation on the wiki
What would you like to be added:
It seems that even after #81263 (#81214), pod starvation (head-of-line blocking) can only be alleviated on a best-effort basis, i.e. it cannot be totally avoided.
This is even worse in this case: https://github.com/kubernetes/kubernetes/issues/86373#issuecomment-570186182
We cannot configure the backoff duration to be infinitely large to totally avoid it, since that sacrifices other things, such as larger scheduling latency (https://github.com/kubernetes/kubernetes/issues/81214#issuecomment-520008883), higher cut-in-line rates (https://github.com/kubernetes/kubernetes/issues/83834), more frequent preemptions (https://github.com/kubernetes/kubernetes/pull/81698), etc.
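For context, here is a minimal sketch (not the actual scheduler code) of how a per-pod exponential backoff with tunable initial/max durations behaves; the constants are illustrative assumptions, but they show the trade-off: raising the max keeps an unschedulable pod out of the way longer (less head-of-line blocking) while also delaying it badly once the cluster could actually fit it.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDuration sketches an exponential per-pod backoff: the wait starts
// at initial and doubles on every failed scheduling attempt, capped at max.
// (Illustrative only; the real scheduler's queue logic is more involved.)
func backoffDuration(attempts int, initial, max time.Duration) time.Duration {
	d := initial
	for i := 1; i < attempts; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	return d
}

func main() {
	initial := 1 * time.Second // assumed small initial backoff
	max := 10 * time.Second    // assumed cap on the backoff
	for attempts := 1; attempts <= 6; attempts++ {
		fmt.Printf("attempt %d -> wait %v before retry\n",
			attempts, backoffDuration(attempts, initial, max))
	}
	// A very large max reduces head-of-line blocking, but a pod that kept
	// failing earlier now waits a long time even after resources free up,
	// which is exactly the large-scheduling-latency cost mentioned above.
}
```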
So we have to tune a suitable backoff duration for each cluster, which is very time consuming. And even if we find a suitable backoff duration initially, it may stop working later, because whether starvation happens still depends on at least the following things, which can vary over time:
- Scheduling performance in the physical cluster, which may be impacted by hardware performance (even worse if a scheduler extender is also used), e.g. hardware slowing down after long use, network congestion, etc.
- Workload patterns
- Cluster event frequency, such as pod/node add/delete events
- Kubernetes versions
So we have to tune it back and forth, which is too much maintenance effort for cluster operators. The current design also seems unreasonable, because scheduling behavior depends on too many uncontrollable and volatile factors. As a result, users feel that our scheduling behavior is confusing and differs across times and environments.
So, the K8S default scheduler still cannot strictly respect the behavior documented on the wiki, or the statement that accompanied deprecating the FIFO queue:
> As a result, the higher priority Pod may be scheduled sooner than Pods with lower priority if its scheduling requirements are met. If such Pod cannot be scheduled, scheduler will continue and tries to schedule other lower priority Pods. https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/#effect-of-pod-priority-on-scheduling-order

> After the introduction of Pod Priority and Preemption, we added a new scheduling queue which is Priority aware and has various features to provide fairness and to ensure that a high priority unschedulable pod does not block the head of the queue. https://github.com/kubernetes/kubernetes/issues/76172#issue-429528520
So, do we have any plan to totally avoid pod starvation in the future? (I have proposed one idea at the bottom.)
If not, shall we refine the wording on the wiki to clarify the user expectation for queuing behavior? That way, starvation (head-of-line blocking), i.e. a low priority Pod may stay blocked even if the cluster has free resources for it, would be clarified as by-design behavior rather than an issue. Developers/operators who care about starvation (especially those who run multi-tenant clusters) could then notice the risk in advance, and tune the backoff, implement their own queue sort plugin (see the sketch below), or even implement their own scheduler, instead of waiting until the real issue happens in prod.
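As one example, here is a sketch of the kind of Less comparator a custom queue sort plugin could provide. The types (`queuedPod`, `tenantBacklog`) are simplified stand-ins invented for illustration, not the real scheduler framework interfaces (whose exact signatures vary by Kubernetes version); the point is only that the ordering can mix priority with per-tenant fairness so that one tenant's high-priority backlog does not starve everyone else.

```go
package main

import (
	"fmt"
	"time"
)

// queuedPod is a stand-in for the scheduler framework's queued pod info;
// a real plugin would implement the framework's QueueSort interface instead.
type queuedPod struct {
	name     string
	tenant   string
	priority int32
	added    time.Time // when the pod entered the queue
}

// tenantBacklog is a hypothetical view of how many pods each tenant already
// has waiting; a real plugin would have to maintain something like this itself.
type tenantBacklog map[string]int

// less orders the scheduling queue: higher priority first, but among pods of
// equal priority, prefer the tenant with the smaller backlog, then FIFO order.
func (b tenantBacklog) less(p1, p2 *queuedPod) bool {
	if p1.priority != p2.priority {
		return p1.priority > p2.priority
	}
	if b[p1.tenant] != b[p2.tenant] {
		return b[p1.tenant] < b[p2.tenant]
	}
	return p1.added.Before(p2.added)
}

func main() {
	backlog := tenantBacklog{"team-a": 50, "team-b": 2}
	now := time.Now()
	a := &queuedPod{name: "a-1", tenant: "team-a", priority: 100, added: now}
	b := &queuedPod{name: "b-1", tenant: "team-b", priority: 100, added: now.Add(time.Second)}
	fmt.Println(backlog.less(a, b)) // false: team-b's pod goes first despite arriving later
}
```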
Possible idea that removes the need to tune the backoff duration: with this idea, it seems we can avoid head-of-line blocking as well as the large scheduling latency (https://github.com/kubernetes/kubernetes/issues/81214#issuecomment-520008883), higher cut-in-line rates (https://github.com/kubernetes/kubernetes/issues/83834), and more frequent preemptions (https://github.com/kubernetes/kubernetes/pull/81698) mentioned above.
Goal: every Pod, even a low priority Pod, should eventually get a scheduling attempt, while the priority order is still respected.
Basic Idea:
- Remove the backoff feature.
- Before each scheduling round, snapshot the current scheduling view (free resources, etc.).
- Then schedule against the snapshot: only dequeue pods from the priority queue to schedule, do not enqueue failed pods back during this step, and only apply newly allocated/assumed resources to the snapshot, not newly freed resources. Note: during this step we still enqueue newly arrived pods, as in the current behavior, so that a pod arriving mid-round is still scheduled before lower priority pods already in the queue.
- Once all pods in the priority queue have been dequeued (tried), re-snapshot the current scheduling view so everything is up to date, enqueue all pending pods back into the queue, and start the next scheduling round, i.e. go back to step 2.
In this way, we guarantee that:
Every free resource is first tried by higher priority Pods and then by lower priority Pods, instead of blocking or skipping the low priority Pods (which may cause starvation). More formally:
For every piece of free resource, a pod with priority PT that arrived at time TE will be tried against that free resource before any pod whose priority is lower than PT and whose arrival time is later than TE.
So we no longer need backoff, the queuing behavior no longer depends on scheduling performance, and it is aligned with the current wiki, which is more reasonable, natural, and intuitive for most users (maybe the best user expectation). A rough sketch of this round-based idea is given below.
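A minimal, self-contained sketch of the round-based idea, using a single scalar "free resource" and a made-up tryScheduleOnSnapshot helper in place of a real scheduling cycle; it only illustrates the queue/snapshot bookkeeping described above (and omits handling of pods that arrive mid-round), not an actual scheduler.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a drastically simplified pod: a priority and a resource request.
type pod struct {
	name     string
	priority int32
	request  int64
}

// snapshot is the frozen scheduling view taken at the start of a round.
// Only assumed allocations are applied to it; freed resources are ignored
// until the next round's re-snapshot.
type snapshot struct {
	free int64
}

// tryScheduleOnSnapshot is a stand-in for a real scheduling cycle: it
// "binds" the pod if the snapshot still has room and updates only the
// assumed (allocated) side of the view.
func tryScheduleOnSnapshot(s *snapshot, p pod) bool {
	if p.request <= s.free {
		s.free -= p.request
		return true
	}
	return false
}

// runRound drains the queue in priority order against one snapshot.
// Failed pods are collected instead of being re-enqueued mid-round, so a
// stuck high-priority pod cannot block lower-priority pods forever.
func runRound(queue []pod, clusterFree int64) (failed []pod) {
	sort.SliceStable(queue, func(i, j int) bool {
		return queue[i].priority > queue[j].priority
	})
	snap := &snapshot{free: clusterFree}
	for _, p := range queue {
		if tryScheduleOnSnapshot(snap, p) {
			fmt.Printf("round: scheduled %s (prio %d)\n", p.name, p.priority)
		} else {
			failed = append(failed, p) // retried in the next round, after re-snapshot
		}
	}
	return failed
}

func main() {
	queue := []pod{
		{"low", 1, 2},
		{"huge-high", 100, 50}, // cannot fit yet, but must not starve "low"
		{"mid", 10, 3},
	}
	// Round 1: snapshot says 5 units are free; the big pod fails, the rest fit.
	pending := runRound(queue, 5)
	// Round 2: re-snapshot (say resources were freed meanwhile) and retry.
	pending = runRound(pending, 60)
	fmt.Println("still pending:", len(pending))
}
```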
/sig scheduling @bsalamat and @draveness PTAL
@alculquicondor One thing to note is that the 100 pods/s metric is based on fairly simple workloads. If the Pods come with complex specs such as PodAffinity, a slowdown in scheduling is expected. And as @yqwang-ms mentioned, using an extender would further slow down a scheduling cycle considerably (due to marshaling/unmarshaling as well as network overhead).
@ahg-g Thanks for the clarification. It seems we do not plan to support multi-tenancy as a first-class feature in the default scheduler, at least in the short term.
Anyway, as @alculquicondor said, even in the short term we at least need to make the default scheduler (and the whole K8S) more extensible, so that a scheduling plugin and/or scheduler extender can easily support multi-tenancy by themselves. (This kind of prototyping may also make first-class support possible and smooth.)
Another point is that we had better improve the default scheduler (first-class support or more extensibility), instead of re-implementing a new scheduler like kube-batch, right?
By "more extensible", I mean we may also need to support more features, such as:
That is the problem. Some resources might be released so that the free resources now meet the demand of the high priority pods, but those pods could not get scheduled until the scheduling round ends. However, these resources might instead be allocated to new pods with the same priority.
And there might be other cases we have not considered. I think a proposal or KEP is needed.