kubernetes: After 1.8, scheduler could reject unknown extended resource names

/kind bug

What happened: I am not sure yet whether this should be considered a regression or working as intended (WAI).

Basically, with PR https://github.com/kubernetes/kubernetes/pull/48922, we introduced ExtendedResources, which cover all resources outside the default kubernetes.io namespace. In the scheduler's PodFitsResources(), for every extended resource a pod requests, we check whether the node has enough quantity of that resource. This introduced a behavior change for non-OIR (Opaque Integer Resource) resources outside the default kubernetes.io namespace: before PR 48922, the scheduler would not reject such pods even if the node advertised no such resource, but after PR 48922, the scheduler rejects them with InsufficientResourceError.
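The behavior change described above can be sketched roughly as follows. This is a simplified illustration, not the actual scheduler code; the function and parameter names (`fitsExtendedResources`, `podRequests`, `nodeAllocatable`) are hypothetical stand-ins for the real PodFitsResources logic:

```go
package main

import (
	"fmt"
	"strings"
)

// isExtendedResource reports whether a resource name falls outside the
// default kubernetes.io namespace (a simplified stand-in for the real check).
func isExtendedResource(name string) bool {
	return strings.Contains(name, "/") && !strings.HasPrefix(name, "kubernetes.io/")
}

// fitsExtendedResources mimics the post-#48922 behavior: every extended
// resource requested by the pod must be advertised by the node in
// sufficient quantity, otherwise the pod is rejected.
func fitsExtendedResources(podRequests, nodeAllocatable map[string]int64) (bool, string) {
	for name, want := range podRequests {
		if !isExtendedResource(name) {
			continue // core resources are handled elsewhere
		}
		if have, ok := nodeAllocatable[name]; !ok || have < want {
			return false, fmt.Sprintf("InsufficientResourceError: %s (want %d, have %d)", name, want, have)
		}
	}
	return true, ""
}

func main() {
	ok, reason := fitsExtendedResources(
		map[string]int64{"example.com/foo": 1}, // pod requests an extended resource
		map[string]int64{},                     // node advertises nothing
	)
	fmt.Println(ok, reason)
}
```

Before PR 48922 the unadvertised `example.com/foo` request would simply have been skipped; after it, the loop above rejects the pod.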

I am opening this issue to discuss whether this behavior change is ok. Please feel free to add any related folks to the discussion.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.8+
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 23 (19 by maintainers)

Most upvoted comments

@yguo0905 your summary matches my thoughts. Let me highlight the use case a bit more:

A user intends to expose cluster-level resources. They start by running a scheduler extender that understands their new resource. Along the way they realize that the scheduler requires all resources, including cluster-level ones, to be attached to a node (i.e., part of node capacity/allocatable). This forces extenders to attach cluster-level resources to nodes as part of scheduling, which is not desirable.

The solution proposed in the previous comment solves this problem by giving the scheduler hints about cluster-level resources, so that the scheduler no longer requires every resource to be part of node capacity.

We have an unrelated need for the scheduler to avoid reaching out to extenders unless a pod requests a resource that the extender exposes. This is meant to make extenders safe to use in large (or high-churn) clusters. For this purpose, the extender configuration can include an optional list of resources that the extender supports. The scheduler then reaches out to the extender only when it processes a pod that requests a resource supported by that extender. By default, the extender is invoked for all pods, preserving existing behavior.

This config change can also be used by the scheduler to track cluster-level resources. For instance, if the extender config were to include additional metadata that classifies certain resources it supports as cluster-level ones, the scheduler could stop expecting such resources to be tied to nodes.
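A sketch of how such a gate might look. The field and function names here (`ExtenderConfig`, `ManagedResources`, `shouldCallExtender`) are illustrative assumptions, not a final API:

```go
package main

import "fmt"

// ExtenderConfig is a hypothetical extender configuration carrying an
// optional list of resources the extender manages.
type ExtenderConfig struct {
	URLPrefix        string
	ManagedResources []string // empty means "call for every pod" (existing behavior)
}

// shouldCallExtender returns true when the extender must be consulted for a
// pod with the given resource requests.
func shouldCallExtender(cfg ExtenderConfig, podRequests map[string]int64) bool {
	if len(cfg.ManagedResources) == 0 {
		return true // default: preserve existing behavior
	}
	for _, r := range cfg.ManagedResources {
		if _, ok := podRequests[r]; ok {
			return true // pod requests a resource this extender manages
		}
	}
	return false // skip the extender entirely for this pod
}

func main() {
	cfg := ExtenderConfig{
		URLPrefix:        "http://extender.example", // placeholder endpoint
		ManagedResources: []string{"example.com/foo"},
	}
	fmt.Println(shouldCallExtender(cfg, map[string]int64{"cpu": 1}))              // pod does not request the managed resource
	fmt.Println(shouldCallExtender(cfg, map[string]int64{"example.com/foo": 1})) // pod does
}
```

The key design point is the default: an empty list keeps the extender in the path for every pod, so existing deployments see no change.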

This is only part of the story, though, since the node also performs feasibility checks and likewise expects all compute resources to be available in node capacity/allocatable. To support cluster-level resources and enable overall compute resource extensibility, the node needs to ignore extended resources that do not belong to any device plugin currently registered. There is a failure scenario here: a device plugin may unregister itself after the scheduler has placed a pod requesting a resource that plugin exposed. With the proposed model, kubelet may then incorrectly admit the pod even though the device plugin expected to handle one of its resources is missing. We consider this a known limitation of the system, and the recommendation is to tie the health of the node to the availability and health of its device plugins: a node is marked "unhealthy" and drops all incoming pods if a device plugin on that node fails.
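A rough sketch of the proposed admission relaxation. The helper names (`admitPod`, `registered`, `pluginCapacity`) are hypothetical, not kubelet code; the point is only the first branch, which ignores extended resources that no registered device plugin owns:

```go
package main

import "fmt"

// admitPod mimics the proposed kubelet behavior: extended resources that no
// registered device plugin exposes are ignored at admission time, while
// resources a plugin does expose are still checked against its capacity.
func admitPod(podRequests, pluginCapacity map[string]int64, registered map[string]bool) bool {
	for name, want := range podRequests {
		if !registered[name] {
			continue // no plugin owns this resource: ignore it, per the proposal
		}
		if pluginCapacity[name] < want {
			return false // a registered plugin owns it but cannot satisfy it
		}
	}
	return true
}

func main() {
	// "example.com/bar" has no registered plugin, so kubelet would admit the
	// pod; this is exactly the known limitation described above when the
	// plugin unregistered after scheduling.
	fmt.Println(admitPod(map[string]int64{"example.com/bar": 1}, nil, nil))
}
```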

Added this issue to the sig-scheduling agenda for Oct 12