kubernetes: kube-scheduler crashes and restarts with panics in DefaultPreemption plugin

What happened:

Kubernetes 1.19 (confirmed with 1.19.7 and 1.19.10)

kube-scheduler crashes and restarts with the following errors in default_preemption.go.

  1. panic: runtime error: index out of range [0] with length 0

The problem is at line 389: when victims is nil or victims.Pods is empty, indexing victims.Pods[0] panics. If we skip the GetPodPriority call when the victim list is nil or empty, the error goes away.

387         victims := nodesToVictims[node]
388         // highestPodPriority is the highest priority among the victims on this node.
389         highestPodPriority := podutil.GetPodPriority(victims.Pods[0])
panic: runtime error: index out of range [0] with length 0 [recovered]
panic: runtime error: index out of range [0] with length 0
goroutine 1406 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/Users/yuanchen/projects/aci/go/kube-scheduler/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x109
panic(0x1c9fde0, 0xc00748cf48)
 /usr/local/Cellar/go/1.16.2/libexec/src/runtime/panic.go:965 +0x1b9
k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption.pickOneNodeForPreemption(0xc007e3dbc0, 0x4, 0x4)
/Users/yuanchen/projects/aci/go/kubescheduler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go:389 +0xd2c
k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption.SelectCandidate(0xc0086e0840, 0x4, 0x4, 0xc008235800, 0x2068138)
/Users/yuanchen/projects/aci/go/kubescheduler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go:331 +0x85
k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption.(*DefaultPreemption).preempt(0xc000ac4b80, 0x20677d0, 0xc005849540, 0xc0085616e0, 0xc008235800, 0xc008561710, 0x2d80750, 0xc00820a8c0, 0x7f0937ef47d8, 0x30)
/Users/yuanchen/projects/aci/go/kube-scheduler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go:135 +0x4bd
k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption.(*DefaultPreemption).PostFilter(0xc000ac4b80, 0x20677d0, 0xc005849540, 0xc0085616e0, 0xc0071543a8, 0xc008561710, 0x0, 0x0)
/Users/yuanchen/projects/aci/go/kube-scheduler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go:83 +0xf1
k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).runPostFilterPlugin(0xc0002fd380, 0x20677d0, 0xc005849540, 0x7f0937a84040, 0xc000ac4b80, 0xc0085616e0, 0xc0071543a8, 0xc008561710, 0xc0084f3ab0, 0x5f5e100)
/Users/yuanchen/projects/aci/go/kube-scheduler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/runtime/framework.go:557 +0x87
  2. panic: runtime error: invalid memory address or nil pointer dereference

452 latestStartTime := util.GetEarliestPodStartTime(nodesToVictims[minNodes2[0]])

The problem is that nodesToVictims[minNodes2[0]] sometimes has no entry, so the lookup returns nil. Simply skipping the call won’t solve the problem: the scheduler will then reach either line 456 or line 341.

456        klog.Errorf("earliestStartTime is nil for node %s. Should not reach here.", minNodes2[0])

341 klog.Errorf("None candidate can be picked from %v.", candidates)

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x15706d7]
goroutine 1385 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/Users/yuanchen/projects/aci/go/kube-scheduler/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x109
panic(0x1b448c0, 0x2d281b0)
/usr/local/Cellar/go/1.16.2/libexec/src/runtime/panic.go:965 +0x1b9
k8s.io/kubernetes/pkg/scheduler/util.GetEarliestPodStartTime(0x0, 0xc004d4d9e0)
/Users/yuanchen/projects/aci/go/kube-scheduler/vendor/k8s.io/kubernetes/pkg/scheduler/util/utils.go:56 +0x37
k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption.pickOneNodeForPreemption(0xc004d4d9e0, 0x4, 0x4)
/Users/yuanchen/projects/aci/go/kube-scheduler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go:452 +0x7a9
k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption.SelectCandidate(0xc0066e5180, 0x4, 0x4, 0xc0062f5000, 0x2068138)
/Users/yuanchen/projects/aci/go/kube-scheduler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go:331 +0x85
k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption.(*DefaultPreemption).preempt(0xc000f9e880, 0x20677d0, 0xc004fa3340, 0xc0061b5290, 0xc0062f5000, 0xc0061b52c0, 0x2d80750, 0xc00603fc40, 0x7f6bcbc363f8, 0x30)
/Users/yuanchen/projects/aci/go/kube-scheduler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go:135 +0x4bd
k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption.(*DefaultPreemption).PostFilter(0xc000f9e880, 0x20677d0, 0xc004fa3340, 0xc0061b5290, 0xc002523ac8, 0xc0061b52c0, 0x0, 0x0)

What you expected to happen:

The scheduler works without failures.

How to reproduce it (as minimally and precisely as possible):

The crashes occur whenever preemption is triggered in the cluster; there are no more specific reproduction steps.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.19.10 or v1.19.7
  • Cloud provider or hardware configuration: bare metal
  • OS (e.g: cat /etc/os-release): centos 7
  • Kernel (e.g. uname -a): Linux 5.4.77-7.el7pie #1 SMP Sat Nov 21 01:16:27 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 36 (35 by maintainers)

Most upvoted comments

For starters, we should check the list length; the scheduler should never panic, in any case.

Other than that, I think we need to clearly document what the preemption logic does (that it executes the PreFilter extension points and the Filter plugins multiple times) and the expectations for filter plugins (that they may get executed more than once in the same scheduling cycle).

In the preemption phase we run the filters again to check whether removing lower-priority pods makes the node schedulable: https://github.com/kubernetes/kubernetes/blob/d9839a30c78c55f4cda8edb8effa249e9bb42884/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go#L543

We later add them back one by one to reduce the set of victim pods to the absolute minimum: https://github.com/kubernetes/kubernetes/blob/d9839a30c78c55f4cda8edb8effa249e9bb42884/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go#L558

My hypothesis is that your filter somehow returns success when all the pods are added back, so you end up with zero victims and a candidate node. This shouldn’t happen: if the pod fits the node without removing any victims, we shouldn’t be running preemption in the first place. So the theory is that the custom filter returns false when run in the filter phase but true when executed in the preemption phase, perhaps because it makes assumptions about CycleState that make it behave this way (e.g. that the filter will be executed once per node in a scheduling cycle).