kubernetes: Scheduler dies with "Schedulercache is corrupted"

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

/sig scheduling

What happened:

The scheduler panics with “Schedulercache is corrupted and can badly affect scheduling decisions”. There are a few related issues reporting a similar bug (#46347, #47150), but in both of those issues the recommended fix is to upgrade to etcd 3.0.17. We’re using etcd 3.2.0 which is newer than that. This happened in our production cluster twice today.

some evidence that this is issue an with etcd:

scheduler fails with “Schedulercache is corrupted and can badly affect scheduling decisions”
another scheduler on another machine gets elected the leader
that scheduler also fails with “Schedulercache is corrupted and can badly affect scheduling decisions”

Conversely, evidence that it isn’t an issue with etcd: when we later restart the scheduler (~30 minutes later), the system recovers without any changes to etcd (according to the apiserver lots). So etcd’s contents are at least not permanently corrupted.

Here’s an excerpt of the logs. full logs are in this gist (all the relevant kubelet/controller manager/apiserver/scheduler logs)

======== scheduler (kubemaster--0fd4bec98eba7656c) ======

I0818 09:16:25.678670       1 factory.go:511] About to try and schedule pod explorer-job-1503047460-7681l
I0818 09:16:25.678677       1 scheduler.go:261] Attempting to schedule pod: scheduled-jobs/explorer-job-1503047460-7681l
W0818 09:16:25.679040       1 cache.go:228] Pod scheduled-jobs/explorer-job-1503047460-7681l assumed to a different node than added to.
I0818 09:16:25.679059       1 factory.go:725] Attempting to bind explorer-job-1503047460-7681l to ip-10-68-30-4.us-west-2.compute.internal
E0818 09:16:25.679061       1 cache.go:263] Pod scheduled-jobs/explorer-job-1503047460-7681l updated on a different node than previously added to.
W0818 09:16:25.679040       1 cache.go:228] Pod scheduled-jobs/explorer-job-1503047460-7681l assumed to a different node than added to.
E0818 09:16:25.679061       1 cache.go:263] Pod scheduled-jobs/explorer-job-1503047460-7681l updated on a different node than previously added to.
F0818 09:16:25.679081       1 cache.go:264] Schedulercache is corrupted and can badly affect scheduling decisions


>>>> stack trace redacted: panic <<<<

======== apiserver audit ======

2017-08-18T09:16:25.679953099Z AUDIT: id="1e0516d4-9a95-4b4b-aaee-98a0894b3625" ip="10.68.24.10" method="POST" user="kubemaster--0fd4bec98eba7656c.nw" groups="\"kubemaster.northwest.prod\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="scheduled-jobs" uri="/api/v1/namespaces/scheduled-jobs/pods/explorer-job-1503047460-7681l/binding"

======== scheduler (kubemaster--0aa8e87a222ebc781.nw) ======

I0818 09:16:45.738257       1 leaderelection.go:189] successfully acquired lease kube-system/kube-scheduler
I0818 09:16:45.738413       1 event.go:218] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-scheduler", UID:"3d55f039-8367-11e7-8b13-02708673f6f6", APIVersion:"v1", ResourceVersion:"737096", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' kubemaster--0aa8e87a222ebc781.nw became leader
I0818 09:16:45.838560       1 shared_informer.go:116] caches populated

I0818 09:19:22.361105       1 leaderelection.go:204] successfully renewed lease kube-system/kube-scheduler
I0818 09:19:24.369669       1 leaderelection.go:204] successfully renewed lease kube-system/kube-scheduler
I0818 09:19:25.338146       1 reflector.go:405] external/com_github_kubernetes/plugin/cmd/kube-scheduler/app/server.go:100: Watch close - *v1.Pod total 21 items received
I0818 09:19:25.340488       1 factory.go:511] About to try and schedule pod explorer-job-1503047460-7681l
I0818 09:19:25.340495       1 scheduler.go:261] Attempting to schedule pod: scheduled-jobs/explorer-job-1503047460-7681l
W0818 09:19:25.340723       1 cache.go:228] Pod scheduled-jobs/explorer-job-1503047460-7681l assumed to a different node than added to.
E0818 09:19:25.340741       1 cache.go:263] Pod scheduled-jobs/explorer-job-1503047460-7681l updated on a different node than previously added to.
F0818 09:19:25.340759       1 cache.go:264] Schedulercache is corrupted and can badly affect scheduling decisions

Environment:

Kubernetes version (use kubectl version): 1.7.4
Etcd version: 3.2.0-rc.1+git
Cloud provider or hardware configuration**: AWS
OS: Ubuntu 15.04
Kernel: 4.4.0

About this issue

Original URL
State: closed
Created 7 years ago
Comments: 31 (31 by maintainers)

Links to this issue

Learning to operate Kubernetes reliably

Commits related to this issue

Merge pull request #55262 from liggitt/schedulercache Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://gith... — committed to kubernetes/kubernetes by deleted user 7 years ago

Most upvoted comments

Reproducing test and fix in #55262

liggitt on Nov 7, 2017

@wojtek-t I tried for several times but can’t reproduce.

@julia-stripe Any tips about reproducing this issue?

resouer on Aug 21, 2017