kubernetes: Scheduler dies with "Schedulercache is corrupted"

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

/sig scheduling

What happened:

The scheduler panics with “Schedulercache is corrupted and can badly affect scheduling decisions”. There are a few related issues reporting a similar bug (#46347, #47150), but in both of those issues the recommended fix is to upgrade to etcd 3.0.17. We’re using etcd 3.2.0 which is newer than that. This happened in our production cluster twice today.

some evidence that this is issue an with etcd:

  1. scheduler fails with “Schedulercache is corrupted and can badly affect scheduling decisions”
  2. another scheduler on another machine gets elected the leader
  3. that scheduler also fails with “Schedulercache is corrupted and can badly affect scheduling decisions”

Conversely, evidence that it isn’t an issue with etcd: when we later restart the scheduler (~30 minutes later), the system recovers without any changes to etcd (according to the apiserver lots). So etcd’s contents are at least not permanently corrupted.

Here’s an excerpt of the logs. full logs are in this gist (all the relevant kubelet/controller manager/apiserver/scheduler logs)

======== scheduler (kubemaster--0fd4bec98eba7656c) ======

I0818 09:16:25.678670       1 factory.go:511] About to try and schedule pod explorer-job-1503047460-7681l
I0818 09:16:25.678677       1 scheduler.go:261] Attempting to schedule pod: scheduled-jobs/explorer-job-1503047460-7681l
W0818 09:16:25.679040       1 cache.go:228] Pod scheduled-jobs/explorer-job-1503047460-7681l assumed to a different node than added to.
I0818 09:16:25.679059       1 factory.go:725] Attempting to bind explorer-job-1503047460-7681l to ip-10-68-30-4.us-west-2.compute.internal
E0818 09:16:25.679061       1 cache.go:263] Pod scheduled-jobs/explorer-job-1503047460-7681l updated on a different node than previously added to.
W0818 09:16:25.679040       1 cache.go:228] Pod scheduled-jobs/explorer-job-1503047460-7681l assumed to a different node than added to.
E0818 09:16:25.679061       1 cache.go:263] Pod scheduled-jobs/explorer-job-1503047460-7681l updated on a different node than previously added to.
F0818 09:16:25.679081       1 cache.go:264] Schedulercache is corrupted and can badly affect scheduling decisions


>>>> stack trace redacted: panic <<<<

======== apiserver audit ======

2017-08-18T09:16:25.679953099Z AUDIT: id="1e0516d4-9a95-4b4b-aaee-98a0894b3625" ip="10.68.24.10" method="POST" user="kubemaster--0fd4bec98eba7656c.nw" groups="\"kubemaster.northwest.prod\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="scheduled-jobs" uri="/api/v1/namespaces/scheduled-jobs/pods/explorer-job-1503047460-7681l/binding"

======== scheduler (kubemaster--0aa8e87a222ebc781.nw) ======

I0818 09:16:45.738257       1 leaderelection.go:189] successfully acquired lease kube-system/kube-scheduler
I0818 09:16:45.738413       1 event.go:218] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-scheduler", UID:"3d55f039-8367-11e7-8b13-02708673f6f6", APIVersion:"v1", ResourceVersion:"737096", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' kubemaster--0aa8e87a222ebc781.nw became leader
I0818 09:16:45.838560       1 shared_informer.go:116] caches populated

I0818 09:19:22.361105       1 leaderelection.go:204] successfully renewed lease kube-system/kube-scheduler
I0818 09:19:24.369669       1 leaderelection.go:204] successfully renewed lease kube-system/kube-scheduler
I0818 09:19:25.338146       1 reflector.go:405] external/com_github_kubernetes/plugin/cmd/kube-scheduler/app/server.go:100: Watch close - *v1.Pod total 21 items received
I0818 09:19:25.340488       1 factory.go:511] About to try and schedule pod explorer-job-1503047460-7681l
I0818 09:19:25.340495       1 scheduler.go:261] Attempting to schedule pod: scheduled-jobs/explorer-job-1503047460-7681l
W0818 09:19:25.340723       1 cache.go:228] Pod scheduled-jobs/explorer-job-1503047460-7681l assumed to a different node than added to.
E0818 09:19:25.340741       1 cache.go:263] Pod scheduled-jobs/explorer-job-1503047460-7681l updated on a different node than previously added to.
F0818 09:19:25.340759       1 cache.go:264] Schedulercache is corrupted and can badly affect scheduling decisions

Environment:

  • Kubernetes version (use kubectl version): 1.7.4
  • Etcd version: 3.2.0-rc.1+git
  • Cloud provider or hardware configuration**: AWS
  • OS: Ubuntu 15.04
  • Kernel: 4.4.0

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 31 (31 by maintainers)

Commits related to this issue

Most upvoted comments

Reproducing test and fix in #55262

@wojtek-t I tried for several times but can’t reproduce.

@julia-stripe Any tips about reproducing this issue?