kubernetes: Kube-scheduler dies with "Schedulercache is corrupted"
BUG REPORT
Kubernetes version (use kubectl version):
kubectl version
Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.1", GitCommit:"b0b7a323cc5a4a2019b2e9520c21c7830b7f708e", GitTreeState:"clean", BuildDate:"2017-04-03T20:44:38Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.3+coreos.1", GitCommit:"bc000d3336a0b11155ac222193e6f24b6dcb5cd1", GitTreeState:"clean", BuildDate:"2017-05-19T00:19:21Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Environment:
- Self-hosted:
- OS: Container Linux by CoreOS 1353.7.0 (Master + most nodes) and Container Linux by CoreOS 1298.5.0 (some nodes)
- Kernel:
  - 1353.7.0: 4.9.24-coreos #1 SMP Wed Apr 26 21:44:23 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz GenuineIntel GNU/Linux
  - 1298.5.0: Linux alien8 4.9.9-coreos-r1 #1 SMP Tue Feb 28 00:06:10 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz GenuineIntel GNU/Linux
- Install tools: Hyperkube (https://coreos.com/kubernetes/docs/latest/getting-started.html)
- Others:
- The underlying Hypervisor is ESXi 5.5
- The kubelet is running in rkt; the other Kubernetes components (incl. the scheduler) are running in Docker
- The etcd cluster (3 nodes) is v3.0.10, running inside rkt containers (using the CoreOS etcd-member.service)
What happened:
After some testing with my private image registry (using the docs at https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/) I ran into problems with the kube-scheduler. Creating a new pod, e.g.
apiVersion: v1
kind: Pod
metadata:
  name: privateimage-test
spec:
  containers:
  - name: private-container
    image: server:4567/namespace/container:tag
  imagePullSecrets:
  - name: privateregkey
kills the kube-scheduler:
I0524 13:22:55.458837 1 event.go:217] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"privateimage-test", UID:"ff287e49-4083-11e7-8c95-005056b70482", APIVersion:"v1", ResourceVersion:"4460349", FieldPath:""}): type: 'Normal' reason: 'Scheduled' Successfully assigned privateimage-test to alien6
E0524 13:29:33.440433 1 cache.go:290] Pod default/privateimage-test removed from a different node than previously added to.
F0524 13:29:33.440454 1 cache.go:291] Schedulercache is corrupted and can badly affect scheduling decisions
The kube-scheduler is restarted by the kubelet, and the pod is started after that.
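For reference, the scheduler output above can still be retrieved after the restart from the static pod's previous container, e.g. with something like:

kubectl -n kube-system logs kube-scheduler-alien1 --previous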
Some output from the kubelet:
May 24 15:29:34 alien1 kubelet-wrapper[1634]: E0524 13:29:34.085188 1634 event.go:259] Could not construct reference to: '&v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"kube-scheduler-alien1", UID:"cd159dc8e8cca207146c55f953a24533", APIVersion:"v1", ResourceVersion:"", FieldPath:"spec.containers{kube-scheduler}"}' due to: 'object does not implement the List interfaces'. Will not report event: 'Warning' 'Unhealthy' 'Liveness probe failed: Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: getsockopt: connection refused'
May 24 15:29:34 alien1 kubelet-wrapper[1634]: I0524 13:29:34.407447 1634 kuberuntime_manager.go:458] Container {Name:kube-scheduler Image:quay.io/coreos/hyperkube:v1.6.3_coreos.1 Command:[/hyperkube scheduler --master=http://127.0.0.1:8080 --leader-elect=true] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:10251,Host:127.0.0.1,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:15,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
May 24 15:29:34 alien1 kubelet-wrapper[1634]: I0524 13:29:34.407623 1634 kuberuntime_manager.go:742] checking backoff for container "kube-scheduler" in pod "kube-scheduler-alien1_kube-system(cd159dc8e8cca207146c55f953a24533)"
May 24 15:29:34 alien1 kubelet-wrapper[1634]: I0524 13:29:34.407815 1634 kuberuntime_manager.go:752] Back-off 1m20s restarting failed container=kube-scheduler pod=kube-scheduler-alien1_kube-system(cd159dc8e8cca207146c55f953a24533)
May 24 15:29:34 alien1 kubelet-wrapper[1634]: E0524 13:29:34.407895 1634 pod_workers.go:182] Error syncing pod cd159dc8e8cca207146c55f953a24533 ("kube-scheduler-alien1_kube-system(cd159dc8e8cca207146c55f953a24533)"), skipping: failed to "StartContainer" for "kube-scheduler" with CrashLoopBackOff: "Back-off 1m20s restarting failed container=kube-scheduler pod=kube-scheduler-alien1_kube-system(cd159dc8e8cca207146c55f953a24533)"
May 24 15:29:44 alien1 kubelet-wrapper[1634]: I0524 13:29:44.085354 1634 kuberuntime_manager.go:458] Container {Name:kube-scheduler Image:quay.io/coreos/hyperkube:v1.6.3_coreos.1 Command:[/hyperkube scheduler --master=http://127.0.0.1:8080 --leader-elect=true] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:10251,Host:127.0.0.1,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:15,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
May 24 15:29:44 alien1 kubelet-wrapper[1634]: I0524 13:29:44.085464 1634 kuberuntime_manager.go:742] checking backoff for container "kube-scheduler" in pod "kube-scheduler-alien1_kube-system(cd159dc8e8cca207146c55f953a24533)"
May 24 15:29:44 alien1 kubelet-wrapper[1634]: I0524 13:29:44.085588 1634 kuberuntime_manager.go:752] Back-off 1m20s restarting failed container=kube-scheduler pod=kube-scheduler-alien1_kube-system(cd159dc8e8cca207146c55f953a24533)
May 24 15:29:44 alien1 kubelet-wrapper[1634]: E0524 13:29:44.085627 1634 pod_workers.go:182] Error syncing pod cd159dc8e8cca207146c55f953a24533 ("kube-scheduler-alien1_kube-system(cd159dc8e8cca207146c55f953a24533)"), skipping: failed to "StartContainer" for "kube-scheduler" with CrashLoopBackOff: "Back-off 1m20s restarting failed container=kube-scheduler pod=kube-scheduler-alien1_kube-system(cd159dc8e8cca207146c55f953a24533)"
The journal entry right before that shows:
May 24 15:29:33 alien1 dockerd[1757]: time="2017-05-24T15:29:33.460151779+02:00" level=error msg="Error closing logger: invalid argument"
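The failing liveness probe in the kubelet output corresponds to the scheduler's /healthz endpoint; while the scheduler is up it answers directly on the master, e.g.:

curl http://127.0.0.1:10251/healthz
# expected to answer "ok" while the scheduler is healthy; "connection refused" while it is down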
Creating a new pod with something like kubectl run busybox --image busybox /bin/sh does not kill the scheduler!
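For reference, the only relevant difference to that busybox pod is the private image plus the imagePullSecrets entry. The privateregkey secret was set up following the linked doc; a secret of that kind is typically created with something like this (registry address and credentials here are placeholders):

kubectl create secret docker-registry privateregkey \
  --docker-server=server:4567 \
  --docker-username=<user> \
  --docker-password=<password> \
  --docker-email=<email>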
What you expected to happen: The Scheduler should not die 😃
How to reproduce it:
Not sure; the cluster worked for ~30 days without problems. The problem started after working with my private image registry. Before that I updated k8s from 1.6.1 to 1.6.3.
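In this cluster the crash has so far only shown up when creating the private-registry pod from above, so a rough attempt at reproducing it would be (the manifest filename is a placeholder):

kubectl delete pod privateimage-test --ignore-not-found
kubectl create -f privateimage-test.yaml
# then check whether the scheduler's restart count went up
kubectl -n kube-system get pod kube-scheduler-alien1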
Anything else we need to know:
I restarted the Master after the problem first occurred (that had helped before with some other problems).
About this issue
- State: closed
- Created 7 years ago
- Comments: 25 (20 by maintainers)
@smarterclayton @timothysc - can you clarify why you think this PR can help with this problem? I looked into it again, and I don't see any place where we weren't doing a copy before and are doing one now. I agree it fixes watch semantics, but that is unrelated to this issue.