kops: BUG REPORT: kube-dns stuck at Error syncing pod / Pod sandbox changed, it will be killed and re-created
BUG REPORT: the initial kops cluster has an issue with kube-dns; kube-dns is stuck cycling between Error syncing pod and Pod sandbox changed, it will be killed and re-created.
kops command
kops create cluster --cloud=aws --zones=$AWS_ZONE \
--name=$CLUSTER_NAME \
--network-cidr=${NETWORK_CIDR} --vpc=${VPC_ID} \
--bastion=true --topology=private --networking=calico \
--dns-zone=${DNS_ZONE}
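(For context: a cluster created this way still has to be applied, and can then be health-checked, with the standard kops commands below.)
kops update cluster ${CLUSTER_NAME} --yes
kops validate cluster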
kops version
Version 1.7.0 (git-e04c29d)
kubectl version
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5", GitCommit:"17d7182a7ccbb167074be7a87f0a68bd00d58d97", GitTreeState:"clean", BuildDate:"2017-08-31T09:14:02Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.2", GitCommit:"922a86cfcd65915a9b2f69f3f193b8907d741d9c", GitTreeState:"clean", BuildDate:"2017-07-21T08:08:00Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
cloud provider: AWS
admin@ip-172-17-3-61:~$ kubectl get events --all-namespaces
NAMESPACE LASTSEEN FIRSTSEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
kube-system 18s 1h 204 kube-dns-479524115-h5sxc Pod Warning FailedSync kubelet, ip-172-17-3-61.ec2.internal Error syncing pod
kube-system 17s 1h 203 kube-dns-479524115-h5sxc Pod Normal SandboxChanged kubelet, ip-172-17-3-61.ec2.internal Pod sandbox changed, it will be killed and re-created.
kube-system 9s 1h 209 kube-dns-autoscaler-1818915203-7j0cx Pod Warning FailedSync kubelet, ip-172-17-3-61.ec2.internal Error syncing pod
kube-system 9s 1h 205 kube-dns-autoscaler-1818915203-7j0cx Pod Normal SandboxChanged kubelet, ip-172-17-3-61.ec2.internal Pod sandbox changed, it will be killed and re-created.
kube-system 3m 4d 1405 kube-proxy-ip-172-17-3-61.ec2.internal Pod spec.containers{kube-proxy} Normal Created kubelet, ip-172-17-3-61.ec2.internal Created container
kube-system 3m 4d 1405 kube-proxy-ip-172-17-3-61.ec2.internal Pod spec.containers{kube-proxy} Normal Started kubelet, ip-172-17-3-61.ec2.internal Started container
kube-system 3m 4d 1404 kube-proxy-ip-172-17-3-61.ec2.internal Pod spec.containers{kube-proxy} Normal Pulled kubelet, ip-172-17-3-61.ec2.internal Container image "gcr.io/google_containers/kube-proxy:v1.7.2" already present on machine
kube-system 9s 4d 32243 kube-proxy-ip-172-17-3-61.ec2.internal Pod spec.containers{kube-proxy} Warning BackOff kubelet, ip-172-17-3-61.ec2.internal Back-off restarting failed container
kube-system 9s 4d 32243 kube-proxy-ip-172-17-3-61.ec2.internal Pod Warning FailedSync kubelet, ip-172-17-3-61.ec2.internal Error syncing pod
kube-system 18s 4d 13683 kubernetes-dashboard-4056215011-05kjw Pod Warning FailedSync kubelet, ip-172-17-3-61.ec2.internal Error syncing pod
kube-system 17s 4d 13628 kubernetes-dashboard-4056215011-05kjw Pod Normal SandboxChanged kubelet, ip-172-17-3-61.ec2.internal Pod sandbox changed, it will be killed and re-created.
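To dig further into one of the failing pods, something along these lines can be run (the pod name is taken from the events above; the node-side commands assume systemd units named kubelet and docker, as on the default kops Debian images):
kubectl -n kube-system describe pod kube-dns-479524115-h5sxc
# on the affected node (ip-172-17-3-61), check the kubelet and Docker logs
journalctl -u kubelet --since "1 hour ago"
journalctl -u docker --since "1 hour ago"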
p.s. I had to change the taint on the master node to get past the initial error message No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (1). This seems like a bad choice for a default?
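For anyone hitting the same scheduling error, clearing the master taint looks roughly like this (this assumes the standard node-role.kubernetes.io/master taint, <master-node> is a placeholder for the master's node name, and removing the taint is a workaround rather than a fix):
# show the taints currently set on the master
kubectl describe node <master-node> | grep -i taints
# remove the NoSchedule taint; the trailing "-" deletes it
kubectl taint nodes <master-node> node-role.kubernetes.io/master:NoSchedule-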
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 10
- Comments: 15 (4 by maintainers)
I had the same problem while running 1.7.11 with weave. It began all of a sudden, which is scary because, even though it happened on staging, my production environment has exactly the same setup. Pods were stuck in ContainerCreating.
I tried going from 1.7.11 -> 1.8.4 in a desperate attempt to get things working again, but things remained the same.
This fix was suggested to me by @hubt on the #kops Slack channel. It boils down to upgrading to weave 2.1.3.
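Before and after upgrading it is worth confirming which Weave image is actually running; assuming the standard weave-net DaemonSet in kube-system, something like:
kubectl -n kube-system get daemonset weave-net \
  -o jsonpath='{.spec.template.spec.containers[0].image}'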
The steps: delete the existing Weave DaemonSet on the kube-system namespace with kubectl delete -f weave-daemonset-k8s-1.7.yaml (the rolebindings are not exactly the same, but I am not 100% sure this step is needed), then recreate it with kubectl create -f weave-daemonset-k8s-1.7.yaml.
Still, it is very frustrating not knowing what the reason is. I suspect it might be related to https://github.com/weaveworks/weave/issues/2822, as I saw the Unexpected command output Device "eth0" does not exist. message several times, and checking the IPAM service as suggested in https://github.com/weaveworks/weave/issues/2822#issuecomment-283113983 gives similar output.
Anyone figure out a root cause? I am seeing two different CNI providers, so I think it is not the providers. Different OSes, so not CoreOS or Debian. I am thinking Docker or k8s, maybe. Anyone find anything in the logs? Anyone have a repeatable set of commands to recreate this? I am even seeing kubeadm mentioned, so I am guessing this is not kops.
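For reference, the IPAM check mentioned above looks roughly like this (the pod name is a placeholder, and the /home/weave/weave path assumes the standard weave-net image):
# find a weave-net pod, then ask it for its IPAM status
kubectl -n kube-system get pods -l name=weave-net -o wide
kubectl -n kube-system exec <weave-net-pod> -c weave -- /home/weave/weave --local status ipam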