kops: BUG REPORT: kube-dns stuck at "Error syncing pod" / "Pod sandbox changed, it will be killed and re-created"

BUG REPORT: A freshly created kops cluster has a problem with kube-dns; the kube-dns pods are stuck cycling between "Error syncing pod" and "Pod sandbox changed, it will be killed and re-created".

kops command

 kops create cluster --cloud=aws --zones=$AWS_ZONE \
   --name=$CLUSTER_NAME \
   --network-cidr=${NETWORK_CIDR} --vpc=${VPC_ID} \
   --bastion=true --topology=private --networking=calico \
   --dns-zone=${DNS_ZONE}

kops version

Version 1.7.0 (git-e04c29d)

kubectl version

Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5", GitCommit:"17d7182a7ccbb167074be7a87f0a68bd00d58d97", GitTreeState:"clean", BuildDate:"2017-08-31T09:14:02Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.2", GitCommit:"922a86cfcd65915a9b2f69f3f193b8907d741d9c", GitTreeState:"clean", BuildDate:"2017-07-21T08:08:00Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

cloud provider: AWS

admin@ip-172-17-3-61:~$ kubectl get events --all-namespaces
NAMESPACE     LASTSEEN   FIRSTSEEN   COUNT     NAME                                     KIND      SUBOBJECT                     TYPE      REASON           SOURCE                                 MESSAGE
kube-system   18s        1h          204       kube-dns-479524115-h5sxc                 Pod                                     Warning   FailedSync       kubelet, ip-172-17-3-61.ec2.internal   Error syncing pod
kube-system   17s        1h          203       kube-dns-479524115-h5sxc                 Pod                                     Normal    SandboxChanged   kubelet, ip-172-17-3-61.ec2.internal   Pod sandbox changed, it will be killed and re-created.
kube-system   9s         1h          209       kube-dns-autoscaler-1818915203-7j0cx     Pod                                     Warning   FailedSync       kubelet, ip-172-17-3-61.ec2.internal   Error syncing pod
kube-system   9s         1h          205       kube-dns-autoscaler-1818915203-7j0cx     Pod                                     Normal    SandboxChanged   kubelet, ip-172-17-3-61.ec2.internal   Pod sandbox changed, it will be killed and re-created.
kube-system   3m         4d          1405      kube-proxy-ip-172-17-3-61.ec2.internal   Pod       spec.containers{kube-proxy}   Normal    Created          kubelet, ip-172-17-3-61.ec2.internal   Created container
kube-system   3m         4d          1405      kube-proxy-ip-172-17-3-61.ec2.internal   Pod       spec.containers{kube-proxy}   Normal    Started          kubelet, ip-172-17-3-61.ec2.internal   Started container
kube-system   3m         4d          1404      kube-proxy-ip-172-17-3-61.ec2.internal   Pod       spec.containers{kube-proxy}   Normal    Pulled           kubelet, ip-172-17-3-61.ec2.internal   Container image "gcr.io/google_containers/kube-proxy:v1.7.2" already present on machine
kube-system   9s         4d          32243     kube-proxy-ip-172-17-3-61.ec2.internal   Pod       spec.containers{kube-proxy}   Warning   BackOff          kubelet, ip-172-17-3-61.ec2.internal   Back-off restarting failed container
kube-system   9s         4d          32243     kube-proxy-ip-172-17-3-61.ec2.internal   Pod                                     Warning   FailedSync       kubelet, ip-172-17-3-61.ec2.internal   Error syncing pod
kube-system   18s        4d          13683     kubernetes-dashboard-4056215011-05kjw    Pod                                     Warning   FailedSync       kubelet, ip-172-17-3-61.ec2.internal   Error syncing pod
kube-system   17s        4d          13628     kubernetes-dashboard-4056215011-05kjw    Pod                                     Normal    SandboxChanged   kubelet, ip-172-17-3-61.ec2.internal   Pod sandbox changed, it will be killed and re-created.
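For completeness, these are the follow-up commands typically used to dig into a repeating "Pod sandbox changed" loop (pod names are taken from the events above; the journalctl calls assume a systemd-based image and are run on the affected node):

 # more detail on the failing pods
 kubectl -n kube-system describe pod kube-dns-479524115-h5sxc
 kubectl -n kube-system logs kube-proxy-ip-172-17-3-61.ec2.internal

 # on ip-172-17-3-61, look for CNI/sandbox errors from kubelet and docker
 sudo journalctl -u kubelet --since "1 hour ago" | grep -iE 'cni|sandbox'
 sudo journalctl -u docker --since "1 hour ago"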

P.S. I had to change the taint on the master node to get past the initial error message "No nodes are available that match all of the following predicates:: PodToleratesNodeTaints (1)". That seems like a bad default?
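For reference, a minimal sketch of that workaround, assuming the node name from the events above and the usual node-role.kubernetes.io/master taint key (check the describe output first, since the exact key can differ):

 # show the taints currently set on the master
 kubectl describe node ip-172-17-3-61.ec2.internal | grep -i taint

 # remove the NoSchedule taint so pods without a matching toleration can schedule there
 kubectl taint nodes ip-172-17-3-61.ec2.internal node-role.kubernetes.io/master:NoSchedule-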

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 10
  • Comments: 15 (4 by maintainers)

Most upvoted comments

I had the same problem while running 1.7.11 with Weave. It started all of a sudden, which is scary because, even though it happened on staging, my production environment has exactly the same setup. Pods were stuck in ContainerCreating.

I tried going from 1.7.11 -> 1.8.4 in a desperate attempt to get things working again, but nothing changed.

The fix was suggested to me by @hubt on the #kops Slack channel: it boils down to upgrading to Weave 2.1.3.
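In case it helps someone else, a rough sketch of bumping the image in place, assuming the stock weave-net DaemonSet in kube-system with containers named weave and weave-npc (if kops manages the addon, a manual edit may get reverted, so updating through the kops channel is the cleaner route):

 # point the weave-net DaemonSet at the 2.1.3 images
 kubectl -n kube-system set image daemonset/weave-net \
   weave=weaveworks/weave-kube:2.1.3 \
   weave-npc=weaveworks/weave-npc:2.1.3

 # watch the pods roll; depending on the DaemonSet update strategy,
 # the old pods may need to be deleted before they pick up the new image
 kubectl -n kube-system get pods -l name=weave-net -w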

Still, it is very frustrating not knowing what the actual cause is. I suspect it might be related to https://github.com/weaveworks/weave/issues/2822, as I saw the "Unexpected command output Device "eth0" does not exist." message several times, and checking the IPAM status as suggested in https://github.com/weaveworks/weave/issues/2822#issuecomment-283113983 gives similar output.
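For anyone checking the same thing, the IPAM state can be inspected roughly like this (the pod name is a placeholder; 6784 is the default weave status port, and /home/weave/weave is where the weave script lives in the weave-kube image):

 # from the node running weave, query the local status endpoint
 curl -s http://127.0.0.1:6784/status/ipam

 # or exec into the weave container of any weave-net pod
 kubectl -n kube-system exec <weave-net-pod> -c weave -- /home/weave/weave --local status ipam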

Has anyone figured out a root cause? I am seeing two different CNI providers, so I do not think it is the providers. Different OSes, so it is not CoreOS or Debian specifically. I am thinking Docker or Kubernetes itself, maybe. Has anyone found anything in the logs? Does anyone have a repeatable set of commands to recreate this? I even see kubeadm mentioned, so I am guessing this is not kops-specific.