kops: CNI+Weave networking and pods stuck in ContainerCreating and Terminating
So, I had a cluster, created with kops 1.5.0-alpha4 and running kubernetes 1.5.3.
The cluster was created with CNI networking, and then Weave was set up with kubectl apply -f https://git.io/weave-kube.
It also uses private topology and all that. It runs on AWS.
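(For context, roughly how such a cluster gets created — sorry, that should read: for context, this is just a sketch of the create command, not my exact invocation, and the zone is a placeholder:)
$ kops create cluster --networking cni --topology private --zones us-east-1a $NAME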
Everything was working as expected for ~2 months.
Then, I decided to upgrade to kubernetes 1.5.6, so I grabbed kops 1.5.3, and did:
$ kops edit cluster $NAME
# changed kubernetes version from 1.5.2 to 1.5.6
$ kops update cluster $NAME --yes
$ kops rolling-update cluster --yes
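(A quick way to confirm the rolling update actually replaced the nodes, plain kubectl, nothing kops-specific:)
$ kubectl get nodes
# the VERSION column should show v1.5.6 once each node has been replaced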
And it was still ok for a few days.
Then we also decided to change the nodes from t2.large to r3.large spot instances. That worked for a day, and then I saw that the kube-dns-* pods were stuck forever in ContainerCreating, failing due to cni config uninitialized; skipping pod.
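(Hedged side note, since I only pieced this together later: kubelet prints “cni config uninitialized” when it can’t find a CNI config on the node, so a quick sanity check on an affected node would be something like:)
$ ls /etc/cni/net.d/   # weave normally drops a 10-weave.conf here
$ ls /opt/cni/bin/     # and its plugin binaries here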
I tried to force a restart of the pods with kubectl delete pod, but it only made it worse:
$ kubectl get pods -n kube-system | grep -vi running
NAME                                    READY   STATUS              RESTARTS   AGE
heapster-564189836-nt32b                0/1     ContainerCreating   0          1h
kube-dns-782804071-5l26z                0/4     ContainerCreating   0          3m
kube-dns-782804071-xwh1z                0/4     Terminating         0          1h
kube-dns-autoscaler-2813114833-1hg98    0/1     ContainerCreating   0          3m
kube-dns-autoscaler-2813114833-mgxb9    0/1     Terminating         0          1h
kubernetes-dashboard-3203831700-q14pd   0/1     ContainerCreating   0          35m
monitoring-influxdb-grafana-v4-q2vx8    0/2     ContainerCreating   0          1h
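If it helps anyone reproducing this, the kubelet events from kubectl describe show the same CNI error per pod (the pod name below is just one of the stuck ones from the list above):
$ kubectl describe pod kube-dns-782804071-5l26z -n kube-system
# the Events section at the bottom shows the FailedSync / CNI errors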
I also tried sequentially deleting the weave pods with kubectl delete pod. The weave pods came back up, but the others were still failing.
@justinsb asked me on Slack to check which weave version I was running, and it was 1.9.4 (I suspect someone upgraded it, but no one confessed 🤷‍♂️).
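(For anyone else checking: the version can be read off the DaemonSet image tags — a sketch, assuming the default weave-net name in kube-system:)
$ kubectl get daemonset weave-net -n kube-system -o jsonpath='{.spec.template.spec.containers[*].image}'
# prints something like weaveworks/weave-kube:1.9.4 weaveworks/weave-npc:1.9.4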
I also noticed that at some point pods were being scheduled but would stay in that state forever without any error (just Successfully assigned hello-2533203682-94wg9 to ip-172-16-62-80.ec2.internal) instead of the previous CNI error.
The nodes were all ok according to kubectl describe node:
Conditions:
  Type            Status  LastHeartbeatTime                 LastTransitionTime                Reason                      Message
  ----            ------  -----------------                 ------------------                ------                      -------
  OutOfDisk       False   Tue, 11 Apr 2017 18:30:46 -0300   Tue, 11 Apr 2017 17:35:14 -0300   KubeletHasSufficientDisk    kubelet has sufficient disk space available
  MemoryPressure  False   Tue, 11 Apr 2017 18:30:46 -0300   Tue, 11 Apr 2017 17:35:14 -0300   KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure    False   Tue, 11 Apr 2017 18:30:46 -0300   Tue, 11 Apr 2017 17:35:14 -0300   KubeletHasNoDiskPressure    kubelet has no disk pressure
  Ready           True    Tue, 11 Apr 2017 18:30:46 -0300   Tue, 11 Apr 2017 17:35:24 -0300   KubeletReady                kubelet is posting ready status
docker ps on a node where a pod was scheduled would show that only half of the pod is there (just the pause container):
3f2f30271982 gcr.io/google_containers/pause-amd64:3.0 "/pause" 42 minutes ago Up 42 minutes k8s_POD.d8dbe16c_kube-dns-782804071-5k5tz_kube-system_97a905c5-1ef8-11e7-a4f8-0a75bdb13dae_db12b739
docker logs on the weave containers didn’t contain anything weird; everything looked normal.
Finally, looking into journalctl -u kubelet, we found this:
NetworkPlugin cni failed on the status hook for pod 'kube-dns-782804071-5k5tz' - Unexpected command output Device "eth0" does not exist.
However, ifconfig shows eth0…
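(Hedged guess about what’s going on there: the status hook seems to check eth0 inside the pod’s network namespace, not on the host, so host-side ifconfig doesn’t say much. Something like this, using the pause container ID from the docker ps output above, peeks inside the pod netns:)
$ PID=$(docker inspect -f '{{.State.Pid}}' 3f2f30271982)
$ sudo nsenter -t $PID -n ip addr show
# if CNI never wired the pod up, only lo shows up and there is indeed no eth0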
@justinsb launched a cluster with the same versions and this bug didn’t happen.
After some time trying to figure it out, we finally managed to fix it:
$ kubectl delete DaemonSet/weave-net -n kube-system
And kops edit cluster $NAME, changing the networking from CNI to weave directly.
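In the cluster spec that’s a one-line change (sketch from memory, field names per the kops cluster spec docs):
  networking:
-   cni: {}
+   weave: {}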
Then, kops update cluster $NAME; the relevant parts of the output:
ManagedFile/dev.contaazul.local-addons-bootstrap
Contents
...
k8s-addon: storage-aws.addons.k8s.io
version: 1.5.0
+ - manifest: networking.weave/v1.9.3.yaml
+ name: networking.weave
+ selector:
+ role.kubernetes.io/networking: "1"
+ version: 1.9.3
Then, kops update cluster dev.contaazul.local --yes, which says
Cluster changes have been applied to the cloud.
Changes may require instances to restart: kops rolling-update cluster
But then, kops rolling-update cluster --yes shows
Using cluster from kubectl context: $CTX
NAME               STATUS  NEEDUPDATE  READY  MIN  MAX  NODES
master-us-east-1a  Ready   0           1      1    1    1
nodes              Ready   0           3      1    6    3
No rolling-update required.
Either way, it came back to life.
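(Side note on that last bit: if the nodes actually do need replacing despite the “No rolling-update required” message, kops has a flag for forcing it — worth double-checking against kops rolling-update --help for your version:)
$ kops rolling-update cluster --force --yes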
- I don’t understand why it stopped working; that’s why I put in all the information I thought could be relevant.
- The rolling-update in the last part may be a bug…?
If you need more info or help with anything, please let me know (also feel free to change the title to something more relevant - I really didn’t know how to name it).
Oh, thanks again @justinsb for helping me fix this 🍻
k8s 1.6.2 is working just fine with flannel 0.7.0 (after compiling kops with your PR updating the CNI package, not before)
I’ve done extensive testing and containers are no longer getting stuck in “ContainerCreating”.