kops: CNI+Weave networking and pods stuck in ContainerCreating and Terminating

So, I had a cluster created with kops 1.5.0-alpha4 and running kubernetes 1.5.3. The cluster was created with CNI networking and then Weave was set up with kubectl apply -f https://git.io/weave-kube. It also has private topology and all that. It runs on AWS.
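
For reference, the cluster was originally created with something roughly like this (flags are from memory, so treat them as an approximation rather than the exact command):

$ kops create cluster \
    --name $NAME \
    --state $KOPS_STATE_STORE \
    --zones us-east-1a \
    --topology private \
    --networking cni \
    --yes
# Weave was then installed on top of the bare CNI setup:
$ kubectl apply -f https://git.io/weave-kube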

Everything was working as expected for ~2 months.

Then, I decided to upgrade to kubernetes 1.5.6, so I grabbed kops 1.5.3, and did:

$ kops edit cluster $NAME
# changed kubernetes version from 1.5.2 to 1.5.6
$ kops update cluster $NAME --yes
$ kops rolling-update cluster --yes

And it was still ok for a few days.

Then we also decided to change the nodes from t2.large to r3.large spot instances. It worked for a day, and then I saw that kube-dns-* pods were stuck forever in ContainerCreating, failing due to cni config uninitialized; skipping pod.
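
For completeness, the node size change was done through the instance group spec; it ended up looking roughly like this (values approximate, the maxPrice in particular is just illustrative):

$ kops edit ig nodes --name $NAME
# relevant part of the instance group spec after the change:
spec:
  machineType: r3.large
  maxPrice: "0.10"   # setting maxPrice is what turns the group into spot instances
  minSize: 1
  maxSize: 6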

I tried to force a restart of the pods with kubectl delete pod, but it only made it worse:

kubectl get pods -n kube-system | grep -vi running
NAME                                                    READY     STATUS              RESTARTS   AGE
heapster-564189836-nt32b                                0/1       ContainerCreating   0          1h
kube-dns-782804071-5l26z                                0/4       ContainerCreating   0          3m
kube-dns-782804071-xwh1z                                0/4       Terminating         0          1h
kube-dns-autoscaler-2813114833-1hg98                    0/1       ContainerCreating   0          3m
kube-dns-autoscaler-2813114833-mgxb9                    0/1       Terminating         0          1h
kubernetes-dashboard-3203831700-q14pd                   0/1       ContainerCreating   0          35m
monitoring-influxdb-grafana-v4-q2vx8                    0/2       ContainerCreating   0          1h

I also tried sequentially running kubectl delete pod weave-xyz. The weave pods went back up, but the others were still failing.

@justinsb asked me on Slack to check which weave version I was running, and it was 1.9.4 (I suspect someone upgraded it, but no one confessed 🤷‍♂️).
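
For anyone else hitting this, the running weave version can be checked from the DaemonSet image; this is just how I'd check it, not necessarily the exact command we used:

$ kubectl get ds weave-net -n kube-system \
    -o jsonpath='{.spec.template.spec.containers[*].image}'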

I also noticed that sometimes pods were scheduled but would stay in that state forever, without any error (just Successfully assigned hello-2533203682-94wg9 to ip-172-16-62-80.ec2.internal) instead of the previous CNI error.

The nodes were all ok according to kubectl describe node:

Conditions:
  Type            Status  LastHeartbeatTime                 LastTransitionTime                Reason                      Message
  ----            ------  -----------------                 ------------------                ------                      -------
  OutOfDisk       False   Tue, 11 Apr 2017 18:30:46 -0300   Tue, 11 Apr 2017 17:35:14 -0300   KubeletHasSufficientDisk    kubelet has sufficient disk space available
  MemoryPressure  False   Tue, 11 Apr 2017 18:30:46 -0300   Tue, 11 Apr 2017 17:35:14 -0300   KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure    False   Tue, 11 Apr 2017 18:30:46 -0300   Tue, 11 Apr 2017 17:35:14 -0300   KubeletHasNoDiskPressure    kubelet has no disk pressure
  Ready           True    Tue, 11 Apr 2017 18:30:46 -0300   Tue, 11 Apr 2017 17:35:24 -0300   KubeletReady                kubelet is posting ready status

docker ps on the node where a pod was scheduled would show that only half of the pod was there (just the pause container):

3f2f30271982        gcr.io/google_containers/pause-amd64:3.0     "/pause"                 42 minutes ago      Up 42 minutes                           k8s_POD.d8dbe16c_kube-dns-782804071-5k5tz_kube-system_97a905c5-1ef8-11e7-a4f8-0a75bdb13dae_db12b739

docker logs of the weave containers didn’t contain anything weird; everything looked normal.

Finally, looking into journalctl -u kubelet, we found this:

NetworkPlugin cni failed on the status hook for pod 'kube-dns-782804071-5k5tz' - Unexpected command output Device "eth0" does not exist.

However, ifconfig shows that eth0 does exist on the node.
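
For anyone debugging something similar, these are roughly the things we looked at on the affected node (the usual suspects, not an exact transcript):

# on the node where the pod is stuck
$ journalctl -u kubelet --since "1 hour ago" | grep -i cni   # the error above came from here
$ ls /etc/cni/net.d/                                         # CNI config the kubelet complains about
$ ifconfig eth0                                              # the interface does exist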

@justinsb launched a cluster with the same versions and this bug didn’t happen.

After some time trying to figure it out, we finally managed to fix it:

kubectl delete DaemonSet/weave-net -n kube-system

And then kops edit cluster $NAME, changing the networking from cni to weave directly.
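
The networking change in the cluster spec was essentially this (paraphrasing the diff from memory):

# kops edit cluster $NAME - networking section, before and after
  networking:
-   cni: {}
+   weave: {}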

Then kops update cluster $NAME, with the relevant parts of the output being:

  ManagedFile/dev.contaazul.local-addons-bootstrap
  	Contents
  	                    	...
  	                    	        k8s-addon: storage-aws.addons.k8s.io
  	                    	      version: 1.5.0
  	                    	+   - manifest: networking.weave/v1.9.3.yaml
  	                    	+     name: networking.weave
  	                    	+     selector:
  	                    	+       role.kubernetes.io/networking: "1"
  	                    	+     version: 1.9.3

Then, kops update cluster dev.contaazul.local --yes, which says

Cluster changes have been applied to the cloud.

Changes may require instances to restart: kops rolling-update cluster

But then, kops rolling-update cluster --yes shows

Using cluster from kubectl context: $CTX

NAME			STATUS	NEEDUPDATE	READY	MIN	MAX	NODES
master-us-east-1a	Ready	0		1	1	1	1
nodes			Ready	0		3	1	6	3

No rolling-update required.

Either way, it came back to life.

  • I don’t understand why it stopped working; that’s why I put in all the information I thought could be relevant.
  • The rolling-update in the last part may be a bug… ? (see the --force note below)
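
For the record, if an instance restart really is needed even though kops reports no changes, a forced rolling update should do it (assuming your kops version has the --force flag; I didn’t end up running this, so take it as a suggestion rather than something verified here):

$ kops rolling-update cluster $NAME --force --yes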

If you guys need more info or help with anything, please let me know (also feel free to change the title to something more relevant - I really didn’t know how to name it).

Oh, thanks again @justinsb for helping me fix this 🍻


Most upvoted comments

k8s 1.6.2 is working just fine with flannel 0.7.0 (after compiling kops with your PR updating the CNI package, not before)

I’ve done extensive testing and containers are no longer getting stuck in “ContainerCreating”.

On Wed, Apr 26, 2017 at 12:39 AM Chris Love notifications@github.com wrote:

@while1eq1 https://github.com/while1eq1 so k8s 1.6.2 is working with flannel? This is closing a blocker item for release … want to make sure we are good to go
