cri-o: Sandbox already exists after node reboot
Description
Steps to reproduce the issue:
- Set up Kubernetes with CRI-O 1.11.1
- Reboot a node
Describe the results you received:
After the reboot, no pods can be created. The kubelet always gets an error from CRI-O:
Aug 16 10:49:35 pharos-worker-0 kubelet[1503]: W0816 10:49:35.119467 1503 status_manager.go:482] Failed to get status for pod "pharos-proxy-pharos-worker-0_kube-system(f2ed2b7c8d4f0463b262f46b7a3856b9)": Get https://localhost:6443/api/v1/namespaces/kube-system/pods/pharos-proxy-pharos-worker-0: dial tcp 127.0.0.1:6443: connect: connection refused
Aug 16 10:49:35 pharos-worker-0 kubelet[1503]: E0816 10:49:35.134458 1503 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = pod sandbox with name "k8s_pharos-proxy-pharos-worker-0_kube-system_f2ed2b7c8d4f0463b262f46b7a3856b9_0" already exists
Aug 16 10:49:35 pharos-worker-0 kubelet[1503]: E0816 10:49:35.134538 1503 kuberuntime_sandbox.go:56] CreatePodSandbox for pod "pharos-proxy-pharos-worker-0_kube-system(f2ed2b7c8d4f0463b262f46b7a3856b9)" failed: rpc error: code = Unknown desc = pod sandbox with name "k8s_pharos-proxy-pharos-worker-0_kube-system_f2ed2b7c8d4f0463b262f46b7a3856b9_0" already exists
Aug 16 10:49:35 pharos-worker-0 kubelet[1503]: E0816 10:49:35.134552 1503 kuberuntime_manager.go:646] createPodSandbox for pod "pharos-proxy-pharos-worker-0_kube-system(f2ed2b7c8d4f0463b262f46b7a3856b9)" failed: rpc error: code = Unknown desc = pod sandbox with name "k8s_pharos-proxy-pharos-worker-0_kube-system_f2ed2b7c8d4f0463b262f46b7a3856b9_0" already exists
CRI-O logs also show some errors after reboot:
-- Logs begin at Thu 2018-08-16 10:49:11 UTC, end at Thu 2018-08-16 11:59:52 UTC. --
Aug 16 10:49:14 pharos-worker-0 systemd[1]: Starting Open Container Initiative Daemon...
Aug 16 10:49:14 pharos-worker-0 sysctl[1490]: net.ipv4.ip_forward = 1
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.809774896Z" level=info msg="[graphdriver] using prior storage driver: overlay"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.819928928Z" level=info msg="CNI network pharos (type=weave-net) is used from /etc/cni/net.d/00-pharos.conflist"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.819951403Z" level=info msg="Initial CNI setting succeeded"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.874817377Z" level=warning msg="could not restore sandbox 563be37a1089ed4e57ccb41fcad923262514189111aee88fd4b03e82af22c3b9 container 563be37a1089ed4e57ccb41fcad923262514189111aee88fd4b03e82af22c3b9: open /var/run/containers/storage/overlay-containers/563be37a1089ed4e57ccb41fcad923262514189111aee88fd4b03e82af22c3b9/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.874882490Z" level=warning msg="could not restore sandbox 45f7b1b7037e9f0c89de29bae6e9851ac4ce94aff11e7ff65ca2f998445d6439 container 45f7b1b7037e9f0c89de29bae6e9851ac4ce94aff11e7ff65ca2f998445d6439: open /var/run/containers/storage/overlay-containers/45f7b1b7037e9f0c89de29bae6e9851ac4ce94aff11e7ff65ca2f998445d6439/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875014642Z" level=warning msg="could not restore sandbox ab131c0e092e9e8347b56910f612bf50b89c6ead0c90bbec09ee95f076f461d2 container ab131c0e092e9e8347b56910f612bf50b89c6ead0c90bbec09ee95f076f461d2: open /var/run/containers/storage/overlay-containers/ab131c0e092e9e8347b56910f612bf50b89c6ead0c90bbec09ee95f076f461d2/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875069151Z" level=warning msg="could not restore sandbox 5a87873e9591cc308c4cad490469f8ddcf54fc71b85d64056c1f873d75f59e58 container 5a87873e9591cc308c4cad490469f8ddcf54fc71b85d64056c1f873d75f59e58: open /var/run/containers/storage/overlay-containers/5a87873e9591cc308c4cad490469f8ddcf54fc71b85d64056c1f873d75f59e58/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875107871Z" level=warning msg="could not restore container d22575619707850de9fc1be3afb2ebf4a21980a82a8c139daf30a3b3cfc90a50: open /var/run/containers/storage/overlay-containers/d22575619707850de9fc1be3afb2ebf4a21980a82a8c139daf30a3b3cfc90a50/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875221658Z" level=warning msg="could not restore container 50255cb7b734aa10719fcc0712e5e6e65d3bc8bbe5b8e4e637304a431c483645: open /var/run/containers/storage/overlay-containers/50255cb7b734aa10719fcc0712e5e6e65d3bc8bbe5b8e4e637304a431c483645/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875336372Z" level=warning msg="could not restore container fb3d0ea85b6ac97a23becefab610a76108be1086c551efb576c9dd468924b706: open /var/run/containers/storage/overlay-containers/fb3d0ea85b6ac97a23becefab610a76108be1086c551efb576c9dd468924b706/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875433800Z" level=warning msg="could not restore container a942ea6b2b97c4528e4d43d8fa9278776e3fc2672838dea16c09678b790f9d6f: open /var/run/containers/storage/overlay-containers/a942ea6b2b97c4528e4d43d8fa9278776e3fc2672838dea16c09678b790f9d6f/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875471488Z" level=warning msg="could not restore container 6bff7a83a7d6d91bbc4fb745ff3649eb1f13ec189dbe387ca8cfda0d30b58a4e: open /var/run/containers/storage/overlay-containers/6bff7a83a7d6d91bbc4fb745ff3649eb1f13ec189dbe387ca8cfda0d30b58a4e/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875559127Z" level=warning msg="could not restore container 65c92394ff515dd385a83874c0faf86450c99ed8c5bd16d36ef9e38e95abfd7c: open /var/run/containers/storage/overlay-containers/65c92394ff515dd385a83874c0faf86450c99ed8c5bd16d36ef9e38e95abfd7c/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 crio[1507]: time="2018-08-16 10:49:14.875684359Z" level=warning msg="could not restore container ab2aefc0221c18153bc1ada4096baee42b918d8fe554b68e746a23731d938e2f: open /var/run/containers/storage/overlay-containers/ab2aefc0221c18153bc1ada4096baee42b918d8fe554b68e746a23731d938e2f/userdata/config.json: no such file or directory"
Aug 16 10:49:14 pharos-worker-0 systemd[1]: Started Open Container Initiative Daemon.
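To see which sandboxes and containers CRI-O failed to restore, the IDs can be pulled out of the warnings above. This is a minimal sketch, not part of the original report: the function name `unrestored_ids` is hypothetical, and it assumes the log lines have the same shape as the ones shown here (a 64-character hex ID after `overlay-containers/`).

```shell
# Hypothetical helper: read CRI-O log text on stdin and print each ID
# that appears in a "could not restore" warning, one per line, deduplicated.
unrestored_ids() {
  grep 'could not restore' \
    | grep -oE 'overlay-containers/[0-9a-f]{64}' \
    | cut -d/ -f2 \
    | sort -u
}

# Possible usage on an affected node (assumes the systemd unit is named "crio"):
#   journalctl -u crio -b --no-pager | unrestored_ids
```

Each ID printed corresponds to a directory under /var/run/containers/storage/overlay-containers/ whose userdata/config.json was missing at startup.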
In this case the pod to be created/started was running fine before reboot. It’s a static pod in /etc/kubernetes/manifest/...
We’ve tried many variations of the shutdown process, but cannot get crio to start up properly after a reboot.
The only way to “recover” seems to be with:
systemctl stop crio
rm -rf /var/lib/containers
rm -rf /var/run/containers
systemctl start crio
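The same workaround can be wrapped in a small function with a dry-run guard. This is only a sketch: the names `reset_crio_storage`, `run_cmd`, and the `DRY_RUN` variable are hypothetical and not from the original report. Note that /var/lib/containers is the default location for container storage, so removing it is expected to delete all pulled images and container data on the node.

```shell
# Hypothetical helper: print the command when DRY_RUN is 1 (the default),
# otherwise execute it.
run_cmd() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Sketch of the recovery steps listed above; stops at the first failing step.
reset_crio_storage() {
  run_cmd systemctl stop crio &&
  run_cmd rm -rf /var/lib/containers &&   # removes images and container data
  run_cmd rm -rf /var/run/containers &&   # removes runtime state
  run_cmd systemctl start crio
}

# Possible usage:
#   reset_crio_storage              # only prints the four commands
#   DRY_RUN=0 reset_crio_storage    # actually runs them (as root)
```

Defaulting to dry-run is a deliberate choice here, since the second step is destructive.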
Describe the results you expected:
Kubelet & CRI-O to be able to start/create needed pods after reboot.
Additional information you deem important (e.g. issue happens only occasionally):
Based on our testing, reboots seem to work properly with the 1.11.0 release.
Output of crio --version:
root@pharos-worker-0:~# crio --version
crio version 1.11.1
Additional environment details (AWS, VirtualBox, physical, etc.):
Tested on DigitalOcean (DO) with Ubuntu Xenial, Ubuntu Bionic, and CentOS 7, with the same results.
About this issue
- State: closed
- Created 6 years ago
- Comments: 22 (14 by maintainers)
Commits related to this issue
- lib: update crio v1.11.1 -> v1.11.2 v1.11.2 includes the following fix: https://github.com/kubernetes-incubator/cri-o/issues/1742 — committed to schu/kubedee by schu 6 years ago
- Update crio v1.11.1 -> v1.11.2 v1.11.2 includes the following fix: https://github.com/kubernetes-incubator/cri-o/issues/1742 — committed to kinvolk-archives/kubernetes-the-hard-way-vagrant by schu 6 years ago
- Update crio v1.11.1 -> v1.11.2 v1.11.2 includes the following fix: https://github.com/kubernetes-incubator/cri-o/issues/1742 — committed to kinvolk-archives/kubernetes-the-hard-way-vagrant by schu 6 years ago
- test: make container runtime configurable, Docker as default For testing purposes it is useful to switch back and forth between Docker and cri-o without having to revert commits. Docker is now again... — committed to intel/oim by pohly 6 years ago
@runcom I think I did try it successfully, but I cannot remember for sure. 😃 We’re running and shipping with Pharos 1.11.6 already.