cri-o: creating read-write layer with ID No such file or directory - crio reading from overlay instead of overlay2?

Description

We have a Kubernetes cluster (1.16.9) in which one of the nodes runs cri-o (cri-o://1.16.6). From time to time a strange error blocks pods from starting:

Warning  FailedCreatePodSandBox  2s (x4 over 43s)  kubelet, kube6     (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "some-sandbox-name": error creating read-write layer with ID "a5021e65186da551b712f7dd743d712833e5f75fc727c6f937d421897d2eb9d6": Stat /var/lib/containers/storage/overlay/e17133b79956ad6f69ae7f775badd1c11bad2fc64f0529cab863b9d12fbaa5c4: no such file or directory

When I check that path, it indeed doesn’t exist, but:

  1. crio is set to use overlay2, so I’m not sure why it tries to load the layer from /var/lib/containers/storage/overlay
  2. When I check the same path under overlay2 - /var/lib/containers/storage/overlay2/e17133b79956ad6f69ae7f775badd1c11bad2fc64f0529cab863b9d12fbaa5c4 - it does exist.

Is this some stale layer reference issue? If so, where should I look to clean it up? Can crictl validate the layer tree and remove stale data? What other reasons could cause this behaviour?
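For reference, the checks described above can be sketched as a small shell script. The layer ID is the one from the error message; `check_layer` is a hypothetical helper for illustration, not part of crio or crictl:

```shell
#!/bin/sh
# Print "present" or "missing" for a layer directory (illustrative helper).
check_layer() {
    if [ -d "$1/$2" ]; then echo present; else echo missing; fi
}

# Layer ID taken from the error message above.
LAYER="e17133b79956ad6f69ae7f775badd1c11bad2fc64f0529cab863b9d12fbaa5c4"

# Which graph driver is configured (the driver key in storage.conf)?
grep -E '^[[:space:]]*driver' /etc/containers/storage.conf 2>/dev/null || true

# Does the layer exist under either candidate directory?
for dir in /var/lib/containers/storage/overlay /var/lib/containers/storage/overlay2; do
    echo "$dir/$LAYER: $(check_layer "$dir" "$LAYER")"
done

# Layer metadata lives under overlay-layers; stale entries there can
# reference directories that no longer exist on disk.
ls /var/lib/containers/storage/overlay-layers/ 2>/dev/null | head
```

On a node showing the symptom above, the loop would report the layer missing under overlay/ but present under overlay2/.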

Steps to reproduce the issue:

  1. Create some pod?

Describe the results you received: Pod never starts

Describe the results you expected: Pod should run

Additional information you deem important (e.g. issue happens only occasionally):

Output of crio --version:

crio version 1.16.6
commit: "af8faf448858335f9645b896120167d08caf7156-dirty"

Our cluster runs on bare metal. The node with crio on it is:

NAME    STATUS                     ROLES    AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
...
kube6   Ready                      node     512d    v1.16.9   10.200.0.15   <none>        Ubuntu 18.04.4 LTS   5.3.0-51-generic    cri-o://1.16.6
...

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 51 (24 by maintainers)

Most upvoted comments

OK: full system reset then. I would stop cri-o, reboot the node, rm -rf /var/{run,lib}/containers, then start kubelet and cri-o.
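The reset above can be sketched as a script. The `DRY_RUN` guard is my addition (not in the original comment), since the `rm -rf` wipes all container state on the node:

```shell
#!/bin/sh
# Full node reset as suggested above. DRY_RUN=1 (the default here, added
# as a safety guard) only prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run systemctl stop kubelet
run systemctl stop crio
# (reboot the node here, before removing state)
run rm -rf /var/run/containers /var/lib/containers
run systemctl start crio
run systemctl start kubelet
```

Run with `DRY_RUN=0` only once you are sure losing all local images and containers on the node is acceptable; everything will be re-pulled and re-created afterwards.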