cri-o: intermittent ErrImagePull errors that don't go away once they appear on a given node

What happened?

After updating the nodes in our fleet of Kubernetes clusters (a few thousand nodes in aggregate) from Kubernetes 1.20 / CRI-O 1.20.6 to Kubernetes 1.21 / CRI-O 1.21.6, we observe that a node will sometimes get into a state where a particular image can no longer be pulled onto that one node. We get error messages like:

kubelet[6135]: E0407 15:26:43.115431    6135 kuberuntime_manager.go:864] init container &Container{Name:chmod-volume,Image:redacted.com/redacted/redacted-image:redacted-tag,Command:[sh -c chmod 1777 /tmp],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{cpu: {{200 -3} {<nil>} 200m DecimalSI},memory: {{209715200 0} {<nil>}  BinarySI},},Requests:ResourceList{cpu: {{100 -3} {<nil>} 100m DecimalSI},memory: {{104857600 0} {<nil>} 100Mi BinarySI},},},VolumeMounts:[]VolumeMount{VolumeMount{Name:tmp-volume,ReadOnly:false,MountPath:/tmp,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:nil,SELinuxOptions:nil,RunAsUser:*0,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:*false,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod redacted-pod_redacted_namespace(8cf5a89c-98a5-4b57-9802-cb72ddf25a4a): ErrImagePull: rpc error: code = Unknown desc = Error committing the finished image: error adding layer with blob "sha256:ecda1bc969d753f5553712497752134d01985271a430d3164c498c4d7a9cae8a": error creating layer with ID "62a09a3264ffd51fb1cd3d9846363049c6f1a68879bb4d7e433d1eb802f8516f": Stat /var/lib/containers/storage/overlay/21721fd694065db2bc02f5cb9e8ff346f37df977cf76aefade8909eda563b367: no such file or directory
kubelet[6135]: E0407 15:26:43.115502    6135 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"chmod-volume\" with ErrImagePull: \"rpc error: code = Unknown desc = Error committing the finished image: error adding layer with blob \\\"sha256:ecda1bc969d753f5553712497752134d01985271a430d3164c498c4d7a9cae8a\\\": error creating layer with ID \\\"62a09a3264ffd51fb1cd3d9846363049c6f1a68879bb4d7e433d1eb802f8516f\\\": Stat /var/lib/containers/storage/overlay/21721fd694065db2bc02f5cb9e8ff346f37df977cf76aefade8909eda563b367: no such file or directory\"" pod="redacted-namespace/redacted_pod" podUID=8cf5a89c-98a5-4b57-9802-cb72ddf25a4a
kubelet[6135]: E0407 15:31:45.238422    6135 kuberuntime_manager.go:864] init container &Container{Name:chmod-volume,Image:redacted.com/redacted/redacted-image:redacted-tag,Command:[sh -c chmod 1777 /tmp],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{cpu: {{200 -3} {<nil>} 200m DecimalSI},memory: {{209715200 0} {<nil>}  BinarySI},},Requests:ResourceList{cpu: {{100 -3} {<nil>} 100m DecimalSI},memory: {{104857600 0} {<nil>} 100Mi BinarySI},},},VolumeMounts:[]VolumeMount{VolumeMount{Name:tmp-volume,ReadOnly:false,MountPath:/tmp,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:nil,SELinuxOptions:nil,RunAsUser:*0,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:*false,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod redacted-pod_redacted_namespace(8cf5a89c-98a5-4b57-9802-cb72ddf25a4a): ErrImagePull: rpc error: code = Unknown desc = Error committing the finished image: error adding layer with blob "sha256:ecda1bc969d753f5553712497752134d01985271a430d3164c498c4d7a9cae8a": error creating layer with ID "62a09a3264ffd51fb1cd3d9846363049c6f1a68879bb4d7e433d1eb802f8516f": Stat /var/lib/containers/storage/overlay/21721fd694065db2bc02f5cb9e8ff346f37df977cf76aefade8909eda563b367: no such file or directory
kubelet[6135]: E0407 15:31:45.238483    6135 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"chmod-volume\" with ErrImagePull: \"rpc error: code = Unknown desc = Error committing the finished image: error adding layer with blob \\\"sha256:ecda1bc969d753f5553712497752134d01985271a430d3164c498c4d7a9cae8a\\\": error creating layer with ID \\\"62a09a3264ffd51fb1cd3d9846363049c6f1a68879bb4d7e433d1eb802f8516f\\\": Stat /var/lib/containers/storage/overlay/21721fd694065db2bc02f5cb9e8ff346f37df977cf76aefade8909eda563b367: no such file or directory\"" pod="redacted-namespace/redacted_pod" podUID=8cf5a89c-98a5-4b57-9802-cb72ddf25a4a

If we attempt to manually pull the image using crictl, we get a similar error:

crictl pull redacted.com/redacted/redacted-image:redacted-tag
FATA[0001] pulling image: rpc error: code = Unknown desc = Error committing the finished image: error adding layer with blob "sha256:ecda1bc969d753f5553712497752134d01985271a430d3164c498c4d7a9cae8a": error creating layer with ID "62a09a3264ffd51fb1cd3d9846363049c6f1a68879bb4d7e433d1eb802f8516f": Stat /var/lib/containers/storage/overlay/21721fd694065db2bc02f5cb9e8ff346f37df977cf76aefade8909eda563b367: no such file or directory

The image in question doesn’t show up in crictl images listings.

I can pull the same image from another node in the same cluster using crictl pull.

What did you expect to happen?

Image pulls for a pod eventually succeed.

How can we reproduce it (as minimally and precisely as possible)?

Unsure. Some of the worst-impacted workloads on these nodes provision quite a few pods very quickly that all reference the same image, so that may be an ingredient. However, we also intermittently see DaemonSet pods hit by the same error.
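If it helps anyone trying to reproduce this, one crude way to force that pattern is to pin a batch of pods that all use the same image onto a single node. This is only a sketch; the node and image names are placeholders, and the image should be removed from the node first so every pod actually triggers a pull:

# Rough sketch only: NODE and IMAGE are placeholders. Clear the image from the
# node first (e.g. with crictl rmi) so all pods pull concurrently.
NODE=worker-1
IMAGE=redacted.com/redacted/redacted-image:redacted-tag
for i in $(seq 1 20); do
  kubectl run "pull-race-$i" --image="$IMAGE" --restart=Never \
    --overrides="{\"apiVersion\":\"v1\",\"spec\":{\"nodeName\":\"$NODE\"}}" \
    --command -- sleep 3600 &
done
wait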

Anything else we need to know?

No response

CRI-O and Kubernetes version

$ crio --version
crio version 1.21.6
Version:       1.21.6
GitCommit:     c03e400ec69fb22256a1b9c8f3ae8011a3539582
GitTreeState:  clean
BuildDate:     2022-03-16T17:45:34Z
GoVersion:     go1.16.12
Compiler:      gc
Platform:      linux/amd64
Linkmode:      dynamic
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.10", GitCommit:"a7a32748b5c60445c4c7ee904caf01b91f2dbb71", GitTreeState:"clean", BuildDate:"2022-02-16T11:24:04Z", GoVersion:"go1.16.14", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.10", GitCommit:"a7a32748b5c60445c4c7ee904caf01b91f2dbb71", GitTreeState:"clean", BuildDate:"2022-02-16T11:18:16Z", GoVersion:"go1.16.14", Compiler:"gc", Platform:"linux/amd64"}


OS version

$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="8.5 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.5"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.5 (Green Obsidian)"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky Linux"
ROCKY_SUPPORT_PRODUCT_VERSION="8"
$ uname -a
Linux ip-10-117-235-108.eu-west-1.compute.internal 4.18.0-348.20.1.el8_5.x86_64 #1 SMP Thu Mar 10 20:59:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Additional environment details (AWS, VirtualBox, physical, etc.)

Cluster is running on AWS and was deployed using custom Terraform and Puppet code.

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 8
  • Comments: 79 (60 by maintainers)

Most upvoted comments

I had this happen today. I removed the offending entry from layers.json and pulls appeared to start working again. After removing the entry and re-pulling the image, I verified that the entry was back in layers.json.

I did drain the node and clean off all images before putting the node back in service just for good measure, so I don’t really know if things would have gone sideways had I left it in production with active workloads.
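For reference, a rough sketch of that drain-and-clean step (the node name is a placeholder, and crictl flags may differ between versions):

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# then, on the node itself:
sudo crictl rmi --all        # wipe all images so everything is re-pulled cleanly
kubectl uncordon <node-name>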

To add to the discussion, we have had this happen several times on nodes running cri-o, and it seems to be triggered when multiple pods that use the same image start at the (exact) same time on the same node. In this occurrence the image in question was 243 MB; pulling the image fresh on an identical node in the same datacenter took 6.3 seconds.

containers/image and blob-info-cache-v1.boltdb are irrelevant here; this is a containers/storage crash recovery and/or bookkeeping problem.

Typically this could happen with CRI-O if a pull is interrupted in the middle of an image pull: by the client aborting the pull (is that possible?), by the CRI-O process being terminated, or possibly by the node being rebooted.

There were significant improvements in this area late last year; they require containers/storage at least 1.45.0 (with essentially no practical possibility of backporting, due to the size of the changes).

So that means CRI-O ≥ 1.26.2, and from my point of view there is quite a low likelihood of learning anything from diagnosing failures in any earlier version.


Also, to diagnose a root cause in these kinds of failures, it would be necessary to collect a lot of metadata from /var/lib/containers/storage, ideally everything but the layer contents; that would at least allow determining what the failure is. Seeing how that failure happened would require collecting logs going some way back, certainly from before the failure first manifests. The logging before c/storage ~1.45.0 is known to be insufficient to determine the root cause, and due to a lack of recent experience I'm not sure the c/storage ≥ 1.45.0 logging is sufficient either.
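In case it helps anyone gathering data for a report, one possible way to capture that metadata is to archive the storage directory while skipping the layer contents. The exclude patterns below are assumptions about the overlay layout, not an official procedure:

# Sketch: collect c/storage metadata, leaving out the actual layer contents.
sudo tar -czf storage-metadata.tar.gz \
  --exclude='overlay/*/diff' \
  --exclude='overlay/*/merged' \
  -C /var/lib/containers storage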

Are the fetched layer IDs stored in that boltdb database? If so, is it possible for this to be a data race?

Yes, it appears so. I was trying to debug where this bad state could possibly be stored: I grepped through the storage directories to figure out where these IDs were coming from, and the only place they appeared on disk was blob-info-cache-v1.boltdb.

I used a random tool to dump the db to JSON. For the error:

Error: writing blob: adding layer with blob "sha256:34ec39b6e3f542082149b23b1d0481083a5ad896dc929138818214d6c4a57097": error creating layer with ID "a39bcc2ef8e5aec7af3980256ac406e71b87888d67d21e108cbe1d1b3dc6d419": Stat /usr/share/hubspot/containers/storage/overlay/662cc199e4b7b3b99cf2de5aee17c2000f2772886aca87ffe96e56c6c32ae8a8: no such file or directory

We can see that there is a mapping for sha256:34ec39b6e3f542082149b23b1d0481083a5ad896dc929138818214d6c4a57097 to its uncompressed digest and a reverse mapping too. There is also an entry for it in a knownLocations map.
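For anyone who wants to poke at that cache themselves: this is not the tool used above, but one option (an assumption on my part) is the bbolt CLI from go.etcd.io/bbolt. Work on a copy of the file to avoid fighting over the database lock; the path below is the usual root location and may differ with a custom graph root.

# Sketch: inspect the blob-info cache with the bbolt CLI (path and bucket names are illustrative).
cp /var/lib/containers/cache/blob-info-cache-v1.boltdb /tmp/bic.boltdb
bbolt buckets /tmp/bic.boltdb              # list top-level buckets
bbolt keys /tmp/bic.boltdb <bucket-name>   # list keys in a given bucket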

That said, I blew away that cache file and attempted another pull with podman, and it did not fix the problem, so the bad state must not be in that file.

Digging a little further, the 662cc199e4b7b3b99cf2de5aee17c2000f2772886aca87ffe96e56c6c32ae8a8 ID appears to be stored in /var/lib/containers/storage/overlay-layers/layers.json:

    {
        "id": "662cc199e4b7b3b99cf2de5aee17c2000f2772886aca87ffe96e56c6c32ae8a8",
        "parent": "34bdf373ca69e38d30cf57c1ae2e66a71627aead1eb206aabef7efba3ec9bd62",
        "created": "2023-03-30T12:31:12.235362536Z",
        "compressed-diff-digest": "sha256:d842307651c36561618eddb5020df85d858c89036e79b45b4c74082835605790",
        "compressed-size": 5697001,
        "diff-digest": "sha256:48252b6983a41f51d71cb60d1517450ea59873113981ed660f6337af0c7ac9e8",
        "diff-size": 24644608,
        "compression": 2,
        "uidset": [
            0
        ],
        "gidset": [
            0
        ]
    }

If I manually remove that entry from layers.json, podman happily pulls the layer, so that’s where our problematic state actually is.
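For anyone needing the same workaround, here is a sketch of that manual cleanup with jq. The paths and the layer ID are illustrative; the ID is the one from the "Stat .../overlay/<id>: no such file or directory" part of the error, and CRI-O should not be running while the file is edited:

# Sketch of the manual layers.json cleanup described above.
LAYERS=/var/lib/containers/storage/overlay-layers/layers.json
BAD_ID=662cc199e4b7b3b99cf2de5aee17c2000f2772886aca87ffe96e56c6c32ae8a8

sudo systemctl stop crio
sudo cp "$LAYERS" "$LAYERS.bak"                       # keep a backup
sudo jq --arg id "$BAD_ID" 'map(select(.id != $id))' "$LAYERS.bak" \
  | sudo tee "$LAYERS" > /dev/null
sudo systemctl start crio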

Looking at the code that manages that file, it does seem like there are plenty of opportunities for race conditions/locking bugs, for example: https://github.com/containers/storage/blob/0f3b7423dca78dec2bfd47fb793586dcf5350406/layers.go#L559-L588

It appears that code puts a lot of effort into avoiding re-reading and re-writing layers.json unnecessarily. If this bug doesn’t occur on podman systems, that may be the answer here. As far as I know, podman processes are stateless, so layers.json is re-read every time a pull occurs, whereas in the cri-o case the long-running daemon could be holding stale state in memory.

Perhaps a reproducer is to just have a bunch of processes use cri-o to pull and remove images that have shared layers in parallel.
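A minimal sketch of that idea, assuming crictl on a test node and a couple of tags that share layers (image names are placeholders):

# Hammer the same layers with parallel pull/remove cycles via crictl.
IMAGES="registry.example.com/app:v1 registry.example.com/app:v2"
for n in $(seq 1 10); do
  (
    for img in $IMAGES; do
      crictl pull "$img"
      crictl rmi "$img" || true     # concurrent removals may race harmlessly
    done
  ) &
done
wait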