containerd: Failed to pull image (unexpected commit digest)
Description
A container image sometimes fails to be pulled from a remote repository.
For example, when I created a k8s DaemonSet, 4 Pods were created successfully but 1 was not. When I deleted the failing Pod manually, the image was re-pulled successfully, but it is never re-pulled unless the Pod is deleted manually.
Steps to reproduce the issue:
- Download this manifest template.
- Replace {{ .unbound }} with quay.io/cybozu/unbound:1.9.5.1 (cf. this image is built with this Dockerfile).
- Apply the manifest to a k8s cluster whose workers use containerd as the container runtime (see the sketch after these steps).
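A minimal sketch of the render-and-apply step, assuming the template is saved as node-dns-template.yaml and the rendered output as node-dns.yaml (both filenames are hypothetical):

```sh
# Substitute the image placeholder and apply the rendered manifest
sed 's|{{ .unbound }}|quay.io/cybozu/unbound:1.9.5.1|g' node-dns-template.yaml > node-dns.yaml
kubectl apply -f node-dns.yaml
```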
Describe the results you received:
Sometimes, a Pod falls into the ImagePullBackOff status with the following messages.
$ kubectl -n kube-system describe po node-dns-mnw9k
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 30m default-scheduler Successfully assigned kube-system/node-dns-mnw9k to 10.0.0.101
Warning FailedMount 30m (x3 over 30m) kubelet, 10.0.0.101 MountVolume.SetUp failed for volume "config-volume" : configmap "node-dns" not found
Normal Pulling 30m kubelet, 10.0.0.101 Pulling image "quay.io/cybozu/unbound:1.9.5.1"
Warning Failed 28m (x2 over 28m) kubelet, 10.0.0.101 Failed to pull image "quay.io/cybozu/unbound:1.9.5.1": rpc error: code = FailedPrecondition desc = failed to pull and unpack image "quay.io/cybozu/unbound:1.9.5.1": failed commit on ref "layer-sha256:35c102085707f703de2d9eaad8752d6fe1b8f02b5d2149f1d8357c9cc7fb7d0a": unexpected commit digest sha256:e3c9baab3234687950948685f582f713f19c85a3f989ef797ead63bb084e25b6, expected sha256:35c102085707f703de2d9eaad8752d6fe1b8f02b5d2149f1d8357c9cc7fb7d0a: failed precondition
Warning Failed 28m (x2 over 28m) kubelet, 10.0.0.101 Error: ErrImagePull
Describe the results you expected:
The image is pulled successfully.
Output of containerd --version:
containerd github.com/containerd/containerd v1.3.2 ff48f57fc83a8c44cf4ad5d672424a98ba37ded6
Any other relevant information:
The same problem was reported and seems to have been solved here about 2 years ago, but recently one developer has reported it again.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 27 (10 by maintainers)
I found a way to work around this temporarily. Just clean the image layer cache folder in
${containerd folder}/io.containerd.content.v1.content/ingest
. containerd does not clean this cache automatically when some layer data is broken.
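For a stock install, assuming containerd runs as a systemd service and uses the default root of /var/lib/containerd (adjust the path for your setup), that cleanup looks roughly like:

```sh
# Stop containerd so nothing writes to the content store while we clean it
sudo systemctl stop containerd
# Remove the partially-downloaded (ingested) layer data left behind by the failed pull
sudo rm -rf /var/lib/containerd/io.containerd.content.v1.content/ingest/*
sudo systemctl start containerd
```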
https://github.com/containerd/containerd/blob/v1.5.2/content/local/store.go#L503
errors.Wrapf on a nil error returns nil, which caused the error case where the ref is incorrect to be misinterpreted as a success case.
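A minimal standalone sketch of that failure mode with github.com/pkg/errors (not the containerd code itself): when the underlying error is nil, the wrapped "digest mismatch" error silently disappears.

```go
package main

import (
	"fmt"

	"github.com/pkg/errors"
)

func commit(gotDigest, wantDigest string) error {
	var underlying error // nil: no I/O error occurred, only the digest check failed
	if gotDigest != wantDigest {
		// Wrapf returns nil when the error it wraps is nil, so this
		// "failure" branch actually reports success to the caller.
		return errors.Wrapf(underlying, "unexpected commit digest %s, expected %s", gotDigest, wantDigest)
	}
	return nil
}

func main() {
	err := commit("sha256:e3c9ba...", "sha256:35c102...")
	fmt.Println(err) // prints <nil>: the mismatch is swallowed
}
```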
Hi @fuweid, I was able to reproduce it locally using toxiproxy.
The issue is: the connection is getting closed. I can reproduce this by using toxiproxy to close the connection after X bytes.
I'm using the following Dockerfile. Build and start the image; inside the container I'm:
- editing /etc/hosts to pull the image through the TCP proxy (via storage.googleapis.com)
I hope this helps!
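For reference, a rough sketch of this kind of toxiproxy setup (the exact CLI flags and the upstream address are assumptions, not the commenter's original files): the limit_data toxic closes each connection after a fixed number of bytes, which truncates large layer downloads.

```sh
# Start the toxiproxy server (listens on localhost:8474 by default)
toxiproxy-server &

# Proxy the registry's blob storage; <upstream-ip> is the real address of
# storage.googleapis.com, since /etc/hosts now points that name at the proxy
toxiproxy-cli create -l 0.0.0.0:443 -u <upstream-ip>:443 registry-proxy

# Cut every connection after ~1 MiB so big layers never finish downloading
toxiproxy-cli toxic add -t limit_data -a bytes=1048576 registry-proxy
```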
Thank you so much! That helped for me as well! For reference, in k3s the folder is:
/var/lib/rancher/k3s/agent/containerd/io.containerd.content.v1.content/ingest
Until the fix is released and gets rolled out into distributions, I'll share my workaround. It seems the ctr command line is able to fix broken image cache layers, so you can run something like the sketch below.
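A guess at the kind of command meant, assuming the kubelet's images live in the default k8s.io containerd namespace; forcing a fresh pull with ctr should re-download and commit the affected layer:

```sh
# Force a fresh pull in the namespace the kubelet uses (k8s.io by default)
sudo ctr -n k8s.io images pull quay.io/cybozu/unbound:1.9.5.1
```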
Hi @fuweid: I just used the proxy to simulate the issue, which happens somewhere in our network (somebody closing the TCP connection). In our network this seems to always happen when the layer is big and it takes too long to download from the registry.
So I see 2 possible ways to resolve this:
I'd be fine with both (of course I would be happier if both sides were resolved).
Thank you for your help!