containerd: Failed to pull image (unexpected commit digest)

Description

A container image sometimes fails to be pulled from a remote repository.

For example, when I created a k8s DaemonSet, 4 Pods were created successfully but 1 was not. When I deleted the failing Pod manually, the image was re-pulled successfully, but the image is never re-pulled unless the Pod is deleted manually.

Steps to reproduce the issue:

  1. Download this manifest template.
  2. Replace {{ .unbound }} with quay.io/cybozu/unbound:1.9.5.1 (cf. this image is built with this Dockerfile).
  3. Apply the manifest to a k8s cluster whose workers use containerd as the container runtime.

Describe the results you received:

Sometimes, a Pod falls into the ImagePullBackOff status with the following events.

$ kubectl -n kube-system describe po node-dns-mnw9k
...
Events:
  Type     Reason       Age                  From                 Message
  ----     ------       ----                 ----                 -------
  Normal   Scheduled    30m                  default-scheduler    Successfully assigned kube-system/node-dns-mnw9k to 10.0.0.101
  Warning  FailedMount  30m (x3 over 30m)    kubelet, 10.0.0.101  MountVolume.SetUp failed for volume "config-volume" : configmap "node-dns" not found
  Normal   Pulling      30m                  kubelet, 10.0.0.101  Pulling image "quay.io/cybozu/unbound:1.9.5.1"
  Warning  Failed       28m (x2 over 28m)    kubelet, 10.0.0.101  Failed to pull image "quay.io/cybozu/unbound:1.9.5.1": rpc error: code = FailedPrecondition desc = failed to pull and unpack image "quay.io/cybozu/unbound:1.9.5.1": failed commit on ref "layer-sha256:35c102085707f703de2d9eaad8752d6fe1b8f02b5d2149f1d8357c9cc7fb7d0a": unexpected commit digest sha256:e3c9baab3234687950948685f582f713f19c85a3f989ef797ead63bb084e25b6, expected sha256:35c102085707f703de2d9eaad8752d6fe1b8f02b5d2149f1d8357c9cc7fb7d0a: failed precondition
  Warning  Failed       28m (x2 over 28m)    kubelet, 10.0.0.101  Error: ErrImagePull

Describe the results you expected:

The image is pulled successfully.

Output of containerd --version:

containerd github.com/containerd/containerd v1.3.2 ff48f57fc83a8c44cf4ad5d672424a98ba37ded6

Any other relevant information:

The same problem was reported and seems to have been solved here about 2 years ago, but recently one developer has reported it again.

Most upvoted comments

I found a way to solve this temporarily.

Just clean the image layer cache folder in ${containerd folder}/io.containerd.content.v1.content/ingest.

Containerd does not clean this cache automatically when some layer data is broken.
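
For reference, here is a minimal sketch of that cleanup as a small Go program. The path below is an assumption based on containerd's default root (/var/lib/containerd); adjust it for your installation (e.g. k3s uses a different root, as noted in a later comment), and run it only while no pull is in progress, since it deletes in-flight download data.

// cleanup-ingest: remove leftover ingest entries so the next pull starts clean.
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	// Assumes containerd's default root; adjust for your setup.
	ingest := "/var/lib/containerd/io.containerd.content.v1.content/ingest"

	entries, err := os.ReadDir(ingest)
	if err != nil {
		log.Fatalf("reading %s: %v", ingest, err)
	}
	for _, e := range entries {
		p := filepath.Join(ingest, e.Name())
		if err := os.RemoveAll(p); err != nil {
			log.Printf("failed to remove %s: %v", p, err)
			continue
		}
		log.Printf("removed leftover ingest entry %s", p)
	}
}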

https://github.com/containerd/containerd/blob/v1.5.2/content/local/store.go#L503

errors.Wrapf on a nil error returns nil, so the error case where the ref is incorrect was misinterpreted as a success case.
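
To illustrate that pitfall, here is a minimal, self-contained sketch (not containerd's actual code; validateRef is a made-up stand-in) showing how errors.Wrapf with a nil error turns a failed check into an apparent success:

package main

import (
	"fmt"

	"github.com/pkg/errors"
)

// validateRef is a hypothetical stand-in for a check that should fail
// when the ref does not match what the caller expects.
func validateRef(got, want string) error {
	if got != want {
		var err error // nil: there is no underlying error to wrap
		// BUG: errors.Wrapf(nil, ...) returns nil, so the mismatch
		// is silently reported as success.
		return errors.Wrapf(err, "unexpected ref %q, expected %q", got, want)
	}
	return nil
}

func main() {
	if err := validateRef("layer-sha256:aaa", "layer-sha256:bbb"); err != nil {
		fmt.Println("mismatch detected:", err)
	} else {
		fmt.Println("mismatch went unnoticed: err is nil")
	}
	// The fix is to construct a real error instead of wrapping nil:
	fmt.Println(errors.Errorf("unexpected ref %q, expected %q", "a", "b"))
}

Running this prints the "mismatch went unnoticed" branch for the buggy path, while errors.Errorf produces a real, non-nil error.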

Hi @fuweid 😃 I was able to reproduce it locally using toxiproxy.

The issue is: the connection is getting closed. I can reproduce this by using toxiproxy to close the connection after X bytes.

I'm using the following Dockerfile:

FROM ubuntu:20.04

RUN apt-get update && apt-get install -y wget dnsutils
RUN wget -O /usr/local/bin/toxiproxy-cli https://github.com/Shopify/toxiproxy/releases/download/v2.1.4/toxiproxy-cli-linux-amd64
RUN wget -O /usr/local/bin/toxiproxy-server https://github.com/Shopify/toxiproxy/releases/download/v2.1.4/toxiproxy-server-linux-amd64
RUN chmod +x /usr/local/bin/toxiproxy-cli /usr/local/bin/toxiproxy-server

RUN wget -O /tmp/containerd.tar.gz https://github.com/containerd/containerd/releases/download/v1.4.4/containerd-1.4.4-linux-amd64.tar.gz \
  && tar -xvzf /tmp/containerd.tar.gz -C /usr/local/ \
  && rm /tmp/containerd.tar.gz
RUN /usr/bin/mkdir /etc/containerd \
  && containerd config default > /etc/containerd/config.toml
RUN wget https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.21.0/crictl-v1.21.0-linux-amd64.tar.gz \
  && tar zxvf crictl-v1.21.0-linux-amd64.tar.gz -C /usr/local/bin \
  && rm -f crictl-v1.21.0-linux-amd64.tar.gz

Build the image and start a container:

docker build -t foo .
docker run -ti --privileged --rm foo

Inside the container, I'm:

  • starting containerd
  • starting toxiproxy
  • configuring toxiproxy to close the TCP connection after 1 MB of data transfer
  • reconfiguring /etc/hosts to pull the image through the TCP proxy (via storage.googleapis.com)
containerd &
toxiproxy-server &

IP=$(dig +short storage.googleapis.com | grep -v googlecode | head -n 1)
toxiproxy-cli create registry -l 127.0.0.1:443 -u "$IP:443"
toxiproxy-cli toxic add registry -t limit_data -a bytes=1048576
echo "127.0.0.1 storage.googleapis.com" >> /etc/hosts

crictl -D pull k8s.gcr.io/etcd:3.4.13-0

I hope this helps 😃

Thank you so much! That helped for me too! For reference, in k3s the folder is: /var/lib/rancher/k3s/agent/containerd/io.containerd.content.v1.content/ingest

Until the fix is released and gets rolled out into distributions, I'll share my workaround. It seems the ctr command line is able to fix broken image cache layers, so you can run something like:

ctr i pull docker.io/rancher/library-busybox:1.32.1 >/dev/null
ctr i pull docker.io/rancher/library-traefik:2.4.8 >/dev/null
ctr i pull docker.io/library/postgres:11 >/dev/null

Hi @fuweid:

I just used the proxy to simulate the issue, which happens somewhere in our network (somebody closing the TCP connection). In our network this seems to happen whenever the layer is big and it takes too long to download from the registry.

So I see 2 possible ways to resolve this:

  • This works as designed: (at least in my case) the issue here gets closed, and it is only solvable by fixing whoever is closing the TCP connection to stop doing that, with no way to work around it
  • Containerd is able to handle this and continue the download (like e.g. wget would be able to); see the sketch after this list
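
As a rough sketch of what the second option could look like (my own illustration, not containerd's implementation; the URL and file path are placeholders): resume an interrupted blob download by requesting only the missing bytes with an HTTP Range request.

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
)

// resume appends the missing bytes of url to path by sending an HTTP Range
// request starting at the size of the partially downloaded file.
func resume(url, path string) error {
	var offset int64
	if fi, err := os.Stat(path); err == nil {
		offset = fi.Size()
	}

	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	if offset > 0 {
		// Ask only for the bytes we are still missing.
		req.Header.Set("Range", fmt.Sprintf("bytes=%d-", offset))
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if offset > 0 && resp.StatusCode != http.StatusPartialContent {
		return fmt.Errorf("server ignored the Range request (status %d)", resp.StatusCode)
	}
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	// Placeholder URL and path; a real pull would use the registry's blob endpoint.
	if err := resume("https://example.com/layer.tar.gz", "/tmp/layer.tar.gz"); err != nil {
		log.Fatal(err)
	}
}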

I'd be fine with both (of course I would be happier with resolving both sides) 😃

Thank you for your help 👍