test-infra: E2E test images: httpd images failed to push to staging

What happened:

The Image Builder postsubmit jobs post-kubernetes-push-e2e-httpd-test-images and post-kubernetes-push-e2e-new-httpd-test-images are failing with a 401 Unauthorized error while trying to push to gcr.io/k8s-staging-e2e-test-images.

What you expected to happen:

It should have been able to push the images.

How to reproduce it (as minimally and precisely as possible):

Rerun the jobs.

Please provide links to example occurrences, if any:

[1] https://testgrid.k8s.io/sig-testing-images#post-kubernetes-push-e2e-httpd-test-images
[2] https://testgrid.k8s.io/sig-testing-images#post-kubernetes-push-e2e-httpd-new-test-images
[3] https://testgrid.k8s.io/sig-testing-images#kubernetes-e2e-windows-servercore-cache

Anything else we need to know?:

Worth noting that the job passed on 2021.02.09 but failed on 2021.02.15. The prow job config is fine; rerunning the k8s-staging-e2e-test-images.sh script that generates the job config reveals no diff.

Additionally, on 2021.02.11 the kubernetes-e2e-windows-servercore-cache job passed [3], a job which is defined similarly to the other two jobs.
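For reference, that config-drift check amounts to regenerating the job file and diffing it against what is checked in; a minimal sketch, assuming the generator script lives under config/jobs/image-pushing/ in test-infra (the exact path is an assumption):

```sh
# Hedged sketch: regenerate the image-pushing job config and check for drift.
# The script location within test-infra is an assumption here.
./config/jobs/image-pushing/k8s-staging-e2e-test-images.sh
git diff --exit-code -- config/jobs/image-pushing/
```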

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 46 (46 by maintainers)

Most upvoted comments

It seems that it was successful for the following images:

* glusterdynamic-provisioner: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-kubernetes-push-e2e-glusterdynamic-provisioner-test-images/1366557273958125568
* httpd: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-kubernetes-push-e2e-httpd-test-images/1366557274016845824
* nginx: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-kubernetes-push-e2e-nginx-test-images/1366557274096537600
* nginx-new: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-kubernetes-push-e2e-nginx-test-images/1366557274096537600
* perl: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-kubernetes-push-e2e-perl-test-images/1366557274436276224

It seems that it failed for:

* busybox: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-kubernetes-push-e2e-busybox-test-images/1366557273911988224
* httpd-new: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-kubernetes-push-e2e-httpd-test-images/1366557274016845824

It seems that for the httpd-new image, it generated the same SHA:

 #5 exporting to image
#5 exporting layers done
#5 exporting manifest sha256:00c98fa2a69a9f208dd7238549587df778f584ea4d56a0deb977c4ae4b1f6352 0.0s done
#5 exporting config sha256:6b84c32240b147c355e6400fc17e7bb7111cf6687ced6e75a3ad3ff1b2aaf8dc done
#5 pushing layers 0.0s done
#5 pushing manifest for gcr.io/k8s-staging-e2e-test-images/httpd:2.4.39-1-linux-amd64
#5 pushing manifest for gcr.io/k8s-staging-e2e-test-images/httpd:2.4.39-1-linux-amd64 0.2s done
#5 ERROR: failed commit on ref "manifest-sha256:00c98fa2a69a9f208dd7238549587df778f584ea4d56a0deb977c4ae4b1f6352": unexpected status: 401 Unauthorized 

This sha already exists:

docker pull gcr.io/k8s-staging-e2e-test-images/httpd:2.4.39-alpine-linux-amd64
2.4.39-alpine-linux-amd64: Pulling from k8s-staging-e2e-test-images/httpd
050382585609: Pull complete
231e07644ff6: Pull complete
1eeb2e94cd0a: Pull complete
8cc6fadd0bd1: Pull complete
5f4e88dd5259: Pull complete
Digest: sha256:00c98fa2a69a9f208dd7238549587df778f584ea4d56a0deb977c4ae4b1f6352
Status: Downloaded newer image for gcr.io/k8s-staging-e2e-test-images/httpd:2.4.39-alpine-linux-amd64
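
To double-check which tags in the staging repo already resolve to that digest, plain gcloud/docker commands (not part of the job itself) are enough; a quick sketch:

```sh
# List tags and digests in the staging repo and look for the digest that
# the failed push tried to commit.
gcloud container images list-tags gcr.io/k8s-staging-e2e-test-images/httpd \
  --format="get(digest, tags)" | grep 00c98fa2

# Or inspect the manifest directly without pulling any layers
# (may require the docker CLI's experimental features to be enabled).
docker manifest inspect gcr.io/k8s-staging-e2e-test-images/httpd:2.4.39-alpine-linux-amd64
```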

Somehow, I missed adding the label to the Dockerfile… oops! Sent a PR here: https://github.com/kubernetes/kubernetes/pull/99631

The busybox image failed on the Windows images; I’ll take a look at that. But the idea worked. 😃

I have been looking into it for a bit. So, --output=type=registry is the only viable solution for us, considering we also want the Windows images there. It is a known fact that you simply cannot have Windows images in a “normal docker” on a Linux host:

> Currently, multi-platform images cannot be exported with the docker export type. The most common usecase for multi-platform images is to directly push to a registry (see registry).

From my experiments, with docker buildx build --output=type=docker the Linux images can actually be referenced locally through docker images and so on, but not the Windows images. Trying to build the Windows images with docker buildx and output type=docker, we can see that it actually tries to import the result into docker:

...

#7 [stage-2 2/2] COPY --from=nginx-source /nginx /usr/share/nginx
#7 sha256:976b1d0a846ad8f0a65631c26f8a46b2cf6e0ab117a8375e4ed5ee28a25d5e98
#7 DONE 0.1s

#8 exporting to oci image format
#8 sha256:69a2560eef4d3ece902c3b5149d142e9bd132f25db0bc9e35b94201534c415d2
#8 exporting layers
#8 exporting layers 0.5s done
#8 exporting manifest sha256:a175d01afdf166f7e46d7bc9d476ec43a4168426ff44fc2f3d4b8b6868f78756 0.0s done
#8 exporting config sha256:c106fccad57d1974e08505450c0203aac62d6d081de621f2f5b84534d0db8da6 done
#8 sending tarball
#8 sending tarball 4.7s done
#8 DONE 5.2s

#9 importing to docker
#9 sha256:8026238f638c17bcfbf7d41b163286c4f4f6ef72fb77b67cec9833d16ae7a9fb
#9 DONE 0.0s
/workspace/kubernetes/test/images

There’s no error printed, but the image doesn’t end up in docker images. I’ve also tried to build the Windows images with output type=oci, which generated a .tar file. You can actually import it locally with docker import image.tar, but inspecting the imported image [1], you can see that it sets the os/arch type to linux/amd64, which is not quite right. In addition, the User and env variables (including the PATH) are stripped away, which is problematic. Trying to import the image with --platform "windows/amd64", we see:

docker import --platform "windows/amd64" claudiubelu-nginx-1.14-1-windows-amd64-20H2.tar claudiubelu/nginx:1.14-1-windows-amd64-20h2
Error response from daemon: operating system is not supported

Which confirms that docker buildx build --output=type=docker probably encountered the same issue. Furthermore, having Windows images on a Linux node has been a pretty frequent question. I think this covers the reason nicely [2].

So, I still think --output=type=registry is the way to go. --output=type=oci or tar could work, but then we’d have to push them to the registry ourselves, which is supposed to be docker’s / buildx’s purpose in the first place.
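To make the trade-off concrete, these are roughly the three buildx output modes being compared; an illustrative sketch only, the real invocations live in the test/images Makefile and differ in detail:

```sh
# 1) Load into the local docker daemon: works for Linux images only;
#    Windows images never show up in `docker images`.
docker buildx build --platform linux/amd64 \
  --output=type=docker -t local/nginx:1.14-1-linux-amd64 .

# 2) Export an OCI tarball: builds Windows images, but importing the tar
#    loses the platform, User and ENV metadata as described above.
docker buildx build --platform windows/amd64 \
  --output=type=oci,dest=nginx-1.14-1-windows-amd64.tar .

# 3) Push straight to the registry: the only mode that handles both the
#    Linux and the Windows images, hence --output=type=registry in the jobs.
docker buildx build --platform windows/amd64 \
  --output=type=registry -t gcr.io/k8s-staging-e2e-test-images/nginx:1.14-1-windows-amd64 .
```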

[1] https://paste.ubuntu.com/p/m2tmX3yGMW/
[2] https://forums.docker.com/t/docker-daemon-on-ubuntu-pull-windows-containers-or-create-my-own/28823/6

docker/buildx#327 looks like what we’re experiencing

* Cause is present in 1.4.3 (which is what `docker version` is dumping for this job) ([containerd/containerd#4622](https://github.com/containerd/containerd/issues/4622))

* Fix landed for containerd master 2020-12-16 ([containerd/containerd#4854](https://github.com/containerd/containerd/pull/4854))

* Was cherry-picked back to release/1.4 2021-01-14 ([containerd/containerd#4942](https://github.com/containerd/containerd/pull/4942))

* PR to pull into buildkit still in draft ([moby/buildkit#1921](https://github.com/moby/buildkit/pull/1921))

This seems to have affected other registries such as quay.io, ghcr.io, etc. It looks like Google Artifact Registry may be working, but migrating to that may be non-trivial (ref: kubernetes/k8s.io#1343).
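
For anyone checking whether a given builder is affected, the relevant versions show up in the standard version output (the bug is in the containerd pusher code vendored into buildkit, per the links above):

```sh
# The Server section of `docker version` lists the containerd component
# version; buildx prints the bundled buildx/buildkit version.
docker version
docker buildx version
```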

Oh. That might be it. It might be because the exact same hash is being pushed.

Using the source file from the “good” build (gs://k8s-staging-e2e-test-images-gcb/source/1612863662.49-29f09d2c41c5417f952f03585182c7aa.tgz)

Manually submitting with no changes works

Changing to push nginx with a new version

vi test/images/nginx/VERSION
# submit with _WHAT=nginx

yields a similar error:

#5 exporting to image
#5 exporting layers done
#5 exporting manifest sha256:a2d0ea7d3550b0853d04263025e6cfcc353f3e102fe725d19b9fc51282603f02 0.0s done
#5 exporting config sha256:f4ac389d78cd8be151962a9fc7227b3b23862d9040eeb0686158a3229da60022 0.0s done
#5 pushing layers 0.1s done
#5 pushing manifest for gcr.io/k8s-staging-e2e-test-images/nginx:1.14-monkeys-linux-amd64
#5 pushing manifest for gcr.io/k8s-staging-e2e-test-images/nginx:1.14-monkeys-linux-amd64 0.3s done
#5 ERROR: failed commit on ref "manifest-sha256:a2d0ea7d3550b0853d04263025e6cfcc353f3e102fe725d19b9fc51282603f02": unexpected status: 401 Unauthorized
------
 > exporting to image:
------
failed to solve: rpc error: code = Unknown desc = failed commit on ref "manifest-sha256:a2d0ea7d3550b0853d04263025e6cfcc353f3e102fe725d19b9fc51282603f02": unexpected status: 401 Unauthorized
make: *** [Makefile:43: all-build-and-push] Error 1

so… new tag, unchanged manifest = error… is this working as intended, or should we be allowing this?
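
For reference, the manual resubmission described above amounts to roughly the following; the cloudbuild config path and submitting the local tree are assumptions, only the _WHAT substitution and the source tarball are taken from the job:

```sh
# Hedged sketch of the manual repro: bump the nginx VERSION locally, then
# resubmit the image build with only the nginx image selected.
# (The actual test reused the known-good source tarball from
#  gs://k8s-staging-e2e-test-images-gcb/source/; submitting the local tree
#  with this config path is an assumption.)
vi test/images/nginx/VERSION
gcloud builds submit \
  --config=test/images/cloudbuild.yaml \
  --substitutions=_WHAT=nginx \
  .
```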

Right, gcr.io/k8s-staging-e2e-test-images/nginx:1.14-monkeys-linux-amd64 and gcr.io/k8s-staging-e2e-test-images/nginx:1.14-alpine-linux-amd64 have the same sha, which is identical to nginx:1.14-alpine since the image was mirrored. I had the same sha on my own registry as well. So, it should work if we generate a new sha. We could make a trivial change in the Dockerfile:

diff --git a/test/images/nginx/Dockerfile b/test/images/nginx/Dockerfile
index 3983b7c4f24..9a0c8ffb425 100644
--- a/test/images/nginx/Dockerfile
+++ b/test/images/nginx/Dockerfile
@@ -15,3 +15,5 @@
 # NOTE(claudiub): Noop. We're just mirroring the image to staging.
 ARG BASEIMAGE
 FROM $BASEIMAGE
+
+LABEL image_version="1.14-1"

Building this, I then get the sha:

docker pull claudiubelu/nginx:1.14-1-linux-amd64
1.14-1-linux-amd64: Pulling from claudiubelu/nginx
bdf0201b3a05: Already exists
3d0a573c81ed: Already exists
8129faeb2eb6: Already exists
3dc99f571daf: Already exists
Digest: sha256:ebf4de42b3d660133f6f7d0feddabe31a44d07ed55f59471fd2072b0d8e8afae

Which is now different from the previous a2d0ea7d3550b0853d04263025e6cfcc353f3e102fe725d19b9fc51282603f02. Being a different sha, it should be pushable. If we look at the prow job history, we’d see that the httpd and nginx image jobs worked exactly once: when the dockerhub images were mirrored for the first time. This doesn’t affect pushing images to dockerhub; otherwise, I would have encountered this issue before as well.
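One way to double-check that the label actually changes the digest that gets pushed (plain buildx commands, not part of the job):

```sh
# Compare the digest of the mirrored tag in staging with the freshly
# rebuilt one; they should now differ, so the push no longer 401s.
docker buildx imagetools inspect gcr.io/k8s-staging-e2e-test-images/nginx:1.14-alpine-linux-amd64
docker buildx imagetools inspect docker.io/claudiubelu/nginx:1.14-1-linux-amd64
```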

IMO, we could go ahead with this fix, making a note in the README.md too, and push for the upstream fix to be merged into docker / buildkit.