buildx: Pushing a multi-platform image to ghcr.io results in an endless loop

If you build an image for multiple CPU architectures at the same time and use --push, the upload of the images will often get stuck in an endless loop.

The following line is printed over and over again:

error: failed to copy: failed to do request: Put "https://ghcr.io/v2/reconman/example-buildx-push/blobs/upload/a5521203-2c8d-49d5-bcde-d9ba8500a5b0?digest=sha256%3A1e1235e447358303a2d2975f6078eb4f82db3b64fe1ef840976f6033eac9a19f": write tcp 172.17.0.2:40356->140.82.113.33:443: write: connection reset by peer

I’m able to easily reproduce the issue by building a python-based image with all architectures allowed by the base image: https://github.com/reconman/example-buildx-push

I increased the number of layers by adding some RUN commands because I suspect that more layers increase the chance of failure.

When I changed --push to type=oci,dest=/tmp/image.tar and ran the following containerd commands manually, I encountered https://github.com/containerd/containerd/issues/2706, so it may be related to that?

# Import the OCI archive into containerd, then push each tag individually.
sudo ctr i import --base-name "${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}" --digests --all-platforms /tmp/image.tar
while IFS= read -r line; do
  sudo ctr i push --user "${{ github.actor }}:${{ secrets.GITHUB_TOKEN }}" "$line"
done <<< "${{ steps.meta.outputs.tags }}"
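
For reference, the switch from --push to an OCI archive described above could look roughly like this on the command line. It is a sketch, not the reporter's actual workflow step: the function name export_oci_archive, the platform list, and the build context are assumptions, while type=oci,dest=/tmp/image.tar is taken from the report.

```shell
# Hypothetical helper: build for several platforms and write the result to
# an OCI archive instead of pushing it to a registry.
#   $1 = comma-separated platform list (assumed values, adjust as needed)
#   $2 = build context directory (defaults to the current directory)
export_oci_archive() {
  local platforms="$1" context="${2:-.}"
  docker buildx build \
    --platform "$platforms" \
    --output type=oci,dest=/tmp/image.tar \
    "$context"
}
```

A call such as `export_oci_archive linux/amd64,linux/arm64 .` would leave the archive at /tmp/image.tar, ready for the ctr import step.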

Here are the GitHub workflow logs with the BuildKit debug flag enabled: logs_1.zip

About this issue

State: closed
Created 3 years ago
Reactions: 2
Comments: 17 (4 by maintainers)

Commits related to this issue

Pin to buildkit 0.9.1 in an attempt to mitigate problems https://github.com/docker/buildx/issues/834#issuecomment-965730742 — committed to rust-lang/rust-playground by shepmaster 3 years ago
📌 CI: pin to buildkit 0.9.1 🛺 in an attempt to mitigate connection problems moby/buildkit#2453 docker/buildx#834 Signed-off-by: nanake <nanake@users.noreply.github.com> — committed to nanake/ffmpeg-tinderbox by nanake 3 years ago
fix: endless loop in build-image workflow by meantime solution: https://github.com/docker/buildx/issues/834#issuecomment-965730742 — committed to sksat/papermc-docker by sksat 3 years ago
Pin to buildkit 0.9.1 https://github.com/docker/build-push-action/issues/498#issuecomment-967773178 https://github.com/docker/buildx/issues/834 https://github.com/moby/buildkit/pull/2461 Looks like ... — committed to ThePalaceProject/circulation by jonathangreen 3 years ago
Workaround for docker buildx push issues https://github.com/docker/buildx/issues/834#issuecomment-965730742 — committed to mobiledgex/go-swagger by venkytv 2 years ago
Workaround for docker buildx push issues (#3) https://github.com/docker/buildx/issues/834#issuecomment-965730742 — committed to mobiledgex/go-swagger by venkytv 2 years ago
Workaround for intermittent docker push issues See https://github.com/docker/buildx/issues/834#issuecomment-965730742 for more details. — committed to mobiledgex/edge-cloud-monorepo by venkytv 2 years ago
Use suggested fix in from docker issue #834 https://github.com/docker/buildx/issues/834#issuecomment-965730742 — committed to brightbox/container-registry-write-test by johnl 2 years ago
Configure buildx to use older buildkit Rolling back to previous buildkit version to see if this fixes the issue. See: - https://github.com/docker/buildx/issues/834#issuecomment-965730742 — committed to felddy/foundryvtt-docker by felddy 2 years ago

Most upvoted comments

I’m observing those hangs myself. They are random, and restarting the build again and again eventually makes it work. Using v0.9.1 as suggested seems to have fixed it, but that might have just been a fluke.
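
One way to pin BuildKit to v0.9.1, sketched here under assumptions: the function name pin_buildkit and the builder name pinned-buildkit are made up for illustration, and the thread does not confirm that pinning reliably avoids the hang.

```shell
# Hypothetical helper: create a buildx builder pinned to a specific BuildKit
# image and make it the active builder.
#   $1 = BuildKit image tag (defaults to v0.9.1, the version named in the thread)
pin_buildkit() {
  local version="${1:-v0.9.1}"
  docker buildx create \
    --name pinned-buildkit \
    --driver docker-container \
    --driver-opt "image=moby/buildkit:${version}" \
    --use
}
```

Calling `pin_buildkit` before the build step makes later `docker buildx build` invocations use that builder.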

I’m not even doing anything multi-arch related.

BtbN on Nov 14, 2021

Suggest to try again

Right now it’s a game of luck. I spent a few hours retrying my workflow for one of my repos where I was building two Docker images like this, with each Docker build job taking 20 minutes.

With an estimated 50 % success rate for each Dockerfile, the chance of both succeeding was 25 %.

Each time, I first had to wait 20 minutes for the build to finish and then check whether the job was stuck. If it was stuck, I had to cancel the workflow and start the 20-minute build again.
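
The manual wait, cancel, and restart cycle above could in principle be automated inside a single job step. The following is only a sketch: retry_with_timeout is a hypothetical helper, it relies on timeout from GNU coreutils, and the thread does not confirm that killing the buildx client cleanly stops a stuck push inside the builder.

```shell
# Hypothetical helper: run a command with a time limit and retry it when it
# fails or exceeds the limit.
#   $1 = maximum number of attempts
#   $2 = time limit per attempt, in any format timeout(1) accepts (e.g. 30m)
#   remaining arguments = the command to run (must be an external command,
#   because timeout cannot run shell functions)
retry_with_timeout() {
  local max_attempts="$1" limit="$2"
  shift 2
  local attempt
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    # timeout terminates the command once the limit is reached, which
    # shows up here as a non-zero exit status like any other failure.
    if timeout "$limit" "$@"; then
      return 0
    fi
    echo "attempt ${attempt}/${max_attempts} failed or timed out" >&2
  done
  return 1
}
```

It could then wrap the push, for example `retry_with_timeout 3 30m docker buildx build --push ...`, with the remaining build arguments filled in for the repository at hand.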

The probability of failure increases with the number of buildx jobs in the workflow. If you copy the build job in the example a couple of times, the workflow success rate will drop to below 10 %. You can’t retry jobs afaik, only workflows. A workaround for this would be to create different workflows for each Dockerfile, but that’s not an optimal solution.
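
The arithmetic behind these estimates can be checked with a small helper. This assumes every job succeeds independently with the same probability, which is a simplification, and workflow_success_rate is a hypothetical name.

```shell
# Hypothetical helper: print the chance (in percent) that all N jobs succeed
# when each one succeeds independently with probability P.
#   $1 = per-job success probability, e.g. 0.5
#   $2 = number of jobs in the workflow
workflow_success_rate() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 100 * p ^ n }'
}

workflow_success_rate 0.5 2   # prints 25.00
workflow_success_rate 0.5 3   # prints 12.50
workflow_success_rate 0.5 4   # prints 6.25
```

Under the 50 % estimate, two jobs give the 25 % mentioned above, and four such jobs would be enough to push the workflow success rate below 10 %.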

reconman on Nov 10, 2021