buildx: Pushing a multi-platform image to ghcr.io results in an endless loop

If you build an image for multiple CPU architectures at the same time and use --push, the upload of the images will often get stuck in an endless loop.

The following line is printed over and over again:

error: failed to copy: failed to do request: Put "https://ghcr.io/v2/reconman/example-buildx-push/blobs/upload/a5521203-2c8d-49d5-bcde-d9ba8500a5b0?digest=sha256%3A1e1235e447358303a2d2975f6078eb4f82db3b64fe1ef840976f6033eac9a19f": write tcp 172.17.0.2:40356->140.82.113.33:443: write: connection reset by peer

I’m able to easily reproduce the issue by building a python-based image with all architectures allowed by the base image: https://github.com/reconman/example-buildx-push

I increased the number of layers by adding some RUN commands because I suspect that more layers increase the chance of failure.

When I replaced --push with an OCI tarball output (type=oci,dest=/tmp/image.tar) and ran the following containerd commands manually, I encountered https://github.com/containerd/containerd/issues/2706, so the two problems may be related.

# Import the OCI tarball into containerd, then push every tag (one per line).
sudo ctr i import --base-name ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }} --digests --all-platforms /tmp/image.tar
while IFS= read -r line; do
  sudo ctr i push --user "${{ github.actor }}:${{ secrets.GITHUB_TOKEN }}" "$line"
done <<< "${{ steps.meta.outputs.tags }}"

Here are the GitHub workflow logs with the BuildKit debug flag enabled: logs_1.zip

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 17 (4 by maintainers)

Most upvoted comments

I’m observing those hangs myself. They are random, and restarting the build again and again makes it work eventually. Using v0.9.1 as suggested seems to have fixed it, but that might just have been a fluke.

I’m not even doing anything multi-arch related.
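
Since restarting eventually works, one way to avoid babysitting the job is to bound each attempt with a time limit and retry automatically. The following is only a minimal sketch of that idea, not a fix for the underlying problem: the helper name retry_with_timeout and its arguments are hypothetical, and it assumes the timeout utility from GNU coreutils is available on the runner.

```shell
# Hypothetical helper (not part of buildx or containerd):
# run a command with a per-attempt time limit and retry it on failure.
# Usage: retry_with_timeout <max_attempts> <timeout_seconds> <command> [args...]
retry_with_timeout() {
  max_attempts=$1
  limit=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # timeout(1) terminates the command if it runs longer than $limit seconds,
    # so an attempt stuck in a loop ends instead of blocking the job forever.
    if timeout "$limit" "$@"; then
      return 0
    fi
    echo "attempt ${attempt}/${max_attempts} failed or timed out" >&2
    attempt=$((attempt + 1))
  done
  return 1
}
```

It could be used as, for example, `retry_with_timeout 3 1800 docker buildx build --platform linux/amd64,linux/arm64 --push -t ghcr.io/OWNER/IMAGE:TAG .`, where the attempt count, the 1800-second limit, and the image name are placeholders. Each retry reruns the whole command, so how much time it saves depends on how much of the build is served from cache.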

Suggest to try again

Right now it’s a game of luck. I spent a few hours retrying my workflow for one of my repos where I was building 2 Docker images like this and each Docker build job takes 20 minutes.

With an estimated 50 % success rate for each Dockerfile, the chance of both succeeding was 25 %.

Each time, I had to first wait 20 minutes for the build to finish and then check whether the job was stuck. If it was, I had to cancel the workflow and start the 20-minute build again.

The probability of failure increases with the number of buildx jobs in the workflow. If you copy the build job in the example a couple of times, the workflow success rate will drop to below 10 %. You can’t retry jobs afaik, only workflows. A workaround for this would be to create different workflows for each Dockerfile, but that’s not an optimal solution.
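
The arithmetic behind these numbers can be written out as follows. This assumes that the build jobs fail independently of each other and that each succeeds with the same probability p; both are simplifying assumptions, and p = 0.5 is only the estimate given above, not a measured value.

```latex
% Probability that all n independent build jobs in a workflow succeed,
% given a per-job success probability p (assumed equal for every job).
P_{\text{workflow}}(n) = p^{\,n}

% With the estimated p = 0.5:
P_{\text{workflow}}(2) = 0.5^{2} = 0.25     % two jobs: 25 %
P_{\text{workflow}}(3) = 0.5^{3} = 0.125    % three jobs: 12.5 %
P_{\text{workflow}}(4) = 0.5^{4} = 0.0625   % four jobs: 6.25 %, below 10 %
```

Under these assumptions, the workflow success rate falls below 10 % once the workflow contains four such jobs.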