buildx: Pushing a multi-platform image to ghcr.io results in an endless loop
If you build an image for multiple CPU architectures at the same time and use --push, the upload of the images often gets stuck in an endless loop.
The following line is printed over and over again:
error: failed to copy: failed to do request: Put "https://ghcr.io/v2/reconman/example-buildx-push/blobs/upload/a5521203-2c8d-49d5-bcde-d9ba8500a5b0?digest=sha256%3A1e1235e447358303a2d2975f6078eb4f82db3b64fe1ef840976f6033eac9a19f": write tcp 172.17.0.2:40356->140.82.113.33:443: write: connection reset by peer
I can easily reproduce the issue by building a Python-based image with all architectures allowed by the base image: https://github.com/reconman/example-buildx-push
I increased the number of layers by adding some RUN commands, because I suspect that more layers increase the chance of failure.
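For reference, a minimal sketch of the kind of build invocation that triggers the hang for me; the platform list and tag here are illustrative, not the exact values from the repro repo:

```shell
# Create a docker-container builder and build/push for several architectures at once.
docker buildx create --name repro --driver docker-container --use

docker buildx build \
  --platform linux/amd64,linux/arm64,linux/arm/v7,linux/386,linux/ppc64le,linux/s390x \
  --tag ghcr.io/reconman/example-buildx-push:latest \
  --push \
  .
# The push phase then loops forever, printing the "connection reset by peer" error shown above.
```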
When I changed --push to type=oci,dest=/tmp/image.tar and ran the following containerd commands manually, I encountered https://github.com/containerd/containerd/issues/2706, so it may be related to that?
sudo ctr i import --base-name ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }} --digests --all-platforms /tmp/image.tar
while IFS= read -r line; do
  sudo ctr i push --user "${{ github.actor }}:${{ secrets.GITHUB_TOKEN }}" "$line"
done <<< "${{ steps.meta.outputs.tags }}"
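For context, the archive above comes from the same build with --push swapped for the OCI output; a rough sketch (platform list illustrative, and ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }} resolves to the ghcr.io image name):

```shell
# Same build as before, but write an OCI archive instead of pushing directly.
docker buildx build \
  --platform linux/amd64,linux/arm64,linux/arm/v7 \
  --tag "${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest" \
  --output type=oci,dest=/tmp/image.tar \
  .
# /tmp/image.tar is then imported and pushed with the ctr commands above,
# which is where the linked containerd issue shows up.
```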
Here are the GitHub workflow logs with the BuildKit debug flag enabled: logs_1.zip
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 17 (4 by maintainers)
Commits related to this issue
- Pin to buildkit 0.9.1 in an attempt to mitigate problems https://github.com/docker/buildx/issues/834#issuecomment-965730742 — committed to rust-lang/rust-playground by shepmaster 3 years ago
- 📌 CI: pin to buildkit 0.9.1 🛺 in an attempt to mitigate connection problems moby/buildkit#2453 docker/buildx#834 Signed-off-by: nanake <nanake@users.noreply.github.com> — committed to nanake/ffmpeg-tinderbox by nanake 3 years ago
- fix: endless loop in build-image workflow by meantime solution: https://github.com/docker/buildx/issues/834#issuecomment-965730742 — committed to sksat/papermc-docker by sksat 3 years ago
- Pin to buildkit 0.9.1 https://github.com/docker/build-push-action/issues/498#issuecomment-967773178 https://github.com/docker/buildx/issues/834 https://github.com/moby/buildkit/pull/2461 Looks like ... — committed to ThePalaceProject/circulation by jonathangreen 3 years ago
- Workaround for docker buildx push issues https://github.com/docker/buildx/issues/834#issuecomment-965730742 — committed to mobiledgex/go-swagger by venkytv 2 years ago
- Workaround for docker buildx push issues (#3) https://github.com/docker/buildx/issues/834#issuecomment-965730742 — committed to mobiledgex/go-swagger by venkytv 2 years ago
- Workaround for intermittent docker push issues See https://github.com/docker/buildx/issues/834#issuecomment-965730742 for more details. — committed to mobiledgex/edge-cloud-monorepo by venkytv 2 years ago
- Use suggested fix in from docker issue #834 https://github.com/docker/buildx/issues/834#issuecomment-965730742 — committed to brightbox/container-registry-write-test by johnl 2 years ago
- Configure buildx to use older buildkit Rolling back to previous buildkit version to see if this fixes the issue. See: - https://github.com/docker/buildx/issues/834#issuecomment-965730742 — committed to felddy/foundryvtt-docker by felddy 2 years ago
I’m observing these hangs myself; they are random, and restarting the build again and again eventually makes it work. Using BuildKit v0.9.1 as suggested seems to have fixed it, but that might have just been a fluke.
I’m not even doing anything multi-arch related.
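For anyone landing here: the pin referenced in the commits above amounts to forcing the docker-container driver to use the moby/buildkit:v0.9.1 image instead of the latest one. A minimal sketch of the CLI form (builder name and image tag are placeholders; in a GitHub workflow, docker/setup-buildx-action accepts the same image= option through its driver-opts input, if I remember correctly):

```shell
# Workaround sketch: pin the BuildKit image used by the buildx builder to v0.9.1.
docker buildx create \
  --name pinned-builder \
  --driver docker-container \
  --driver-opt image=moby/buildkit:v0.9.1 \
  --use

# Subsequent builds and pushes then run against the pinned BuildKit version.
docker buildx build --platform linux/amd64,linux/arm64 --push -t ghcr.io/OWNER/IMAGE:latest .
```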
Right now it’s a game of luck. I spent a few hours retrying my workflow for one of my repos, where I build two Docker images like this and each Docker build job takes 20 minutes.
With an estimated 50 % success rate for each Dockerfile, the chance of both succeeding was only 25 %.
Each time, I had to wait 20 minutes for the build to finish and then check whether the job was stuck. If it was, I had to cancel the workflow and start the 20-minute build again.
The probability of failure increases with the number of buildx jobs in the workflow. If you copy the build job in the example a couple of times, the workflow success rate drops below 10 % (with four independent jobs at roughly 50 % each, the whole workflow succeeds only 0.5^4 ≈ 6 % of the time). As far as I know you can’t retry individual jobs, only whole workflows. A workaround would be to create a separate workflow for each Dockerfile, but that’s not an optimal solution.