buildkit: Docker BuildKit caching w/ --cache-from fails every second time, except when using `docker-container`

Similar to https://github.com/moby/buildkit/issues/1981, but it’s still happening with 20.10.7, and I have a minimal reproduction case.

Version information

  • Macbook Air (M1, 2020)
  • Mac OS Big Sur 11.4
  • Docker Desktop 3.5.2 (66501)
% docker version
Client:
 Cloud integration: 1.0.17
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.16.4
 Git commit:        f0df350
 Built:             Wed Jun  2 11:56:23 2021
 OS/Arch:           darwin/arm64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       b0f5bc3
  Built:            Wed Jun  2 11:55:36 2021
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.4.6
  GitCommit:        d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc:
  Version:          1.0.0-rc95
  GitCommit:        b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Steps to reproduce

Have this Dockerfile:

# syntax=docker/dockerfile:1
FROM debian:buster-slim
RUN yes | head -20 | tee /yes.txt
COPY . /app

Run this script:

#!/bin/bash
set -euo pipefail
export DOCKER_BUILDKIT=1
docker system prune -a -f
docker build \
    -t circularly/docker-cache-issue-20210722:cachebug \
    --cache-from circularly/docker-cache-issue-20210722:cachebug \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    .
docker push circularly/docker-cache-issue-20210722:cachebug
# this causes a change in the local files to simulate a code-only change
date > date_log.txt

(also here: https://github.com/jli/docker-cache-issue-20210722 )

What I see: When I run the above script repeatedly, whether the RUN yes | head -20 | tee /yes.txt step is cached alternates on every run (a small harness to observe this is sketched after the list below). The docker build output alternates between:

  • => [2/3] RUN yes | head -20 | tee /yes.txt
  • => CACHED [2/3] RUN yes | head -20 | tee /yes.txt
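To make the alternation easy to see, here is a minimal sketch, assuming the script above is saved as repro.sh and is executable, and that you have push access to the tag it uses (or have pointed it at a registry you control):

# Run the repro a few times and count the CACHED lines in each run's output.
# Per the behaviour described above, the count should alternate between runs.
for i in 1 2 3 4; do
    ./repro.sh 2>&1 | grep -c 'CACHED'
done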

With docker-container driver

This comment by @tonistiigi suggested using the “container driver”. This does seem to work! I tried replacing the docker build command above with this:

docker buildx create --driver docker-container --name cache-bug-workaround
docker buildx build --builder cache-bug-workaround --load \
    -t circularly/docker-cache-issue-20210722:cachebug-containerdriver \
    --cache-from circularly/docker-cache-issue-20210722:cachebug-containerdriver \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    .
docker buildx rm --builder cache-bug-workaround

This consistently results in the RUN yes ... step being cached!

The problem is that docker buildx doesn’t appear to be a subcommand in the https://hub.docker.com/_/docker image, which is what we use in CI. Is there a way to use the container driver when using that image?

Could you help me understand why this is needed? Will this be fixed with a future release?

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 31
  • Comments: 15 (1 by maintainers)

Most upvoted comments

Any update on this? This seems like a major issue, and the alternative of using docker-container is untenable due to those issues noted above.

I opened #1981, and I can confirm that my reproducible example also still does not work.

Based on my limited testing, using docker pull <version> for every image used in --cache-from arguments will suppress this bug.

This was noted as a workaround in #1981, but may not always work, based on the comment above. We’re using Bitbucket Pipelines (regular runner, not the self-hosted ones), which means no access to buildx, limited Docker updates, and x86-only builds - any one of which might affect the viability of this workaround.

As a side note, docker pull <tag> || true can be used in pipeline steps where you’re not sure if the image exists.
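Putting those two suggestions together, a rough sketch of the pre-pull workaround, reusing the image and tag from the repro above:

# Pre-pull every image referenced in --cache-from; tolerate a missing tag.
docker pull circularly/docker-cache-issue-20210722:cachebug || true
# Then build exactly as in the repro script.
DOCKER_BUILDKIT=1 docker build \
    -t circularly/docker-cache-issue-20210722:cachebug \
    --cache-from circularly/docker-cache-issue-20210722:cachebug \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    .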

Same issue here.

Same problem when building from inside of docker:20.10.8-dind

Our team has recently been hitting this same issue in GitHub Actions, since the latest runner image updated Docker to v23+, which uses BuildKit as the default build engine.

Our original cache flow is:

  • pull same commit sha tag || pull latest tag
  • build with --cache-from <same commit sha> --cache-from <latest>
  • tag the new image as <commit sha> & latest
  • push both tags
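A rough shell sketch of that flow ($IMAGE and $SHA are placeholders for the repository and commit, and BUILDKIT_INLINE_CACHE is assumed to be set as in the repro above):

# $IMAGE is the repository, $SHA is the commit being built (both hypothetical).
docker pull "$IMAGE:$SHA" || docker pull "$IMAGE:latest" || true
docker build \
    -t "$IMAGE:$SHA" \
    --cache-from "$IMAGE:$SHA" \
    --cache-from "$IMAGE:latest" \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    .
docker tag "$IMAGE:$SHA" "$IMAGE:latest"
docker push "$IMAGE:$SHA"
docker push "$IMAGE:latest"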

With this flow, we hit the exact same issue: --cache-from fails every second time.


We tried pulling all image tags beforehand, but it did not help. Based on my observations, it seems that:

  • If an image is built from scratch, then this image CAN be used as cache
  • Otherwise if an image is built using a cache image, then it CANNOT be used as cache
  • This matches the strange caching behaviour, because we always push the newly built image as latest and then build from it the next time.

So our current workaround is to add a specific CI step (sketched after this list):

  • if the trigger branch is deployment/build_cache, it builds the image from scratch and pushes it with the tag <image>:build_cache
  • all other trigger branches build with --cache-from <image>:build_cache instead of the latest tag
  • if the Dockerfile changes (eg. the base image is updated), we push to deployment/build_cache once to update the cache
  • so far, the caching behaviour has been more consistent
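A rough sketch of that CI step ($BRANCH, $IMAGE, and $SHA are hypothetical variables supplied by the CI environment):

# Hypothetical CI step; $BRANCH, $IMAGE, and $SHA come from the CI environment.
if [ "$BRANCH" = "deployment/build_cache" ]; then
    # Build the cache image with no --cache-from ("from scratch") and publish it.
    docker build -t "$IMAGE:build_cache" --build-arg BUILDKIT_INLINE_CACHE=1 .
    docker push "$IMAGE:build_cache"
else
    # Every other branch builds against the from-scratch cache image instead of latest.
    docker pull "$IMAGE:build_cache" || true
    docker build -t "$IMAGE:$SHA" \
        --cache-from "$IMAGE:build_cache" \
        --build-arg BUILDKIT_INLINE_CACHE=1 \
        .
fi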

This ended up being enough of a drag on my team’s productivity that we came up with a workaround that we’ve been using for about a month, and it has been working really well for us so far.

We split out a “base” Docker image which installs all our dependencies, and then we have a “final” Docker image which just copies the code on top of the base image as a final layer.

The important part is that these are distinct images and not just separate layers, which is how we work around the inconsistent layer caching behavior.

Our “final” Dockerfile just looks like:

FROM container-host.com/your-project/your-base-image:latest-version
COPY . /app

Downside: This setup makes it harder to test changes to the base image. Instead of just updating a single Dockerfile and building+pushing, you need to (1) change the “base” Dockerfile/dependencies, (2) build and push the base image to your container host with a new tag for testing, and (3) edit the “final” Dockerfile to reference the new testing tag. I wrote a Python script to do steps 2 and 3, so testing changes to our base image is still pretty streamlined. Note: it would take some more work to fully integrate this with CI so that the base image used in prod is also built in CI; currently, we just use the base images built on local machines when people make changes. This is acceptable to us, but some people may have more stringent requirements.
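For illustration, a rough bash equivalent of steps (2) and (3) (the actual script is in Python; Dockerfile.base and the image name here are placeholders matching the example above):

# Hypothetical helper: build and push the base image under a fresh test tag,
# then regenerate the two-line "final" Dockerfile to point at it.
BASE_IMAGE=container-host.com/your-project/your-base-image
TAG="test-$(date +%Y%m%d-%H%M%S)"
docker build -f Dockerfile.base -t "$BASE_IMAGE:$TAG" .
docker push "$BASE_IMAGE:$TAG"
printf 'FROM %s:%s\nCOPY . /app\n' "$BASE_IMAGE" "$TAG" > Dockerfile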

Overall, this has definitely been worth it for us, especially since our base image is huge (3GB of Python ML dependencies) and takes a long time to build, so cache misses were extremely painful.

  • docker build for code-only changes is guaranteed to only copy the code layer.
  • docker push for the new code-only layers is also guaranteed to be fast (when the cache would break for base layers before, people would have to upload 3GB of data, sometimes over spotty WiFi or while tethering)
  • everyone is guaranteed to share the expensive central base image. New team members or people who’ve pruned their cache just need to download the base image instead of building their own local copy (docker pull never worked for this, in my experience)
  • building our Docker image in CI is guaranteed to be fast, and also much simpler since we no longer have a bunch of verbose --cache-from flags and extra docker push calls to get caching in CI builds. (Though see note above about fully integrating this process in CI)

Two issues I’m noticing with using the docker-container driver to work around the caching issue:

  1. It adds some export/import steps
  2. docker push seems to be pushing all layers?

With the default driver, rebuilds of code-only changes take ~1 minute (when I get proper caching of the expensive layers in my image). With the docker-container driver, these 2 factors mean rebuilds after code-only changes take ~4-5 minutes.

export/import steps

#25 exporting to oci image format
#25 exporting layers done
#25 exporting manifest sha256:01230f6377dec5a6988c924373bb62afe2837d3afa7bb0e84e98a016481c1c81 done
#25 exporting config sha256:4f48d81bc559f074600e3088949591f885d4ef3c74b8d833408864b6bd013df4 done
#25 sending tarball
#25 ...

#26 importing to docker
#26 DONE 32.1s

#25 exporting to oci image format
#25 sending tarball 43.0s done
#25 DONE 43.0s

This seems to add an extra minute to the build. I’m working with large images (~3.5 GB from various scientific Python libraries), which I’m guessing exacerbates this issue.

docker push issue

Pushing my 3.5 GB image takes ~3 minutes.

It seems that with the docker-container driver, docker push can’t tell that the expensive layers are shared with the previously pushed image, so it pushes all the layers instead of only the new ones. I’m guessing this based on the docker push output never saying “Layer already exists”:

6474dc186dfd: Preparing
2d80b2e557e9: Preparing
59149f33a870: Preparing
ed04f21afbe5: Preparing
c9ec67fe6421: Preparing
e42dc4266416: Preparing
a55e5a0e7c4a: Preparing
aef13dfbb6f9: Preparing
1e602bec2da5: Preparing
b1c4e3f331ea: Preparing
3fdf9f44ae06: Preparing
78ce42cd87aa: Preparing
82e21ae59256: Preparing
02c055ef67f5: Preparing
e42dc4266416: Waiting
3fdf9f44ae06: Waiting
a55e5a0e7c4a: Waiting
78ce42cd87aa: Waiting
aef13dfbb6f9: Waiting
1e602bec2da5: Waiting
82e21ae59256: Waiting
b1c4e3f331ea: Waiting
02c055ef67f5: Waiting
ed04f21afbe5: Pushed
59149f33a870: Pushed
c9ec67fe6421: Pushed
2d80b2e557e9: Pushed
aef13dfbb6f9: Pushed
1e602bec2da5: Pushed
b1c4e3f331ea: Pushed
3fdf9f44ae06: Pushed
6474dc186dfd: Pushed
82e21ae59256: Pushed
78ce42cd87aa: Pushed
02c055ef67f5: Pushed
e42dc4266416: Pushed
a55e5a0e7c4a: Pushed

I push several tags. The first push takes 3 minutes, and the rest of the tags finish quickly as they all say “Layer already exists” for all the layers.

Same issue here (Debian Bullseye)

$ docker --version
Docker version 20.10.14, build a224086