buildkit: Docker BuildKit caching w/ --cache-from fails every second time, except when using `docker-container`
Similar to https://github.com/moby/buildkit/issues/1981, but it’s still happening with 20.10.7, and I have a minimal reproduction case.
Version information
- Macbook Air (M1, 2020)
- Mac OS Big Sur 11.4
- Docker Desktop 3.5.2 (66501)
```
% docker version
Client:
 Cloud integration: 1.0.17
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.16.4
 Git commit:        f0df350
 Built:             Wed Jun 2 11:56:23 2021
 OS/Arch:           darwin/arm64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       b0f5bc3
  Built:            Wed Jun 2 11:55:36 2021
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.4.6
  GitCommit:        d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc:
  Version:          1.0.0-rc95
  GitCommit:        b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
```
Steps to reproduce
Have this Dockerfile:
```dockerfile
# syntax=docker/dockerfile:1
FROM debian:buster-slim
RUN yes | head -20 | tee /yes.txt
COPY . /app
```
Run this script:
```bash
#!/bin/bash
set -euo pipefail

export DOCKER_BUILDKIT=1

docker system prune -a -f

docker build \
    -t circularly/docker-cache-issue-20210722:cachebug \
    --cache-from circularly/docker-cache-issue-20210722:cachebug \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    .

docker push circularly/docker-cache-issue-20210722:cachebug

# this causes a change in the local files to simulate a code-only change
date > date_log.txt
```
(also here: https://github.com/jli/docker-cache-issue-20210722 )
What I see: When I run the above script multiple times, it alternates every time whether the `RUN yes | head -20 | tee /yes.txt` step is cached or not. The `docker build` output alternates between:

```
=> [2/3] RUN yes | head -20 | tee /yes.txt
```

and

```
=> CACHED [2/3] RUN yes | head -20 | tee /yes.txt
```
With docker-container driver
This comment by @tonistiigi suggested using the “container driver”. This does seem to work! I tried replacing the `docker build` command from above with this:
```bash
docker buildx create --driver docker-container --name cache-bug-workaround

docker buildx build --builder cache-bug-workaround --load \
    -t circularly/docker-cache-issue-20210722:cachebug-containerdriver \
    --cache-from circularly/docker-cache-issue-20210722:cachebug-containerdriver \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    .

docker buildx rm --builder cache-bug-workaround
```
This consistently results in the RUN yes ... step being cached!
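As an aside, buildx can also request the inline cache via `--cache-to` rather than the legacy build-arg; the following variant should be equivalent (it reuses the same builder and tag as above):

```bash
# Same build as above, but using --cache-to type=inline instead of the
# BUILDKIT_INLINE_CACHE build-arg; the cache metadata still ends up in the
# image that gets loaded and later pushed.
docker buildx build --builder cache-bug-workaround --load \
    -t circularly/docker-cache-issue-20210722:cachebug-containerdriver \
    --cache-from type=registry,ref=circularly/docker-cache-issue-20210722:cachebug-containerdriver \
    --cache-to type=inline \
    .
```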
The problem is that `docker buildx` doesn’t appear to be a subcommand in the https://hub.docker.com/_/docker image, which is what we use in CI. Is there a way to use the container driver when using that image?
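One possible route might be to install the buildx CLI plugin into that image manually; the release version, architecture, and install path in this sketch are assumptions:

```bash
# Untested sketch: download a buildx release binary and register it as a docker
# CLI plugin inside the docker image used by CI. Adjust version/arch as needed.
BUILDX_VERSION=v0.11.2
mkdir -p ~/.docker/cli-plugins
wget -O ~/.docker/cli-plugins/docker-buildx \
    "https://github.com/docker/buildx/releases/download/${BUILDX_VERSION}/buildx-${BUILDX_VERSION}.linux-amd64"
chmod +x ~/.docker/cli-plugins/docker-buildx
docker buildx version   # should now report the plugin version
```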
Could you help me understand why this is needed? Will this be fixed with a future release?
About this issue
- State: open
- Created 3 years ago
- Reactions: 31
- Comments: 15 (1 by maintainers)
Commits related to this issue
- Fix caching a docker build of lotus-test: It turned out that using `DOCKER_BUILDKIT=1` has a problem with caching: https://github.com/moby/buildkit/issues/2274. Using `docker buildx` would fix it, but... — committed to airenas/boost by airenas 2 years ago
- Consolidate makefiles (#785): Consolidate makefiles - Move docker building stuff to the main makefile - Drop internal makefiles - Allow to build lotus from source - Update readme * Fix cach... — committed to filecoin-project/boost by airenas 2 years ago
Any update on this? This seems like a major issue, and the alternative of using `docker-container` is untenable due to those issues noted above.

I opened #1981 and I can confirm that my reproducible example also still does not work.

Based on my limited testing, using `docker pull <version>` for every image used in `--cache-from` arguments will suppress this bug. This was noted as a workaround in #1981, but may not always work, based on the comment above. We’re using Bitbucket Pipelines (regular runner, not the self-hosted ones), which means no access to `buildx`, limited Docker updates, and x86-only builds - any one of which might affect the viability of this workaround. As a side note, `docker pull <tag> || true` can be used in pipeline steps where you’re not sure if the image exists.
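A minimal sketch of that pre-pull step (the image name below is a placeholder):

```bash
# Pre-pull the cache source so --cache-from can resolve it; tolerate a missing
# image on the very first build.
IMAGE=registry.example.com/myapp:cachebug
docker pull "$IMAGE" || true

docker build \
    -t "$IMAGE" \
    --cache-from "$IMAGE" \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    .

docker push "$IMAGE"
```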
Same issue here.

Same problem when building from inside of `docker:20.10.8-dind`.
Our team is facing this same issue recently in GitHub Actions, since its latest runner image updated Docker to v23+, which uses BuildKit as the default build engine.

Our original cache flow is:
- build with `--cache-from <same commit sha> --cache-from <latest>`
- push `<commit sha>` & `latest`

And with this flow we have the exact same issue that `--cache-from` fails every second time. We tried pulling all image tags beforehand, but it did not help. Based on my observations, it seems that …

So our current workaround is to add a specific CI step:
- push `<image>:build_cache`
- use `--cache-from <image>:build_cache` instead of the latest tag
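A rough sketch of that CI step ($IMAGE and $COMMIT_SHA are placeholder variables, not the actual pipeline config):

```bash
# Build against the dedicated cache tag and refresh it on every run.
docker build \
    -t "$IMAGE:$COMMIT_SHA" \
    -t "$IMAGE:build_cache" \
    --cache-from "$IMAGE:build_cache" \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    .

docker push "$IMAGE:$COMMIT_SHA"
docker push "$IMAGE:build_cache"
```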
This ended up being enough of a drag on my team’s productivity that we came up with a workaround that we’ve been using for about a month, and it has been working really well for us so far.
We split out a “base” Docker image which installs all our dependencies, and then we have a “final” Docker image which just copies the code on top of the base image as a final layer.
The important part is that these are distinct images and not just separate layers, which is how we work around the inconsistent layer caching behavior.
Our “final” Dockerfile just looks like:
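Something along these lines, where the base image name and tag are placeholders:

```dockerfile
# "Final" image: start from the prebuilt base image and add only the code.
FROM registry.example.com/app-base:2021-07-22
COPY . /app
```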
Downside: This setup makes it harder to test changes to the base image. Instead of just updating a single Dockerfile and building+pushing, you need to (1) change the “base” Dockerfile/dependencies, (2) build and push the base image to your container host with a new tag for testing, and (3) edit the “final” Dockerfile to reference the new testing tag. I wrote a Python script to do (2) and (3), so testing changes to our base image is still pretty streamlined.

Note: It would be some more work to make this fully integrated with CI such that the base image used in prod is also built in CI. Currently, we just use the base images built on local machines when people make changes. This is acceptable to us, but maybe some people have more stringent requirements.
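For concreteness, steps (2) and (3) boil down to something like this (registry, image names, and tag scheme are placeholders, not the actual script):

```bash
# Build and push the base image under a fresh testing tag.
TAG="base-test-$(date +%Y%m%d-%H%M%S)"
docker build -f Dockerfile.base -t "registry.example.com/app-base:$TAG" .
docker push "registry.example.com/app-base:$TAG"

# Then point the "final" Dockerfile's FROM line at the new tag:
#   FROM registry.example.com/app-base:$TAG
```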
Overall, this has definitely been worth it for us, especially since our base image is huge (3GB of Python ML dependencies) and takes a long time to build, so cache misses were extremely painful.
Benefits:
- `docker build` for code-only changes is guaranteed to only copy the code layer.
- `docker push` for the new code-only layers is also guaranteed to be fast (when the cache would break for base layers before, people would have to upload 3GB of data, sometimes over spotty WiFi or while tethering; pre-pulling with `docker pull` never worked for this, in my experience).
- No more `--cache-from` flags and extra `docker push` calls to get caching in CI builds. (Though see the note above about fully integrating this process in CI.)

Two issues I’m noticing with using the docker-container driver to work around the caching issue:
1. the export/import steps
2. `docker push` seems to be pushing all layers?

With the default driver, rebuilds of code-only changes take ~1 minute (when I get proper caching of the expensive layers in my image). With the docker-container driver, these 2 factors mean rebuilds after code-only changes take ~4-5 minutes.
export/import steps
This seems to add an extra minute to the build. I’m working with large images (~3.5gb from various scientific Python libraries), which I’m guessing exacerbates this issue.
docker push issue

Pushing my 3.5gb image takes ~3 minutes. It seems that with the docker-container driver, `docker push` isn’t able to see that the expensive layers are shared, and it’s pushing all the layers instead of only pushing the new layers? I’m guessing this based on the output from `docker push` not saying “Layer already exists”.

I push several tags. The first push takes 3 minutes, and the rest of the tags finish quickly as they all say “Layer already exists” for all the layers.
Same issue here (Debian Bullseye)