buildkit: Docker BuildKit caching w/ --cache-from fails (roughly 50% of the time), even when using docker-container

Version information:

  • OS: Ubuntu 20.04.2 LTS, x86_64
  • Docker Server: 20.10.7
  • Docker Client: 20.10.7
  • docker buildx version: v0.5.1-docker
  • BuildKit: moby/buildkit:v0.9.0

Steps to reproduce

Create Dockerfile:

FROM busybox AS stage-1
RUN echo "Hello, world!!!"
COPY changed.txt /opt/changed.txt

FROM busybox
COPY --from=stage-1 /opt/changed.txt /opt/changed.txt

Run a script like the following (REGISTRY should be replaced with an actual value):

#!/bin/bash
# Recreate the builder to clear the local cache
docker buildx rm cachebug || true
docker buildx create --name cachebug --driver docker-container
docker buildx inspect cachebug --bootstrap

# Create a file whose content changes on every run
date > changed.txt

# Run the build
REGISTRY=registry.example.net/test-docker/example
docker buildx build \
    --builder cachebug \
    --push \
    --tag $REGISTRY:latest \
    --cache-from type=registry,ref=$REGISTRY:buildcache \
    --cache-to type=registry,ref=$REGISTRY:buildcache,mode=max \
    --platform linux/amd64 \
    --platform linux/arm64 \
    .

What I see: when I run the above script multiple times, the step RUN echo "Hello, world!!!" misses the cache roughly every second time for one of the platforms (I have never seen the cache miss on all platforms at the same time).

For example:

 => CACHED [linux/arm64 stage-1 2/3] RUN echo "Hello, world!!!"                                      0.3s
 => => sha256:e2f4ee50b555089a69b84af6621283565af19e3bcf0596b36ba5feec7b96d1d7 116B / 116B           0.2s
 => => sha256:38cc3b49dbab817c9404b9a301d1f673d4b0c2e3497dbcfbea2be77516679682 820.69kB / 820.69kB   0.6s
 => => extracting sha256:38cc3b49dbab817c9404b9a301d1f673d4b0c2e3497dbcfbea2be77516679682            0.1s
 => => extracting sha256:e2f4ee50b555089a69b84af6621283565af19e3bcf0596b36ba5feec7b96d1d7            0.1s
 => [linux/amd64 stage-1 2/3] RUN echo "Hello, world!!!"                                             0.3s
 => [linux/amd64 stage-1 3/3] COPY changed.txt /opt/changed.txt                                      0.2s
 => [linux/arm64 stage-1 3/3] COPY changed.txt /opt/changed.txt                                      0.2s
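
To tally which platform missed the cache across repeated runs, the step lines of an excerpt like the one above can be reduced to HIT/MISS pairs. This is only a minimal sketch and not part of the original report: the function name cache_report is hypothetical, and it assumes the log has exactly the layout shown above (output captured with --progress=plain is laid out differently and would need another pattern).

```shell
#!/bin/bash
# Summarise per-platform cache hits and misses from buildx progress output.
# Assumption: step lines look like the excerpt above, i.e.
#   => [CACHED] [<os>/<arch> <stage> <n>/<m>] <instruction> ... <seconds>s
cache_report() {
    awk '
        # Skip nested "=> =>" lines (layer downloads and extraction).
        /^ *=> =>/ { next }
        /^ *=> / {
            status = ($2 == "CACHED") ? "HIT" : "MISS"
            # Keep only platform steps such as "[linux/arm64 stage-1 2/3]";
            # lines like "[internal] load build definition" do not match.
            if (match($0, /\[[a-z0-9]+\/[^] ]+ [^]]+\]/)) {
                print status, substr($0, RSTART + 1, RLENGTH - 2)
            }
        }
    '
}
```

Piping a saved excerpt through cache_report prints one line per step, for example "HIT linux/arm64 stage-1 2/3" followed by "MISS linux/amd64 stage-1 2/3", which makes it easier to see that the same RUN step is cached for one platform and rebuilt for the other.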

Update (2021-08-18)

Repository to reproduce the issue: https://github.com/bozaro/buildkit-2279 (simply check it out and run ./test.sh).

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 3
  • Comments: 24 (2 by maintainers)

Most upvoted comments

if there is a way we can contribute, let us know

You can debug the code, find the root of the issue, fix it and submit a PR.

Ok @ghost-of-montevideo, just check that you are not actually affected by the confusingly similar #2274.

I was under the impression that the present issue is fixed and only affects the outdated buildkit vendored out-of-the-box within docker, as I was unable to reproduce it when using an updated buildkit plugin and the docker-container build driver. Unlike the present issue, #2274 reproduces consistently for me.

@n1ngu I tried upgrading and the issue persists, multi-stage is broken

my fix was simply to not use multi-stage builds, but that of course is not a useful solution for everybody

I investigated further: it looks like the cache is corrupted before the upload stage. I compared the bad and good cache data (https://drive.google.com/drive/folders/1hzMWF_qBANvFmf3BeuQKe7KToGUOJp72?usp=sharing) and found a small difference.

Found difference

Good cache

cache_good contains two layers for the RUN step:

  • sha256:48ffe3fe97d4a7f3ad3e09f613161e6f1a4f6b836751f0f0c93c8fd5ea92064a (linux/arm64)
  • sha256:52af553f4ee5a60ea433453c95375e457f139988034d49244afcc64b08e3331e (linux/amd64)

Bad cache

cache_bad contains only one layer for the RUN step:

  • sha256:52af553f4ee5a60ea433453c95375e457f139988034d49244afcc64b08e3331e (linux/amd64)
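
The comparison above can be repeated mechanically once the layer digests of each cache have been written to a file, one per line. The following is a minimal sketch under that assumption; the function name missing_layers and the file names good.txt and bad.txt are hypothetical, and how the digest lists are extracted from the cache data is left out.

```shell
#!/bin/bash
# Print the digests listed in the first file that are absent from the second.
# Each argument is a text file with one "sha256:..." digest per line.
missing_layers() {
    # -F: fixed strings, -x: whole-line match, -v: invert, -f: patterns from file.
    # grep exits with status 1 when nothing is missing, hence the "|| true".
    grep -vxF -f "$2" "$1" || true
}
```

With the digests listed above, missing_layers good.txt bad.txt would print only the linux/arm64 layer sha256:48ffe3fe97d4a7f3ad3e09f613161e6f1a4f6b836751f0f0c93c8fd5ea92064a, which is the layer absent from the bad cache.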

Layer graph

graph
good --> g_38cc3[38cc3: FROM arm64]
good --> g_b71f9[b71f9: FROM amd64]
g_38cc3 --> g_48ffe[48ffe: RUN]
g_b71f9 --> g_52af5[52af5: RUN]
g_b71f9 --> g_7c10d[7c10d: COPY]
g_52af5 --> g_7c10d
g_38cc3 --> g_7c10d
g_48ffe --> g_7c10d

bad --> b_38cc3[38cc3: FROM arm64]
bad --> b_b71f9[b71f9: FROM amd64]
b_b71f9 --> b_2b352[2b352: COPY]
b_38cc3 --> b_2b352
b_b71f9 --> b_52af5[52af5: RUN]
b_52af5 --> b_ca41e[ca41e: COPY]


I spent a while on this issue and could not find a good resolution using buildkit, and there are many unresolved issues about this across the web. After seeing https://github.com/moby/buildkit/issues/1981#issuecomment-1516704608 and another report on Stack Overflow, I switched our GitLab Cloud CI/CD pipelines to use buildah instead of buildkit, and caching is working well now.

It’s (almost) a drop-in replacement, and with overlay2 as the storage driver, the build performance (caching disabled) seems to be the same as with buildkit.

Here’s a sample of what that could look like in a .gitlab-ci.yml file:

.build_template: &build_template
  image: quay.io/buildah/stable
  before_script:
    - buildah login [Login Params] [Registry URL]
  script:
    - buildah build
      --cache-from registry.example.com/myproject/myapp
      --tag registry.example.com/myproject/myapp:${CI_COMMIT_SHA}
      --cache-to registry.example.com/myproject/myapp
      --layers
      --storage-driver overlay2
      -f $DOCKERFILE_PATH $BUILD_CONTEXT_PATH
    - buildah push registry.example.com/myproject/myapp:${CI_COMMIT_SHA}

I'm also happy to help a bit; if there is a way we can contribute, let us know.