buildkit: COPY --from consistently fails with ZFS storage driver
Building any Dockerfile that uses COPY --from fails with the following error when the previous layer isn’t already in the cache:
$ DOCKER_BUILDKIT=1 docker build -t $CONTAINER_REGISTRY/avengers/multiplexer-operator:adamslevy-test -f ./cmd/cp-metrics-multiplexer/Dockerfile --build-arg GOPROXY=http://localhost .
[+] Building 14.2s (14/14) FINISHED
=> [internal] load build definition from Dockerfile 0.1s
=> => transferring dockerfile: 38B 0.0s
=> [internal] load .dockerignore 0.2s
=> => transferring context: 35B 0.0s
=> [internal] load metadata for docker.io/library/gobase:latest 0.0s
=> [internal] load metadata for docker.io/library/alpine:3.10.2 0.0s
=> [builder 1/2] FROM docker.io/library/gobase 0.0s
=> CACHED [stage-1 1/3] FROM docker.io/library/alpine:3.10.2 0.0s
=> [internal] load build context 1.3s
=> => transferring context: 284.53kB 1.2s
=> CACHED [builder 2/2] COPY Makefile go.mod go.sum ./ 0.0s
=> CACHED [builder 3/2] COPY internal ./internal 0.0s
=> CACHED [builder 4/2] COPY pkg ./pkg 0.0s
=> [builder 5/2] COPY cmd ./cmd 2.0s
=> [builder 6/2] RUN make multiplexer-operator 9.8s
=> CANCELED [stage-1 2/3] COPY --from=builder /src/bin/multiplexer-operator /multiplexer-operator 0.3s
=> ERROR [stage-1 3/3] COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/ca-certificates.crt 0.0s
------
> [stage-1 3/3] COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/ca-certificates.crt:
------
failed to solve with frontend dockerfile.v0: failed to build LLB: failed to compute cache key: error creating zfs mount: mount zroot/ROOT/default/p2ktr8qsvusedj4edsddrepei:/var/lib/docker/zfs/graph/p2ktr8qsvusedj4edsddrepei: no such file or directory
Other info:
$ docker info
Client:
Debug Mode: false
Server:
Containers: 35
Running: 0
Paused: 0
Stopped: 35
Images: 630
Server Version: 19.03.13-ce
Storage Driver: zfs
Zpool: zroot
Zpool Health: ONLINE
Parent Dataset: zroot/ROOT/default
Space Used By Parent: 217493699584
Space Available: 753881241088
Parent Quota: no
Compression: lz4
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: c623d1b36f09f8ef6536a057bd658b3aa8632828.m
runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 5.9.1-zen2-1-zen
Operating System: Arch Linux
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.25GiB
Name: clevo.adamarch
ID: GV3Y:62D5:V5TB:TQZG:U6QA:MOAL:MVLT:7ORI:VJ2V:4W6L:4EQJ:2W7R
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Please let me know if I can provide any further helpful info.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 12
- Comments: 39 (7 by maintainers)
Commits related to this issue
- Add locking to the ZFS driver Trying to build Docker images with buildkit using a ZFS-backed storage was unreliable due to apparent race condition between adding and removing layers to the storage (s... — committed to jaen/moby by jaen 2 years ago
- Add locking to the ZFS driver Trying to build Docker images with buildkit using a ZFS-backed storage was unreliable due to apparent race condition between adding and removing layers to the storage (s... — committed to 3nprob/moby by jaen 2 years ago
Ok, so I have built docker with this commit: https://github.com/jaen/moby/commit/a4ab9f592a802215be69a5463f48d763c7fd8705 and it seems to fix the issue. I can’t reproduce it using the script I posted above anymore.
Made a PR with it, @saiarcot895 thanks a lot for pointing me in the right direction!
Yup, let’s go ahead and close this one 👍
@oramirite I’ve since moved to NixOS so I can’t easily test that this works for sure, but I think you would just want to use docker-git’s PKGBUILD (see: https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=docker-git) and change the moby repository (first entry of the sources array) to point to my fork (and branch; apparently that means adding #branch=zfs-driver-fix to the end, cf. https://wiki.archlinux.org/title/VCS_package_guidelines#VCS_sources).
@tonistiigi I see locks on the complete Get(), Put(), and Remove() methods for the overlay, aufs, and devmapper drivers, added in moby/moby@fc1cf19. The critical thing here is that the entire method has a lock (for each ID), so that the above case where one thread is going through Put() and another thread is going through Get() concurrently doesn’t happen.
@saiarcot895 If you compare this with overlay/aufs then afaics there the reference counter logic is always behind an id-based locker.
With BuildKit, it is possible for Get() and Put() to be called from different threads within the dockerd process. Additionally, they may be called multiple times in succession with the same ID (e.g. Get() can be called with the same ID 3 times in a row for some cache mount).
Briefly looking at the zfs code, I could maybe see how there can be a race condition related to the mount directory being missing.
Assume that there are two threads, and some layer with ID 824f183 has been mounted once by thread 1. Now, thread 1 wants to unmount it, and thread 2 wants to mount that same layer. Here’s a possible execution sequence which will result in the above error:
In this sequence, the only mutex/lock that is present is in the refcount increment/decrement (this is from code outside of the ZFS file); the rest don’t have any mutexes.
@tonistiigi I think I have a fairly minimal reproduction, a script follows:
This fails fairly reliably for me if I run docker system prune -a first. Sometimes it will just complain about the cache key thing, sometimes it will also complain about the Dockerfile missing, but not always. Output follows:
Any idea how hard this could be to fix? I’m using ZFS and I need to use buildkit with it, so it’s a fairly blocking issue for me.
Please provide a reproducible testcase.