moby: error creating zfs mount: no such file or directory

Description

docker build intermittently fails with a "no such file or directory" error when /var/lib/docker resides on a ZFS filesystem.

Steps to reproduce the issue: Execute docker build when /var/lib/docker is mounted on a ZFS filesystem.
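For reference, a minimal setup sketch; the dataset name kpool/docker matches the environment shown below, and the image tag is illustrative only:

# zfs create -o mountpoint=/var/lib/docker kpool/docker    # dedicated dataset as the Docker data root
# systemctl restart docker                                 # the engine auto-selects the zfs storage driver
# docker build --no-cache -t example/app .                 # repeat; the error shows up intermittently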

Describe the results you received:

Step 7/7 : COPY ${APP_BINARY_FILENAME} ${TOMCAT_HOME}/webapps/${APP_BINARY_FILENAME}
error creating zfs mount of kpool/docker/192a72289b6040ea928e8873c5f2695029bf856ada2c232df312a17775ddae9a to /var/lib/docker/zfs/graph/192a72289b6040ea928e8873c5f2695029bf856ada2c232df312a17775ddae9a: no such file or directory
ERROR: Job failed: error executing remote command: command terminated with non-zero exit code: Error executing in Docker Container: 1

Describe the results you expected: No errors

Additional information you deem important (e.g. issue happens only occasionally): The error is intermittent; docker build usually succeeds after three to four retries, which suggests a race condition.

# df -h /var/lib/docker
Filesystem      Size  Used Avail Use% Mounted on
kpool/docker     45G   14M   45G   1% /var/lib/docker
# rpm -q zfs
zfs-0.7.8-1.el7_4.x86_64

Output of docker version:

Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:20:16 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:23:58 2018
  OS/Arch:      linux/amd64
  Experimental: false

Output of docker info:

Containers: 14
 Running: 14
 Paused: 0
 Stopped: 0
Images: 49
Server Version: 18.03.1-ce
Storage Driver: zfs
 Zpool: kpool
 Zpool Health: ONLINE
 Parent Dataset: kpool/docker
 Space Used By Parent: 3222265856
 Space Available: 48431497216
 Parent Quota: no
 Compression: lz4
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-862.2.3.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.5 (Maipo)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 15.51GiB
Name: hseeckm01
ID: VE37:YIDL:4NAH:TOQR:SUED:MTR5:MZBK:ZJI6:BVQV:CGMR:TVRX:XA2X
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
HTTP Proxy: http://proxy.company.com:8080
HTTPS Proxy: http://proxy.company.com:8080
No Proxy: localhost,127.0.0.1,.internal.company.com
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: bridge-nf-call-ip6tables is disabled

Additional environment details (AWS, VirtualBox, physical, etc.):

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 6
  • Comments: 17

Most upvoted comments

PS: the only workaround I found for this issue is not to use the zfs storage driver at all (explicitly set "storage-driver": "aufs" in /etc/docker/daemon.json). Of course, if you want to do this on an existing server, you will have to back up your volumes, destroy your data root, and re-create it.
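A minimal sketch of that workaround, assuming a systemd host like the one in the report (the data root still has to be backed up and re-created, as noted above):

# cat /etc/docker/daemon.json
{
  "storage-driver": "aufs"
}
# systemctl restart docker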

@stephan2012 I can confirm that this is still happening with containerd.io 1.4.3 and docker-ce 20.10.3 (from Docker’s official repo for Debian Buster).

This seems to be a race condition that occurs when Docker’s data root is on a ZFS volume and the build is a multi-stage build containing a COPY --from=... instruction. If the source stage takes more time to build than the destination stage (and is not already in the cache), the bug is triggered.

The reason it seems to work every n-th time is that the race condition is not triggered when the previous build step is already in the cache. The way to reproduce it is:

  1. use a ZFS installation, meaning that the data root is on a ZFS volume (even if daemon.json does not explicitly set "storage-driver": "zfs")
  2. take a moderately complex multi-stage build with COPY --from=... commands; any such build should trigger the bug, as long as the source stage takes more time to build than the destination stage (see the sketch after this list)
  3. perform a build with --no-cache
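
A hypothetical Dockerfile along those lines; the stage names, base images, and build command are illustrative only. The builder stage takes noticeably longer than the final stage, which is the condition described above:

# cat Dockerfile
FROM golang:1.15 AS builder              # source stage: compiles, so it takes a while
WORKDIR /src
COPY . .
RUN go build -o /out/app .

FROM debian:buster-slim                  # destination stage: builds almost instantly
COPY --from=builder /out/app /usr/local/bin/app
ENTRYPOINT ["/usr/local/bin/app"]

# docker build --no-cache -t zfs-race-test .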

See also https://github.com/moby/buildkit/issues/1758; it is the same bug, IMHO.