sysbox: Docker build inside sysbox container results in "lchown ... no such file or directory" errors

Hi there,

I am attempting to solve our CI/CD woes using sysbox and I was really excited to have it working, until it didn’t.

Running dotnet restore with a .NET Core image inside a docker build fails with a very generic message:

Error processing tar file(exit status 1): lchown /tmp/clr-debug-pipe-202-24216845-in: no such file or directory

see here and here for more information

I am able to solve this by adding COMPlus_EnableDiagnostics=0 as an ENV in the Dockerfile or by passing it from docker-compose and using ARG in Dockerfile. However, I really don’t want to have to alter a ton of Dockerfiles for a bunch of microservices, and I don’t want to have to disable debugging, which is what that flag does.

How to reproduce: create a Dockerfile based on the mcr.microsoft.com/dotnet/core/sdk:3.1-buster image that runs dotnet restore, or pull a dotnet repo whose Dockerfile already does so, and build it.

Things I have tried:

running on normal Docker/non-dind = works as intended
running on dind using the privileged flag, mounting /var/lib/docker as a volume and running nested = works
running with sysbox as runtime and:

added cap_add - ALL to first docker-compose = fails
added cap_add - ALL to inner docker-compose = fails

I was able to strace the docker daemon both when using standard docker and when using dind with sysbox. Here are a few snippets:

standard:

mknodat(AT_FDCWD, "/tmp/clr-debug-pipe-78-63381236-in", S_IFIFO|0700) = 0
fchownat(AT_FDCWD, "/tmp/clr-debug-pipe-78-63381236-in", 0, 0, AT_SYMLINK_NOFOLLOW) = 0
fchmodat(AT_FDCWD, "/tmp/clr-debug-pipe-78-63381236-in", 0700) = 0
utimensat(AT_FDCWD, "/tmp/clr-debug-pipe-78-63381236-in", [{tv_sec=1610522665, tv_nsec=0} /* 2021-01-13T07:24:25+0000 */, {tv_sec=1610522665, tv_nsec=0} /* 2021-01-13T07:24:25+0000 */], 0) = 0

sysbox:

newfstatat(AT_FDCWD, "/var/lib/docker/overlay2/ca60dd45565e9b2b10754f95f7058ff401485ccca62253f28a522f105018c9b2-init/merged/tmp/clr-debug-pipe-225-63286646-in", 0xc00192e6b8, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/var/lib/docker/overlay2/ca60dd45565e9b2b10754f95f7058ff401485ccca62253f28a522f105018c9b2/merged/tmp/clr-debug-pipe-225-63286646-in", {st_mode=S_IFIFO|0700, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
lgetxattr("/var/lib/docker/overlay2/ca60dd45565e9b2b10754f95f7058ff401485ccca62253f28a522f105018c9b2/merged/tmp/clr-debug-pipe-225-63286646-in", "security.capability", 0xc00192a700, 128) = -1 ENODATA (No data available)
newfstatat(AT_FDCWD, "/var/lib/docker/overlay2/ca60dd45565e9b2b10754f95f7058ff401485ccca62253f28a522f105018c9b2-init/merged/tmp/clr-debug-pipe-225-63286646-out", 0xc00192e858, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)

As you can see, under sysbox the daemon doesn’t seem to issue any of the syscalls such as mknodat, fchownat or fchmodat, which is why I was hoping that adding cap_add would solve this. Both containers are running as root.

Any help on this would be much appreciated!

About this issue

State: closed
Created 3 years ago
Comments: 34 (17 by maintainers)

Most upvoted comments

Hi @nudgegoonies, had to dig a bit to get to the bottom of this one, but I think I’ve found the reason for the problem.

First, I reproduced the problem by launching a sysbox container (with the nestybox/ubuntu-focal-systemd-docker image), and inside of it launching the docker CLI and docker daemon containers as follows:

$ docker network create some-network
$ docker volume create --name docker-dind
$ docker pull docker:20.10.2-dind

# Inner Docker dind container:
$ docker run --privileged --name dind -d -v docker-dind:/var/lib/docker --network some-network --network-alias docker -e DOCKER_TLS_CERTDIR=/certs -v dind-certs-ca:/certs/ca -v dind-certs-client:/certs/client docker:20.10.2-dind

# Inner Docker CLI container:
$ docker run -it --rm --network some-network -e DOCKER_TLS_CERTDIR=/certs -v dind-certs-client:/certs/client:ro docker:latest sh

Then, from the inner Docker CLI container, I pulled the oracle database container image you mentioned above.

The pull failed with:

failed to register layer: Error processing tar file(exit status 1): lchown /dev/initctl: no such file or directory

I then straced the docker pull operation and found that the failure occurs in the fchownat() syscall below:

2407193 newfstatat(AT_FDCWD, "/dev/initctl", 0xc000a6eac8, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
2407193 fchownat(AT_FDCWD, "/dev/initctl", 0, 0, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
2407193 write(2, "lchown /dev/initctl: no such fil"..., 46) = 46

Basically, it looks like this image requires /dev/initctl; as a result, Docker is looking for /dev/initctl during the image extraction but this device does not exist within the ephemeral docker container (spawned inside the dind container) where the extraction is taking place.

I then repeated the experiment by running the same commands above, but this time at host level (i.e., not inside the sysbox container). Interestingly, this time things worked. When I straced the docker daemon, I found the following:

2496154 newfstatat(AT_FDCWD, "/dev/initctl",  <unfinished ...>
2496154 <... newfstatat resumed>0xc000a3cc68, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
2496154 mknodat(AT_FDCWD, "/dev/initctl", S_IFIFO|0600 <unfinished ...>
2496154 <... mknodat resumed>)          = 0
2496154 fchownat(AT_FDCWD, "/dev/initctl", 0, 0, AT_SYMLINK_NOFOLLOW <unfinished ...>
2496154 <... fchownat resumed>)         = 0

Notice the difference: docker called mknod on /dev/initctl during the image extraction. As a result, the subsequent fchownat() worked fine.
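To make the difference concrete, here is a minimal Go sketch (my own illustration, not Docker code) that reproduces both sequences in a temporary directory. The helper names chownFifo and demo are hypothetical, and the FIFO is chowned to the current user so that no privileges are needed.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// chownFifo lchowns path. When createFirst is true it creates the FIFO
// beforehand, which is the step the host-level daemon performed via mknodat.
func chownFifo(path string, createFirst bool) error {
	if createFirst {
		if err := syscall.Mkfifo(path, 0o600); err != nil {
			return fmt.Errorf("mkfifo: %w", err)
		}
	}
	// Chown to our own uid/gid so the call works without root.
	return os.Lchown(path, os.Getuid(), os.Getgid())
}

// demo runs both sequences and reports whether the first one failed with
// ENOENT and whether the second one succeeded.
func demo() (missingIsENOENT bool, createdOK bool) {
	dir, err := os.MkdirTemp("", "initctl-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "initctl")
	errMissing := chownFifo(path, false) // no mknod: nothing to chown
	errCreated := chownFifo(path, true)  // mknod first, then chown
	return errors.Is(errMissing, syscall.ENOENT), errCreated == nil
}

func main() {
	missing, ok := demo()
	fmt.Println("lchown without mknod returned ENOENT:", missing)
	fmt.Println("lchown after mkfifo succeeded:", ok)
}
```

The first call mirrors the failing trace (fchownat on a path that was never created), the second mirrors the working one.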

So why did Docker not call mknod when the dind image ran inside the sysbox container, but did call it when the dind image ran on the host?

Looking at the Docker code, it appears the answer is here:

case mode&os.ModeDevice != 0:
   if sys.RunningInUserNS() {
      // cannot create a device if running in user namespace
      return nil
   }
   if err := unix.Mknod(dstPath, stat.Mode, int(stat.Rdev)); err != nil {
      return err
   }

Since Sysbox containers always use the Linux user-namespace (for strong isolation), the Docker daemon running inside the inner dind container is refusing to use mknod to create the /dev/initctl device required by the Oracle image. As a result, the subsequent fchownat() fails.
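For reference, a common way to detect a user namespace is to look at /proc/self/uid_map: in the initial namespace it contains the single identity mapping "0 0 4294967295". The sketch below assumes that approach; the function names are hypothetical and I have not checked that Docker's sys.RunningInUserNS() is implemented exactly this way.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// uidMapIsInitial reports whether the given uid_map content is the single
// identity mapping "0 0 4294967295" seen in the initial user namespace.
func uidMapIsInitial(content string) bool {
	lines := strings.Split(strings.TrimSpace(content), "\n")
	if len(lines) != 1 {
		return false
	}
	fields := strings.Fields(lines[0])
	return len(fields) == 3 &&
		fields[0] == "0" && fields[1] == "0" && fields[2] == "4294967295"
}

// runningInUserNS is a hypothetical stand-in for the check Docker performs.
// A missing uid_map is treated as "not in a user namespace".
func runningInUserNS() bool {
	data, err := os.ReadFile("/proc/self/uid_map")
	if err != nil {
		return false
	}
	return !uidMapIsInitial(string(data))
}

func main() {
	fmt.Println("running in a user namespace:", runningInUserNS())
}
```

Inside a Sysbox container the mapping is never the identity mapping, so a check of this kind reports true and the mknod branch is skipped.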

This explains the failure. It’s really caused by Docker’s assumption that mknod is not allowed within a user-ns. This is generally true, but does not take into account that container runtimes like Sysbox (or LXD, for example) can deal properly with such operations by intercepting the mknod syscall, examining if it’s allowed, and if so handling it on behalf of the container. Thus, it would be better if Docker had called mknod() first and, only if that failed, checked whether it is running in a user-ns.
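A rough Go sketch of that ordering is shown below. It is only an illustration of the idea, not a patch against Docker: createDevice, mknodFunc and simulate are hypothetical names, the mknod call is injected (real code could pass something like syscall.Mknod), and treating only EPERM as the "skip" case is my assumption.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// mknodFunc is the injected mknod call, so the logic can be exercised
// without privileges.
type mknodFunc func(path string, mode uint32, dev int) error

// createDevice always attempts mknod first. The node is skipped silently
// only when mknod fails with EPERM and we are inside a user namespace.
func createDevice(mknod mknodFunc, inUserNS func() bool, path string, mode uint32, dev int) (created bool, err error) {
	err = mknod(path, mode, dev)
	if err == nil {
		return true, nil // the runtime (or kernel) allowed it
	}
	if errors.Is(err, syscall.EPERM) && inUserNS() {
		return false, nil // previous behaviour, but only as a fallback
	}
	return false, err
}

// simulate runs createDevice with a fake mknod that either succeeds or is
// denied with EPERM, and a fake user-namespace check.
func simulate(denied, userNS bool) (bool, error) {
	mknod := func(string, uint32, int) error {
		if denied {
			return syscall.EPERM
		}
		return nil
	}
	return createDevice(mknod, func() bool { return userNS },
		"/dev/initctl", syscall.S_IFIFO|0o600, 0)
}

func main() {
	cases := []struct{ denied, userNS bool }{
		{false, true}, // e.g. a runtime that handles mknod for the container
		{true, true},  // plain user namespace: skip the node
		{true, false}, // host without permission: report the error
	}
	for _, c := range cases {
		created, err := simulate(c.denied, c.userNS)
		fmt.Printf("denied=%v userns=%v -> created=%v err=%v\n",
			c.denied, c.userNS, created, err)
	}
}
```

With this ordering, a runtime that services mknod on behalf of the container gets the node created, while a plain user namespace still falls back to skipping it.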

As for a solution, I don’t have a good one right now. The only work-around I found was to use the docker:19.03.2-dind image instead of the docker:20.10.2-dind image (which suggests the user-ns check in the Docker source code I copied above must have been added recently).

I’ll think about whether there is some other solution to make this work with docker:20.10.2-dind.

ctalledo on May 29, 2021

Closing due to inactivity; please re-open if problem re-occurs.

ctalledo on Aug 15, 2023

This image is also affected when simply pulling it inside dind: gitlab/gitlab-ee:13.12.6-ee.0

nudgegoonies on Jul 5, 2021

Hi @nudgegoonies,

One question comes to my mind as this behavior comes from using userns. Would shiftfs help in this situation? There are already “unofficial” dkms solutions available for running the shiftfs kernel module on Debian.

No, it won’t, unfortunately. Sysbox always creates containers with the Linux user-namespace (for strong isolation), regardless of whether shiftfs is present or not. Thus, the inner Docker will refuse to mknod() and the docker pull of the oracle database container image will fail.

The presence of shiftfs in the kernel is complementary to user-ns: if present in the kernel, it means Docker can continue to create the container’s filesystem with host root:root ownership, yet the container will have access to it even though the container’s root user is not the host’s root user (by virtue of Sysbox using the user-namespace). Without shiftfs, Docker needs to create the container’s filesystem with ownership that matches the container’s root user (i.e., Docker must be configured in userns-remap mode).

Hope that helps.

ctalledo on Jun 9, 2021