x11docker: Read-only file system (--init=systemd) [cgroupv2 not supported yet]

Hi,

I was trying to run Steam like so:

x11-docker steam --init=systemd --gpu --pulseaudio --home=/home/archbung/.local/share/Steam -V

where steam is a Docker image built using the following Dockerfile:

FROM ubuntu:20.10

ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Europe/Berlin

# Update and install packages
RUN dpkg --add-architecture i386 \
    && apt-get update -y \
    && apt-get install -y gdebi \
    libc6:i386 \
    libgl1-mesa-dri:i386 \
    libgl1:i386 \
    pciutils \
    wget \
    xdg-desktop-portal \
    xdg-desktop-portal-gtk \
    xdg-utils \
    xterm

WORKDIR /tmp

RUN wget http://media.steampowered.com/client/installer/steam.deb && gdebi -n steam.deb
CMD ["steam"]

However, x11docker terminated with the following error:

Welcome to Ubuntu 20.10!

Set hostname to <ba7666b47c2c>.
Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

Could you give me some tips on troubleshooting this issue? The full x11docker.log can be found here.

Cheers,

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 24 (14 by maintainers)

Most upvoted comments

The recommended way for running systemd in docker seems to be [1][2]: --privileged --cgroupns=host -v /sys/fs/cgroup:/sys/fs/cgroup:rw

Since I generally do not like to use =host options I tried to replicate what podman does with the docker cli. It is a bit hacky but it seems to be working. Tested on a headless Debian 11 system with docker.io+runc (container is fedora httpd with systemd). Systemd correctly detects and uses cgroupv2 (default-hierarchy=unified). I did not have time to check how one would integrate this with x11docker.

Based on podman container_internal_linux.go.

options=rw,rprivate,nosuid,nodev
docker run \
--tmpfs /run:$options \
--tmpfs /run/lock:$options \
--tmpfs /tmp:$options \
--tmpfs /var/log/journal:$options \
--cgroupns=private \
--rm \
--name t1 \
-d \
sysd \
/bin/sh -c 'sleep infinity; exec /sbin/init'
nsenter -t $(docker inspect -f '{{.State.Pid}}' t1) -m -p /bin/sh -c 'mount -o remount,rw /sys/fs/cgroup/ ; pkill sleep'
root@x11docker-test:~# docker top t1
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                20649               20631               0                   22:43               ?                   00:00:00            /bin/sh -c sleep infinity; exec /sbin/init
root                20682               20649               0                   22:43               ?                   00:00:00            sleep infinity
root@x11docker-test:~# nsenter -t $(docker inspect -f '{{.State.Pid}}' t1) -m -p /bin/sh -c 'mount -o remount,rw /sys/fs/cgroup/ ; pkill sleep'
root@x11docker-test:~# docker top t1
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                20649               20631               0                   22:43               ?                   00:00:00            /sbin/init
81                  20745               20649               0                   22:43               ?                   00:00:00            /usr/bin/dbus-broker-launch --scope system --audit
81                  20746               20745               0                   22:43               ?                   00:00:00            dbus-broker --log 4 --controller 9 --machine-id 7ef69b02670d444395b88e5c297fdbd0 --max-bytes 536870912 --max-fds 4096 --max-matches 16384 --audit
root                20747               20649               0                   22:43               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
48                  20748               20747               0                   22:43               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
48                  20749               20747               0                   22:43               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
48                  20750               20747               0                   22:43               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
48                  20752               20747               0                   22:43               ?                   00:00:00            /usr/sbin/httpd -DFOREGROUND
root                20727               20649               0                   22:43               ?                   00:00:00            /usr/lib/systemd/systemd-journald
root                20743               20649               0                   22:43               ?                   00:00:00            /usr/lib/systemd/systemd-logind
systemd+            20734               20649               0                   22:43               ?                   00:00:00            /usr/lib/systemd/systemd-oomd
193                 20735               20649               0                   22:43               ?                   00:00:00            /usr/lib/systemd/systemd-resolved
root                20737               20649               0                   22:43               ?                   00:00:00            /usr/lib/systemd/systemd-userdbd
root                20738               20737               0                   22:43               ?                   00:00:00            systemd-userwork
root                20739               20737               0                   22:43               ?                   00:00:00            systemd-userwork
root                20740               20737               0                   22:43               ?                   00:00:00            systemd-userwork
root                20742               20737               0                   22:43               ?                   00:00:00            systemd-userwork


systemd v249.9-1.fc35 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization docker.
Detected architecture x86-64.

Initially, I tried docker exec --privileged, but that does not work the way one would expect [3]. Creating the container with CAP_SYS_ADMIN, remounting rw, then dropping the capability and exec'ing is also a problem, because AppArmor blocks it. The nsenter method does not grant extra capabilities and also works without having to disable AppArmor. You can also wrap the nsenter command inside a docker run --privileged --pid=host [4].

that sucks yes. I’m doing what I can to protect the workload in other ways

I am not sure why your unprivileged setup fails. If you like, you could try running your worker with x11docker and its option --init=systemd to check whether a more secure setup would work.

Using --privileged --cgroupns=host -v /sys/fs/cgroup:/sys/fs/cgroup:rw, as @lukts30 pointed out in this comment (https://github.com/mviereck/x11docker/issues/349#issuecomment-1034346442), makes the difference.

Great that this helped you! However, I just want to note that this setup exposes your host to the container and is quite insecure. Don’t use it if there is any reason to distrust the container, because basically no isolation is left.

Thanks for the investigation! x11docker now uses stat for the cgroup version check.

Confusing: contrary to what I assumed, my Debian bullseye installation seems to run cgroupv2 only by default (i.e. without kernel options). The nsenter setup succeeds.

If I set the kernel option systemd.unified_cgroup_hierarchy=0 to get cgroupv1 only, I seem to get a hybrid system instead, according to a check of /sys/fs/cgroup/unified. But the nsenter setup fails in this case, and the old setup with shared host cgroups is needed.

I don’t know an option to get a real hybrid setup.

So x11docker would need two checks:

  • Running --init=systemd on a pure cgroupv1 system. (Not sure if any are out in the wild.)
  • Running --init=systemd on a real hybrid system.

Currently x11docker is configured to use the nsenter setup only on a pure cgroupv2 system. For cgroupv1 and hybrid it falls back to the old behaviour sharing host cgroups.

Additionally, instead of joining the host PID NS it is also possible to join the other container's PID NS, so docker inspect is no longer needed.

Good catch!

I’ve almost literally integrated your command in x11docker, works like a charm now. I still have to add --cap-add=SYS_PTRACE, did you remove it intentionally?

--init=systemd now works out of the box on hybrid systems and on cgroupv2-only systems. It still fails if I set the (previously recommended) GRUB kernel option systemd.unified_cgroup_hierarchy=0. In that case one has to set the x11docker option --sharecgroup to enable the old setup.

Currently I lack a way to detect whether a system is set up with cgroupv1 only even though the kernel supports cgroupv2. The check grep -q cgroup2 /proc/filesystems && Cgroupversion="v2" || Cgroupversion="v1" always results in “v2”, because /proc/filesystems lists what the kernel supports, not what is actually mounted.
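One common way to distinguish the cases is to look at what is actually mounted at /sys/fs/cgroup rather than at kernel support. The following is a minimal sketch, not x11docker's actual code: it assumes GNU stat (for -f -c %T) and assumes that a hybrid layout shows up as a tmpfs at /sys/fs/cgroup with a /sys/fs/cgroup/unified directory. The function names classify_cgroup and detect_cgroup are made up for illustration.

```shell
#!/bin/sh
# Classify a cgroup layout from two observations (hypothetical helper).
#   $1: filesystem type of /sys/fs/cgroup as printed by `stat -fc %T`
#   $2: "yes" if /sys/fs/cgroup/unified exists, anything else otherwise
classify_cgroup() {
  case "$1" in
    cgroup2fs)
      # /sys/fs/cgroup itself is a cgroup2 mount: pure cgroupv2
      echo "v2" ;;
    tmpfs)
      # v1 controllers are mounted below a tmpfs; an additional
      # "unified" directory is assumed to indicate a hybrid layout
      if [ "$2" = "yes" ]; then echo "hybrid"; else echo "v1"; fi ;;
    *)
      echo "unknown" ;;
  esac
}

# Gather the two observations on the running system and classify them.
detect_cgroup() {
  fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null)
  if [ -d /sys/fs/cgroup/unified ]; then unified="yes"; else unified="no"; fi
  classify_cgroup "$fstype" "$unified"
}

echo "cgroup setup: $(detect_cgroup)"
```

Keeping the classification separate from the probing makes it easy to check the logic without needing hosts in each configuration.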

Curious: Debian buster containers still report default-hierarchy=hybrid (but work nonetheless), while Debian bullseye containers report default-hierarchy=unified.

I have just reread the nsenter man page, and it might be good to also join the cgroup namespace (-C) in addition to the mount and PID namespaces. A remount seems to work because it is atomic, but if one were to umount and then mount instead, being in the same cgroup namespace would be required. At least that is how I understand it.

Additionally, instead of joining the host PID NS it is also possible to join the other container's PID NS, so docker inspect is no longer needed.

docker run --cap-add SYS_ADMIN --security-opt apparmor=unconfined --pid=container:t1 --rm nsfed \
nsenter -t 1 -m -p -C /bin/sh -c 'mount -o remount,rw /sys/fs/cgroup/ ; pkill sleep'

EDIT: If I apply the same procedure to a rootless podman container that was created with --systemd=false, the remount fails with EPERM, but running

umount /sys/fs/cgroup/ && mount -t cgroup2 cgroup2 /sys/fs/cgroup/ -o rw

after a podman run ... nsenter -t 1 -m -p -C still works. Should not really matter since podman has systemd support built-in but interesting to know.

The same does not work with a docker container where the daemon is running with --userns-remap. In this case /sys/fs/cgroup/ is already mounted rw but owned by real root and therefore appears to be owned by nobody from inside the container.

The issue is still present on a minimal untweaked Arch Linux install with podman and crun (cgroupv2 only).

But it can be easily fixed without modifying the grub cmdline. Just passing --systemd=always to podman makes the issue disappear.

Does not work:
x11docker --xephyr --desktop --init=systemd localhost/ubuntu:gnome

Does work:
x11docker --xephyr --desktop --init=systemd -- --systemd=always -- localhost/ubuntu:gnome

Blog post about --systemd=always and podman cgroupv2

Thanks

GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"

in /etc/default/grub fixed the issue
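
For anyone applying this workaround: an edit to /etc/default/grub only takes effect after the GRUB configuration has been regenerated (update-grub on Debian/Ubuntu, grub-mkconfig -o /boot/grub/grub.cfg on most other distributions) and the machine has been rebooted. A minimal sketch for checking afterwards whether the running kernel actually received the option. The function name is made up for illustration, and only the values 0 and 1 are recognised here:

```shell
#!/bin/sh
# Report what a kernel command line says about the unified cgroup hierarchy.
#   $1: a kernel command line string; defaults to the running kernel's
#       command line from /proc/cmdline when no argument is given
unified_hierarchy_setting() {
  cmdline=${1-$(cat /proc/cmdline 2>/dev/null)}
  # Pad with spaces so the option matches only as a whole word
  case " $cmdline " in
    *" systemd.unified_cgroup_hierarchy=0 "*) echo "disabled" ;;
    *" systemd.unified_cgroup_hierarchy=1 "*) echo "enabled" ;;
    *) echo "unset" ;;
  esac
}

echo "systemd.unified_cgroup_hierarchy: $(unified_hierarchy_setting)"
```

If this prints "unset" after a reboot, the GRUB configuration was most likely not regenerated.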