moby: Unable to run systemd in docker with ro /sys/fs/cgroup after systemd 248 host upgrade


BUG REPORT INFORMATION

I used to run docker containers with systemd as CMD without having to expose /sys/fs/cgroup as rw. This worked until the host was upgraded to systemd 248. Now it fails with:

Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

I opened a related issue on the systemd github repo: https://github.com/systemd/systemd/issues/19245

Workarounds

  • boot host with systemd.unified_cgroup_hierarchy=0
  • remove ro flag from docker run arg -v /sys/fs/cgroup:/sys/fs/cgroup:ro but this contaminates the host cgroup, causing e.g. docker top to get confused:
docker top debian-systemd
Error response from daemon: runc did not terminate successfully: container_linux.go:186: getting all container pids from cgroups caused: lstat /sys/fs/cgroup/system.slice/docker-817dfec3facbeb10c64d7b0fae478804b1177ae949e695e111b7c693569dd21a.scope: no such file or directory
: unknown
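
Both workarounds hinge on which cgroup hierarchy the host exposes, so it can help to check that first. A minimal sketch, assuming GNU coreutils stat is available; the cgroup_mode helper is a hypothetical name used only for illustration, not part of any tool:

```shell
#!/bin/sh
# Classify the cgroup layout from the filesystem type mounted at /sys/fs/cgroup.
# cgroup2fs means the unified (v2) hierarchy; tmpfs means the legacy (v1)
# or hybrid layout, where the controllers are mounted below a tmpfs.
cgroup_mode() {
    case "$1" in
        cgroup2fs) echo "unified" ;;
        tmpfs)     echo "legacy-or-hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# Fall back to "none" if the path is missing or stat is not GNU stat.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo none)
echo "host cgroup mode: $(cgroup_mode "$fstype")"
```

If this reports unified, the host is in the mode where the read-only bind mount described above stops working.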

Steps to reproduce the issue:

Dockerfile:

FROM debian:buster-slim

ENV container docker
ENV LC_ALL C
ENV DEBIAN_FRONTEND noninteractive

USER root
WORKDIR /root

RUN set -x \
    && apt-get update -y \
    && apt-get install --no-install-recommends -y systemd \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* \
    && rm -f /var/run/nologin

RUN rm -f /lib/systemd/system/multi-user.target.wants/* \
    /etc/systemd/system/*.wants/* \
    /lib/systemd/system/local-fs.target.wants/* \
    /lib/systemd/system/sockets.target.wants/*udev* \
    /lib/systemd/system/sockets.target.wants/*initctl* \
    /lib/systemd/system/sysinit.target.wants/systemd-tmpfiles-setup* \
    /lib/systemd/system/systemd-update-utmp*

VOLUME [ "/sys/fs/cgroup" ]

CMD ["/lib/systemd/systemd"]

Expected behaviour

Up to systemd v247

$ /lib/systemd/systemd --version
systemd 247 (247.4-2-arch)
+PAM +AUDIT -SELINUX -IMA -APPARMOR +SMACK -SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid
$ docker build -t debian-systemd .
$ docker run -t --tmpfs /run --tmpfs /run/lock --tmpfs /tmp -v /sys/fs/cgroup:/sys/fs/cgroup:ro debian-systemd
systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.

Welcome to Debian GNU/Linux 10 (buster)!

Set hostname to <bf431002c7c1>.
Couldn't move remaining userspace processes, ignoring: Input/output error
File /lib/systemd/system/systemd-journald.service:12 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
[  OK  ] Listening on Journal Socket.
...
[  OK  ] Reached target Graphical Interface.

Actual behaviour

Since systemd v248

$ /lib/systemd/systemd --version
systemd 248 (248-3-arch)
+PAM +AUDIT -SELINUX -APPARMOR -IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP -SYSVINIT default-hierarchy=unified

$ docker build -t debian-systemd .
$ docker run -t --tmpfs /run --tmpfs /run/lock --tmpfs /tmp -v /sys/fs/cgroup:/sys/fs/cgroup:ro debian-systemd
systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.

Welcome to Debian GNU/Linux 10 (buster)!

Set hostname to <fbb4fc19cb95>.
Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

Output of docker version:

$ docker version
Client:
 Version:           20.10.5
 API version:       1.41
 Go version:        go1.16
 Git commit:        55c4c88966
 Built:             Wed Mar  3 16:51:54 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.5
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16
  Git commit:       363e9a88a1
  Built:            Wed Mar  3 16:51:28 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e.m
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Output of docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-tp-docker)

Server:
 Containers: 10
  Running: 1
  Paused: 0
  Stopped: 9
 Images: 61
 Server Version: 20.10.5
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e.m
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.11.11-arch1-1
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 7.712GiB
 Name: homepc
 ID: 67YO:62DZ:3NIF:TZT3:HTXP:BU6I:YBR3:XETA:7YCB:YGNN:MV6Q:QYN4
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  https://mirror.gcr.io/
 Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

x86_64 Intel hw, Arch Linux 5.11.11-arch1-1

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 9
  • Comments: 27

Most upvoted comments

> remove ro flag from docker run arg -v /sys/fs/cgroup:/sys/fs/cgroup:ro

It didn’t help. I’m running Ubuntu 21.10 (Impish Indri).

> For reference, it is possible with namespace isolation.

@skast96, it didn’t help either. I edited /etc/docker/daemon.json:

{"userns-remap": "default"}

Restarted docker. The dockremap user was created, as were the entries in /etc/sub{uid,gid}. The /var/lib/docker/100000.100000 dir was created. docker image ls produced no output. Then:

$ docker run -it --tmpfs /tmp --tmpfs /run --tmpfs /run/lock -v /sys/fs/cgroup:/sys/fs/cgroup jrei/systemd-ubuntu
systemd 245.4-4ubuntu3.16 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.

Welcome to Ubuntu 20.04.4 LTS!

Set hostname to <1bdd4443336d>.
Failed to create /init.scope control group: Permission denied
Failed to allocate manager object: Permission denied
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
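
To double-check that the remapping described above actually took effect, the subordinate ID ranges can be read back. A rough sketch; subid_range is a hypothetical helper name, and dockremap is the user Docker creates for "userns-remap": "default":

```shell
#!/bin/sh
# Print "<start> <count>" for the first range that the given user owns in a
# subuid/subgid-style file read from stdin (format: name:start:count).
subid_range() {
    awk -F: -v user="$1" '$1 == user { print $2, $3; exit }'
}

for f in /etc/subuid /etc/subgid; do
    if [ -r "$f" ]; then
        echo "$f: $(subid_range dockremap < "$f")"
    fi
done
```

The start of the range is what appears in the /var/lib/docker/<uid>.<gid> directory name, 100000.100000 in the run above.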

So the only workaround is supposedly to switch to the cgroup v1 mode (systemd.unified_cgroup_hierarchy=0):

  • /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="systemd.unified_cgroup_hierarchy=0"
  • update-grub
  • reboot
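
After the reboot it may be worth confirming that the parameter reached the kernel command line. A small sketch; wants_cgroup_v1 is a hypothetical helper, and it only looks for the exact spelling used above:

```shell
#!/bin/sh
# Succeed if the given kernel command line contains the parameter that asks
# systemd for the legacy (v1) cgroup hierarchy, spelled exactly as above.
wants_cgroup_v1() {
    case " $1 " in
        *" systemd.unified_cgroup_hierarchy=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}

cmdline=$(cat /proc/cmdline 2>/dev/null)
if wants_cgroup_v1 "$cmdline"; then
    echo "cgroup v1 requested on the kernel command line"
else
    echo "systemd.unified_cgroup_hierarchy=0 not found on the kernel command line"
fi
```

If the parameter is missing, the GRUB configuration was probably not regenerated or a different boot entry was used.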

UPD: --cgroupns=host combined with -v /sys/fs/cgroup:/sys/fs/cgroup (without :ro) also works, e.g.:

$ docker run -it --cgroupns=host --tmpfs /tmp --tmpfs /run --tmpfs /run/lock \
    -v /sys/fs/cgroup:/sys/fs/cgroup jrei/systemd-ubuntu

I have discovered two additional workarounds for this issue that effectively retain all features of unified cgroupv2 while maintaining security - no need for the --privileged flag and no access to the root of cgroupv2 hierarchy:

  1. Use the --cgroupns host Docker option and a cgroupv2 sub-hierarchy volume binding for the container. Here is an example command:
# docker run --rm --name freeipa -it --read-only --security-opt seccomp=unconfined --hostname freeipa.corp --init=false --cgroupns host -v /sys/fs/cgroup/freeipa.scope:/sys/fs/cgroup:rw freeipa/freeipa-server:almalinux-9
systemd 252-13.el9_2 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS -FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT -QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization container-other.
Detected architecture x86-64.
Initializing machine ID from random generator.
Queued start job for default target Minimal target for containerized FreeIPA server.
[..]

Not perfect, next option is better IMO.

  2. Mount /sys/fs/cgroup on the host without the nsdelegate mount option. Although there isn’t an explicit option to disable nsdelegate like nodiscard for discard (see link 1, link 2 for more information), there is a workaround. Simply run any container using Docker with the --cgroupns host option and without any cgroup volume bindings. For example:
# grep cgroup /proc/mounts 
cgroup2 /sys/fs/cgroup cgroup2 rw,seclabel,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0
# docker run --rm --cgroupns host ubuntu:latest echo done
done
# grep cgroup /proc/mounts 
cgroup2 /sys/fs/cgroup cgroup2 rw,seclabel,nosuid,nodev,noexec,relatime 0 0
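
The before/after comparison above can be scripted instead of read by eye. A minimal sketch, assuming the usual /proc/mounts field order (device, mount point, type, options); has_nsdelegate is a hypothetical helper name:

```shell
#!/bin/sh
# Read mount entries from stdin and print:
#   yes  - a cgroup2 mount lists nsdelegate among its options
#   no   - cgroup2 is mounted, but without nsdelegate
#   none - no cgroup2 mount was found
has_nsdelegate() {
    awk '$3 == "cgroup2" {
             found = 1
             n = split($4, opts, ",")
             for (i = 1; i <= n; i++)
                 if (opts[i] == "nsdelegate") deleg = 1
         }
         END {
             if (!found) print "none"
             else if (deleg) print "yes"
             else print "no"
         }'
}

if [ -r /proc/mounts ]; then
    echo "nsdelegate: $(has_nsdelegate < /proc/mounts)"
fi
```

Running it before and after the docker run --cgroupns host step shows whether the option was really dropped on a given host.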

After implementing these steps, you can run a container with Docker using --cgroupns private flag and volume binding of cgroupv2 sub-hierarchy. For example:

# docker run --rm --name freeipa -it --read-only --security-opt seccomp=unconfined --hostname freeipa.corp --init=false --cgroupns private -v /sys/fs/cgroup/freeipa.scope:/sys/fs/cgroup:rw freeipa/freeipa-server:almalinux-9
systemd 252-13.el9_2 running in system mode (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS -FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT -QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization container-other.
Detected architecture x86-64.
Initializing machine ID from random generator.
Queued start job for default target Minimal target for containerized FreeIPA server.
[..]

Please note that the information provided above applies specifically to CentOS Stream release 9 with kernel-ml-6.3.7-1.el9.elrepo, systemd-252.4-598.13.hs.el9 (Hyperscale SIG) and docker-ce-24.0.2-1 (systemd cgroup driver), although it may help in a wide range of other scenarios.

@marco-a-itl

This is the mount shown in the container:

# findmnt | grep cgroup
│ └─/sys/fs/cgroup      cgroup      cgroup2  rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot

This is ok (mode is rw). However I assume that you obtained this result with userns-remapping.

I think it should be possible to get the same result without such a daemon option, given the proper modifications to the docker engine, like podman does.

  1. Use the --cgroupns host Docker option and a cgroupv2 sub-hierarchy volume binding for the container. Here is an example command:

docker run --rm --name freeipa -it --read-only --security-opt seccomp=unconfined --hostname freeipa.corp --init=false --cgroupns host -v /sys/fs/cgroup/freeipa.scope:/sys/fs/cgroup:rw freeipa/freeipa-server:almalinux-9

Thanks. This helped me too in starting a docker container with systemd inside (Fedora 37 host with cgroupv2).

I needed to add to daemon.json (and create the dockuser user on the docker host):

{
    "userns-remap": "dockuser"
}

I left out some of the options you used though:

docker run \
-it \
--rm \
--name ubuntu_systemd_local \
--tmpfs /tmp \
--tmpfs /run \
--tmpfs /run/lock \
--cgroupns private \
ubuntu_systemd:local

I used this Dockerfile (I created encrypted_password with mkpasswd -m sha512crypt 'password'):

FROM ubuntu:22.04
ENV DEBIAN_FRONTEND noninteractive
RUN yes | unminimize && \
echo 'root:_encrypted_password_' | chpasswd -e && \
sed -i -e 's/archive.ubuntu/fi.archive.ubuntu/g' /etc/apt/sources.list && \
apt-get -y update && \
apt-get -y install apt-utils && \
apt-get -y install dialog && \
apt-get -y install iputils-ping bind9-host iproute2 netcat-openbsd && \
apt-get -y install systemd dbus dbus-user-session dbus-x11 dconf-cli && \
apt-get -y install vim less nmon glances iptraf-ng \
cifs-utils elinks elinks-data \
irssi lftp mc mc-data unrar nmap ctorrent iotop powertop \
w3m radvd caca-utils httpie jq firejail curl nmap stress-ng \
cksfv mtr htop smem gddrescue oidentd ntpdate sysfsutils \
cpulimit expect stress-ng pavucontrol rtorrent screen telnet \
cabextract youtube-dl sshuttle emacs nethogs alien \
exfatprogs p7zip mosh keepassxc virt-what fdisk && \
curl -s https://packagecloud.io/install/repositories/ookla/speedtest-cli/script.deb.sh | bash && \
apt-get -y install speedtest
STOPSIGNAL SIGRTMIN+3
CMD [ "/sbin/init" ]

@x-yuri the docker approach is not working that great tbh. It works with namespace isolation when you create an extra slice for docker and pass this slice to the docker run command like so:

docker run -it \
    --cgroup-parent=docker.slice \
    --cgroupns private \
    --tmpfs /tmp \
    --tmpfs /run \
    --tmpfs /run/lock \
    mySystemdImage:latest 

That kinda worked for me. However, our other containers stopped working with namespace isolation because they were not configured for it. That meant too much work just to run one container with systemd.

So I suggest you just install podman. I experienced no drawbacks on my Arch Linux with both docker and podman installed. Even the commands are the same. With podman you would start your systemd container like this:

podman run -it mySystemdImage:latest 

Is there already a fix for this?

For reference, it is possible with namespace isolation. https://docs.docker.com/engine/security/userns-remap/ Or simply install podman.

Same here. It was working with 247.