moby: Unable to run systemd in docker with ro /sys/fs/cgroup after systemd 248 host upgrade
BUG REPORT INFORMATION
I used to run docker containers with systemd as CMD without having to expose /sys/fs/cgroup as rw; this worked until the host was upgraded to systemd 248. Now it fails with:
Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
I opened a related issue on the systemd github repo: https://github.com/systemd/systemd/issues/19245
Workarounds
- boot host with systemd.unified_cgroup_hierarchy=0
- remove ro flag from docker run arg -v /sys/fs/cgroup:/sys/fs/cgroup:ro but this contaminates the host cgroup, causing e.g. docker top to get confused:
docker top debian-systemd
Error response from daemon: runc did not terminate successfully: container_linux.go:186: getting all container pids from cgroups caused: lstat /sys/fs/cgroup/system.slice/docker-817dfec3facbeb10c64d7b0fae478804b1177ae949e695e111b7c693569dd21a.scope: no such file or directory
: unknown
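Both workarounds depend on which cgroup layout the host booted with (the version output in this report shows default-hierarchy=hybrid for systemd 247 and default-hierarchy=unified for 248). A minimal sketch for checking this on a host follows; the function name `cgroup_mode` is made up for illustration, and the check is a heuristic based on the `cgroup.controllers` marker file rather than the detection systemd itself performs:

```shell
#!/bin/sh
# Report which cgroup layout is mounted under a given root
# (default: /sys/fs/cgroup). Heuristic, based on marker files:
#   unified: cgroup.controllers sits directly in the root (pure cgroup v2)
#   hybrid:  a v2 tree is mounted at <root>/unified next to v1 controllers
#   legacy:  neither marker is present (cgroup v1 only)
cgroup_mode() {
    root="${1:-/sys/fs/cgroup}"
    if [ -f "$root/cgroup.controllers" ]; then
        echo unified
    elif [ -f "$root/unified/cgroup.controllers" ]; then
        echo hybrid
    else
        echo legacy
    fi
}
```

Calling `cgroup_mode` with no argument inspects the live /sys/fs/cgroup; passing a directory makes it easy to try against a copy or a test fixture.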
Steps to reproduce the issue:
Dockerfile:
FROM debian:buster-slim
ENV container docker
ENV LC_ALL C
ENV DEBIAN_FRONTEND noninteractive
USER root
WORKDIR /root
RUN set -x
RUN apt-get update -y \
&& apt-get install --no-install-recommends -y systemd \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* \
&& rm -f /var/run/nologin
RUN rm -f /lib/systemd/system/multi-user.target.wants/* \
/etc/systemd/system/*.wants/* \
/lib/systemd/system/local-fs.target.wants/* \
/lib/systemd/system/sockets.target.wants/*udev* \
/lib/systemd/system/sockets.target.wants/*initctl* \
/lib/systemd/system/sysinit.target.wants/systemd-tmpfiles-setup* \
/lib/systemd/system/systemd-update-utmp*
VOLUME [ "/sys/fs/cgroup" ]
CMD ["/lib/systemd/systemd"]
Expected behaviour
systemd 247 (247.4-2-arch)
+PAM +AUDIT -SELINUX -IMA -APPARMOR +SMACK -SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid
$ docker build -t debian-systemd .
$ docker run -t --tmpfs /run --tmpfs /run/lock --tmpfs /tmp -v /sys/fs/cgroup:/sys/fs/cgroup:ro debian-systemd
systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.
Welcome to Debian GNU/Linux 10 (buster)!
Set hostname to <bf431002c7c1>.
Couldn't move remaining userspace processes, ignoring: Input/output error
File /lib/systemd/system/systemd-journald.service:12 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
[ OK ] Listening on Journal Socket.
...
[ OK ] Reached target Graphical Interface.
Actual behaviour
Since systemd v248
$ /lib/systemd/systemd --version
systemd 248 (248-3-arch)
+PAM +AUDIT -SELINUX -APPARMOR -IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP -SYSVINIT default-hierarchy=unified
$ docker build -t debian-systemd .
$ docker run -t --tmpfs /run --tmpfs /run/lock --tmpfs /tmp -v /sys/fs/cgroup:/sys/fs/cgroup:ro debian-systemd
systemd 241 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
Detected virtualization docker.
Detected architecture x86-64.
Welcome to Debian GNU/Linux 10 (buster)!
Set hostname to <fbb4fc19cb95>.
Failed to create /init.scope control group: Read-only file system
Failed to allocate manager object: Read-only file system
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
Output of docker version:
$ docker version
Client:
Version: 20.10.5
API version: 1.41
Go version: go1.16
Git commit: 55c4c88966
Built: Wed Mar 3 16:51:54 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server:
Engine:
Version: 20.10.5
API version: 1.41 (minimum version 1.12)
Go version: go1.16
Git commit: 363e9a88a1
Built: Wed Mar 3 16:51:28 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.4.4
GitCommit: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e.m
runc:
Version: 1.0.0-rc93
GitCommit: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Output of docker info:
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Build with BuildKit (Docker Inc., v0.5.1-tp-docker)
Server:
Containers: 10
Running: 1
Paused: 0
Stopped: 9
Images: 61
Server Version: 20.10.5
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e.m
runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
init version: de40ad0
Security Options:
seccomp
Profile: default
Kernel Version: 5.11.11-arch1-1
Operating System: Arch Linux
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 7.712GiB
Name: homepc
ID: 67YO:62DZ:3NIF:TZT3:HTXP:BU6I:YBR3:XETA:7YCB:YGNN:MV6Q:QYN4
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Registry Mirrors:
https://mirror.gcr.io/
Live Restore Enabled: false
Additional environment details (AWS, VirtualBox, physical, etc.):
x86_64 Intel hw, Arch Linux 5.11.11-arch1-1
About this issue
- State: open
- Created 3 years ago
- Reactions: 9
- Comments: 27
Commits related to this issue
- Don't rely on systemd to run minimega components See https://github.com/moby/moby/issues/42275. — committed to activeshadow/minimega by activeshadow 3 years ago
- Include option to use docker instead of vagrant In the past, I used vagrant -> libvirt to run acceptance test, but after upgrading the workstation to Ubuntu 22.04 LTS (jammy), this stopped working be... — committed to noris-network/puppet-exim by mleiner 2 years ago
- Fix cgroup error Take a look at https://github.com/moby/moby/issues/42275 for a more detailed description on the error when running `docker run...` — committed to captain-proton/docker-manjaro-ansible by captain-proton a year ago
- Create script to run inbm without configuring cloud interface - includes workaround for https://github.com/moby/moby/issues/42275 — committed to intel/intel-inb-manageability by gblewis1 a year ago
It didn’t help. I’m running Ubuntu 21.10 (Impish Indri).
@skast96, it didn’t help either. I edited `/etc/docker/daemon.json`:
Restarted `docker`. The `dockremap` user was created, as were the entries in `/etc/sub{uid,gid}`. The `/var/lib/docker/100000.100000` dir was created. `docker image ls` produced no output. Then:
So the only workaround is supposedly to switch to the cgroup v1 mode (`systemd.unified_cgroup_hierarchy=0`):
`/etc/default/grub`:
`update-grub`
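The cgroup v1 fallback amounts to adding systemd.unified_cgroup_hierarchy=0 to the kernel command line. A rough sketch of that edit is below. The helper name `add_kernel_param` is hypothetical, and it assumes the parameter belongs in `GRUB_CMDLINE_LINUX` defined on a single double-quoted line; which variable was actually edited in this comment is not shown.

```shell
#!/bin/sh
# Append a kernel parameter to GRUB_CMDLINE_LINUX in a grub defaults file,
# unless that line already contains it. Assumes the variable is defined on
# one line with double quotes (an assumption about the file layout).
add_kernel_param() {
    param="$1"
    file="${2:-/etc/default/grub}"
    # Already present on the GRUB_CMDLINE_LINUX line: nothing to do.
    if grep '^GRUB_CMDLINE_LINUX=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    # Re-emit the line with the parameter appended inside the quotes.
    sed -i "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"|GRUB_CMDLINE_LINUX=\"\1 ${param}\"|" "$file"
}
```

Run as root, `add_kernel_param systemd.unified_cgroup_hierarchy=0` edits /etc/default/grub in place; `update-grub` and a reboot are still needed for the change to take effect.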
UPD: And `--cgroupns=host` + `-v /sys/fs/cgroup:/sys/fs/cgroup` (w/o `:ro`
), e.g.:
I have discovered two additional workarounds for this issue that effectively retain all features of unified `cgroupv2` while maintaining security - no need for the `--privileged` flag and no access to the root of `cgroupv2` hierarchy:
- `--cgroupns host` Docker option and a `cgroupv2` sub-hierarchy volume binding for the container. Here is an example command:
Not perfect, next option is better IMO.
- `/sys/fs/cgroup` on the host without the `nsdelegate` mount option. Although there isn’t an explicit option to disable `nsdelegate` like `nodiscard` for `discard` (see link 1, link 2 for more information), there is a workaround. Simply run any container using Docker with the `--cgroupns host` option and without any `cgroup` volume bindings. For example:
After implementing these steps, you can run a container with Docker using `--cgroupns private` flag and volume binding of `cgroupv2` sub-hierarchy. For example:
Please note that the information provided above applies specifically to CentOS Stream release 9 with `kernel-ml-6.3.7-1.el9.elrepo`, `systemd-252.4-598.13.hs.el9` (Hyperscale SIG) and `docker-ce-24.0.2-1` (`systemd` cgroup driver), although it may help with a wide range of different scenarios.
Related: https://serverfault.com/questions/1053187/systemd-fails-to-run-in-a-docker-container-when-using-cgroupv2-cgroupns-priva/1054414#1054414
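Whether the host's cgroup2 mount currently carries `nsdelegate` can be read from /proc/mounts, which is a convenient way to compare the state before and after the step described above. A small sketch, with hypothetical helper names `cgroup2_mount_opts` and `has_nsdelegate`, assuming the standard /proc/mounts column layout (device, mount point, type, options):

```shell
#!/bin/sh
# Print the mount options of the cgroup2 filesystem mounted at a given
# mount point, read from a /proc/mounts-style file.
cgroup2_mount_opts() {
    mounts="${1:-/proc/mounts}"
    target="${2:-/sys/fs/cgroup}"
    awk -v t="$target" '$2 == t && $3 == "cgroup2" { print $4 }' "$mounts"
}

# Succeed (exit 0) if that cgroup2 mount lists the nsdelegate option.
has_nsdelegate() {
    cgroup2_mount_opts "$@" | tr ',' '\n' | grep -qx nsdelegate
}
```

On the host, `has_nsdelegate && echo "nsdelegate is set"` checks the live mount; both functions also accept an alternative mounts file as the first argument.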
This is ok (mode is rw). However I assume that you obtained this result with userns-remapping.
I think that it should be possible to have the same result without such daemon option, with the proper modifications on the docker engine, like podman does.
Thanks. This helped me too in starting a docker container with systemd inside (Fedora 37 host with cgroupv2).
I needed to add to daemon.json (and create the dockuser user on the docker host):
I left out some of the options you used though:
I used this Dockerfile: (I created encrypted_password with mkpasswd -m sha512crypt 'password')
@x-yuri the docker approach is not working that great tbh. It works with namespace isolation when creating an extra slice for docker and adding this slice to the `docker run` command like so:
That kinda worked for me. However, our other containers stopped working with namespace isolation because they were not configured for that. That meant too much work in order to run one container with systemd.
So I suggest you just install `podman`. I experienced no drawbacks on my Arch Linux when having both docker and podman installed. Even the commands are the same. You would start your systemd container like that below with podman.
For reference, it is possible with namespace isolation. https://docs.docker.com/engine/security/userns-remap/ Or simply install podman.
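With userns-remap, Docker maps container IDs into a subordinate range taken from /etc/subuid and /etc/subgid, and, as far as the linked documentation describes, keeps its data in a directory named after the first remapped UID and GID (the `/var/lib/docker/100000.100000` directory mentioned earlier in the thread fits that pattern). A small sketch for looking up the range assigned to a user; `subid_range` is a made-up helper name, and the file format assumed is the usual `name:start:count`:

```shell
#!/bin/sh
# Print "start count" for the first subordinate-ID range assigned to a user
# in an /etc/subuid- or /etc/subgid-style file (lines: name:start:count).
# Prints nothing if the user has no entry.
subid_range() {
    user="$1"
    file="${2:-/etc/subuid}"
    awk -F: -v u="$user" '$1 == u { print $2, $3; exit }' "$file"
}
```

For example, `subid_range dockremap` reads /etc/subuid, and `subid_range dockremap /etc/subgid` reads the group mapping.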
Same here. It was working with 247.