podman: After podman 2 upgrade, systemd fails to start in containers on cgroups v1 hosts

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

I was repeatedly building working containers with podman this morning when my OS (Ubuntu 20.04) notified me that podman 2.0 was available and I elected to install it.

Shortly afterward, I can no longer SSH to a newly built and launched container. I see this as output from podman container list -a:

CONTAINER ID  IMAGE                        COMMAND                                       CREATED         STATUS             PORTS                                             NAMES
0e7692779754  k8s.gcr.io/pause:3.2                                                       21 seconds ago  Up 17 seconds ago  127.0.0.1:2222->22/tcp, 127.0.0.1:3000->3000/tcp  505f2a3b385a-infra
537b8ed4db9c  localhost/devenv-img:latest  -c exec /sbin/init --log-target=journal 3>&1  20 seconds ago  Up 17 seconds ago                                                    devenv

This is frustrating: I don’t have any references to a container named “pause”, yet one is running and listening on the ports my container had published, while my container isn’t listening on any ports at all.

I read the podman 2.0 release notes and don’t see any notes about a related breaking change.

I searched the project for references to “infra containers” because I sometimes see that term mentioned in error messages. I can find references to “infra containers” in the code, but not in the documentation.

They seem related to this issue, and it would be great if there were more accessible user documentation about “infra containers”.

Steps to reproduce the issue:

  1. podman run --systemd=always -it -p "127.0.0.1:2222:22" solita/ubuntu-systemd-ssh

Describe the results you received:

Initializing machine ID from random generator.
Failed to create /user.slice/user-1000.slice/session-8.scope/init.scope control group: Permission denied
Failed to allocate manager object: Permission denied
[!!!] Failed to allocate manager object.

Describe the results you expected:

For this test, the container should boot to the point where this line appears:

  [  OK  ] Reached target Multi-User System.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

podman version 2.0.0

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.15.0
  cgroupVersion: v1
  conmon:
    package: 'conmon: /usr/libexec/podman/conmon'
    path: /usr/libexec/podman/conmon
    version: 'conmon version 2.0.18, commit: '
  cpus: 4
  distribution:
    distribution: ubuntu
    version: "20.04"
  eventLogger: file
  hostname: mark-x1
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.4.0-37-generic
  linkmode: dynamic
  memFree: 1065062400
  memTotal: 16527003648
  ociRuntime:
    name: runc
    package: 'containerd.io: /usr/bin/runc'
    path: /usr/bin/runc
    version: |-
      runc version 1.0.0-rc10
      commit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
      spec: 1.0.1-dev
  os: linux
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  rootless: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: 'slirp4netns: /usr/bin/slirp4netns'
    version: |-
      slirp4netns version 1.0.0
      commit: unknown
      libslirp: 4.2.0
  swapFree: 19345408
  swapTotal: 1027600384
  uptime: 72h 32m 43.91s (Approximately 3.00 days)
registries:
  search:
  - docker.io
  - quay.io
store:
  configFile: /home/mark/.config/containers/storage.conf
  containerStore:
    number: 2
    paused: 0
    running: 2
    stopped: 0
  graphDriverName: vfs
  graphOptions: {}
  graphRoot: /home/mark/.local/share/containers/storage
  graphStatus: {}
  imageStore:
    number: 122
  runRoot: /run/user/1000/containers
  volumePath: /home/mark/.local/share/containers/storage/volumes
version:
  APIVersion: 1
  Built: 0
  BuiltTime: Wed Dec 31 19:00:00 1969
  GitCommit: ""
  GoVersion: go1.13.8
  OsArch: linux/amd64
  Version: 2.0.0

Package info (e.g. output of rpm -q podman or apt list podman):

podman/unknown,now 2.0.0~1 amd64 [installed]

Additional environment details (AWS, VirtualBox, physical, etc.):

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 148 (134 by maintainers)

Most upvoted comments

If reverting (#6569) solves your issue, you can force a new scope by wrapping podman with systemd-run, as in systemd-run --user --scope podman ....

In your case it will be: systemd-run --user --scope podman run -it jrei/systemd-ubuntu:16.04

OK, that is just really cool. This is how I would implement it.

As root (likely wrap this up in a systemd unit to run at boot time, maybe systemd-cgv1.service):

mkdir /sys/fs/cgroup/systemd
mount -t cgroup cgroup -o none,name=systemd,xattr /sys/fs/cgroup/systemd
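
For the record, a rough sketch of what that boot-time unit could look like. The systemd-cgv1.service name is just the suggestion above, and the unit body is an untested assumption, not something shipped by any package:

sudo tee /etc/systemd/system/systemd-cgv1.service >/dev/null <<'EOF'
[Unit]
Description=Mount a named systemd cgroup v1 hierarchy for containers
ConditionPathExists=!/sys/fs/cgroup/systemd

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/mkdir -p /sys/fs/cgroup/systemd
ExecStart=/bin/mount -t cgroup cgroup -o none,name=systemd,xattr /sys/fs/cgroup/systemd

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now systemd-cgv1.service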

And wrap this up in something like systemd-cgv1-user@.service:

mkdir -p /sys/fs/cgroup/systemd/user.slice/user-1000.slice
chown -R 1000:1000 /sys/fs/cgroup/systemd/user.slice/user-1000.slice

Since we’re making our own convention it could even use username instead of uid…
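
And a matching sketch for the per-user part, again treating the systemd-cgv1-user@.service name above purely as our own made-up convention; the instance name %i is the uid, and the unit body is an untested assumption:

sudo tee /etc/systemd/system/systemd-cgv1-user@.service >/dev/null <<'EOF'
[Unit]
Description=Create a named systemd cgroup v1 subtree for user %i
Requires=systemd-cgv1.service
After=systemd-cgv1.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/mkdir -p /sys/fs/cgroup/systemd/user.slice/user-%i.slice
ExecStart=/bin/chown -R %i:%i /sys/fs/cgroup/systemd/user.slice/user-%i.slice

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now systemd-cgv1-user@1000.service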

Note this essentially creates a kind of unified/hybrid setup… The host systemd doesn’t know about the v1 cgroup, which I suppose is fine.

$ cat /proc/self/cgroup
1:name=systemd:/
0::/user.slice/user-1000.slice/session-16.scope

So, as user - add this to your .bash_profile (or similar):

echo $BASHPID > "/sys/fs/cgroup/systemd/user.slice/user-$UID.slice/cgroup.procs"
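
If you would rather have that line fail quietly on machines where the named hierarchy hasn’t been prepared, a guarded variant (just a sketch, assuming the same paths as above) could be:

# only move this shell into the named systemd v1 cgroup if the subtree exists and is writable
cgdir="/sys/fs/cgroup/systemd/user.slice/user-$UID.slice"
if [ -w "$cgdir/cgroup.procs" ]; then
    echo $BASHPID > "$cgdir/cgroup.procs"
fi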

And then you’re good to go. podman run --rm -it --annotation run.oci.systemd.force_cgroup_v1=/sys/fs/cgroup jrei/systemd-ubuntu:16.04

That works for me. That’s pretty cool @giuseppe

I wasn’t aware of the issue, otherwise I could have documented it better; it seems that it is necessary to mount the named hierarchy on the host first:

# mkdir /sys/fs/cgroup/systemd && mount -t cgroup cgroup -o none,name=systemd,xattr /sys/fs/cgroup/systemd

Also, please enforce systemd mode with --systemd always unless your init binary is /sbin/init or systemd.

If you create a subcgroup like:

# mkdir /sys/fs/cgroup/systemd/1000
# chown 1000:1000 /sys/fs/cgroup/systemd/1000
# echo $ROOTLESS_TERMINAL_PROCESS_PID > /sys/fs/cgroup/systemd/1000/cgroup.procs

you’ll be able to use the feature as rootless too (user 1000 is assumed in my example above).
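
A rough way to check that it worked (assumption: run this from the shell whose PID you echoed into cgroup.procs above; the name=systemd line should now point at the /1000 subtree):

grep name=systemd /proc/self/cgroup
# expected, roughly: 1:name=systemd:/1000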

You need to be root. It is a privileged operation.

Known Workarounds

  1. Enable cgroupsv2. Update /etc/default/grub to add systemd.unified_cgroup_hierarchy=1 to GRUB_CMDLINE_LINUX_DEFAULT, then follow the standard procedure to update GRUB and reboot (see the sketch after this list).
  2. Don’t use images with old versions of systemd, such as Ubuntu 16.04.
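
A sketch of that standard GRUB procedure for workaround 1 on Ubuntu, assuming the stock /etc/default/grub layout and that the parameter hasn’t already been added:

# append the parameter to the default kernel command line, regenerate the GRUB config, reboot
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&systemd.unified_cgroup_hierarchy=1 /' /etc/default/grub
sudo update-grub
sudo reboot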

But to be clear, you tested this on Cgroups V2, and podman 1.9.3 can’t run Systemd from Ubuntu 16.04 on Cgroups V2 either.

Did you test 16.04 on cgroups 1?

I’ve shown that on cgroups v1 you CAN run 16.04 images and old systemd images. It requires the same workarounds as before (systemd-run or whatever) to work on podman v2.

So, ultimately:

  1. Enable cgroupsv2. In your case, you do that with the grub config; on FCOS, use kargs; etc. On F32 it’s already the default. Realize that when you enable cgroups v2 you lose the ability to run OLD systemd, in any form, in any container, unless @giuseppe’s patch to crun (dev) allows this. That said, you can happily use 16.04 for whatever you want. The limitation is that systemd didn’t get anything involving cgroups v2 until at least v233.

  2. Launch your podman in a proper systemd scope, using systemd-run --scope --user or systemd-run --user -P

  3. Manually specify cgroup-manager systemd:

(focal)mrwizard@FocalCG1Dev:~/src/podman
$ podman run --rm -it --cgroup-manager systemd jrei/systemd-ubuntu:16.04
systemd 229 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN)
Detected virtualization container-other.
Detected architecture x86-64.

Welcome to Ubuntu 16.04.6 LTS!

Set hostname to <91b5f3e21931>.
Failed to read AF_UNIX datagram queue length, ignoring: No such file or directory
Failed to install release agent, ignoring: Permission denied
[  OK  ] Created slice System Slice.
[  OK  ] Listening on Journal Socket.
[  OK  ] Reached target Slices.

If you want to be able to do sshd without systemd… you can: podman run --init --rm -it ubuntu:16.04 sh -c 'apt-get update && apt-get install -y openssh-server && /usr/sbin/sshd -D'

Obviously you’d do that in your build, not on the command line. You could also use podman exec instead of ssh. We don’t know your use case, so we can’t advise better.

NOW, if someone who knows more than me wants to figure out WHY --cgroup-manager systemd is required, why podman isn’t automatically detecting that as an option or using it… those are actionable.

I can also say that setting

[engine]

# Cgroup management implementation used for the runtime.
# Valid options "systemd" or "cgroupfs"
#
cgroup_manager = "systemd"

Makes no difference, and the only cgroup-related warning I see is:

time="2020-07-02T13:51:15-04:00" level=warning msg="Failed to add conmon to cgroupfs sandbox cgroup: error creating cgroup for blkio: mkdir /sys/fs/cgroup/blkio/libpod_parent: permission denied"

Which isn’t entirely unexpected on cgroups v1. But the big question is: WHY is podman using cgroupfs?
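
For what it’s worth, a quick sanity check that the setting landed in a file rootless podman actually reads (the paths below are the standard containers.conf locations, assumed here rather than shown anywhere in this thread):

grep -n cgroup_manager ~/.config/containers/containers.conf /etc/containers/containers.conf 2>/dev/null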

Cgroups output when running podman from console:

12:hugetlb:/
11:freezer:/
10:blkio:/user.slice
9:cpu,cpuacct:/user.slice
8:perf_event:/
7:memory:/user.slice/user-1000.slice/session-232.scope
6:cpuset:/
5:pids:/user.slice/user-1000.slice/session-232.scope
4:devices:/user.slice
3:rdma:/
2:net_cls,net_prio:/
1:name=systemd:/user.slice/user-1000.slice/session-232.scope
0::/user.slice/user-1000.slice/session-232.scope

File system permissions:

total 0
drwxr-xr-x  6 root    root    0 Jun 28 16:27 .
drwxr-xr-x  3 root    root    0 Jun 28 16:27 ..
-rw-r--r--  1 root    root    0 Jun 28 20:18 cgroup.clone_children
-rw-r--r--  1 root    root    0 Jun 28 20:18 cgroup.procs
-rw-r--r--  1 root    root    0 Jun 28 20:18 notify_on_release
drwxr-xr-x  2 root    root    0 Jun 28 20:10 session-232.scope
drwxr-xr-x  2 root    root    0 Jun 28 20:18 session-2.scope
-rw-r--r--  1 root    root    0 Jun 28 20:18 tasks
drwxr-xr-x 44 someuser someuser 0 Jun 28 20:04 user@1000.service
drwxr-xr-x  2 root    root    0 Jun 28 20:18 user-runtime-dir@1000.service

I think I figured out why we are seeing different behavior. My cgroups look like the following:

12:hugetlb:/
11:freezer:/
10:blkio:/user.slice
9:cpu,cpuacct:/user.slice
8:perf_event:/
7:memory:/user.slice/user-1000.slice/user@1000.service
6:cpuset:/
5:pids:/user.slice/user-1000.slice/user@1000.service
4:devices:/user.slice
3:rdma:/
2:net_cls,net_prio:/
1:name=systemd:/user.slice/user-1000.slice/user@1000.service/apps.slice/apps-org.gnome.Terminal.slice/vte-spawn-6773d329-9ee1-450e-ae44-d5e4810e64a2.scope/ef312c6459eb19dc7f99f918f0eb90c7a231c59f50e40e9141018baa13c48b35
0::/user.slice/user-1000.slice/user@1000.service/apps.slice/apps-org.gnome.Terminal.slice/vte-spawn-6773d329-9ee1-450e-ae44-d5e4810e64a2.scope

As you can see, I’m running my commands from a gnome/xterm session. Now, if I drop to console mode, I see exactly the same behavior:

systemd v243.8-1.fc31 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified)
Detected virtualization podman.
Detected architecture x86-64.

Welcome to Fedora 31 (Container Image)!

Set hostname to <9d16995357a6>.
Initializing machine ID from random generator.
Failed to create /user.slice/user-1000.slice/session-232.scope/init.scope control group: Permission denied
Failed to allocate manager object: Permission denied
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...

@skorhone Notice some differences between our cases:

  • podman 1.9.3 working: Detected virtualization docker.
  • podman 2.0.0 not working: Detected virtualization container-other.
  • podman 2.0.0 working: Detected virtualization podman.

Could that be related?