sysbox: Not able to run `kind` inside a Sysbox-based pod without cgroups v2

I didn’t find this listed as a known limitation in the docs, though maybe I didn’t check well enough.

In our CI pipelines, we often use kind to spin up ephemeral clusters for testing. When I try to run kind inside a Sysbox-based pod, it fails with the following log:

$ kind create cluster -v 4
Creating cluster "kind" ...
DEBUG: docker/images.go:58] Image: kindest/node:v1.21.1@sha256:69860bda5563ac81e3c0057d654b5253219618a22ec3a346306239bba8cfa1a6 present locally
 ✓ Ensuring node image (kindest/node:v1.21.1) 🖼
 ✓ Preparing nodes 📦  
 ✗ Writing configuration 📜 
ERROR: failed to create cluster: failed to generate kubeadm config content: failed to get kubernetes version from node: failed to get file: command "docker exec --privileged kind-control-plane cat /kind/version" failed with error: exit status 1
Command Output: Error response from daemon: Container cd764937076909061269b29ad895740de3eb0c8c0299354678789564ab6276a9 is not running
Stack Trace: 
sigs.k8s.io/kind/pkg/errors.WithStack
        sigs.k8s.io/kind/pkg/errors/errors.go:59
sigs.k8s.io/kind/pkg/exec.(*LocalCmd).Run
        sigs.k8s.io/kind/pkg/exec/local.go:124
sigs.k8s.io/kind/pkg/cluster/internal/providers/docker.(*nodeCmd).Run
        sigs.k8s.io/kind/pkg/cluster/internal/providers/docker/node.go:146
sigs.k8s.io/kind/pkg/exec.OutputLines
        sigs.k8s.io/kind/pkg/exec/helpers.go:81
sigs.k8s.io/kind/pkg/cluster/nodeutils.KubeVersion
        sigs.k8s.io/kind/pkg/cluster/nodeutils/util.go:35
sigs.k8s.io/kind/pkg/cluster/internal/create/actions/config.getKubeadmConfig
        sigs.k8s.io/kind/pkg/cluster/internal/create/actions/config/config.go:208
sigs.k8s.io/kind/pkg/cluster/internal/create/actions/config.(*Action).Execute.func1.1
        sigs.k8s.io/kind/pkg/cluster/internal/create/actions/config/config.go:90
sigs.k8s.io/kind/pkg/errors.UntilErrorConcurrent.func1
        sigs.k8s.io/kind/pkg/errors/concurrent.go:30
runtime.goexit
        runtime/asm_amd64.s:1371

This is what the dockerd log (inside the pod) shows during the failure:

time="2021-10-09T05:10:23.896709468Z" level=info msg="starting signal loop" namespace=moby path=/run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/7130619b171232e6d798c2f3142048550bbe8875800bd4b31bda3528742074bf pid=49556
INFO[2021-10-09T05:10:24.639521490Z] ignoring event                                container=7130619b171232e6d798c2f3142048550bbe8875800bd4b31bda3528742074bf module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
INFO[2021-10-09T05:10:24.639540842Z] shim disconnected                             id=7130619b171232e6d798c2f3142048550bbe8875800bd4b31bda3528742074bf
ERRO[2021-10-09T05:10:24.639608724Z] copy shim log                                 error="read /proc/self/fd/14: file already closed"
time="2021-10-09T05:10:24.974458264Z" level=info msg="starting signal loop" namespace=moby path=/run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/7130619b171232e6d798c2f3142048550bbe8875800bd4b31bda3528742074bf pid=49847
INFO[2021-10-09T05:10:25.674662462Z] ignoring event                                container=7130619b171232e6d798c2f3142048550bbe8875800bd4b31bda3528742074bf module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
INFO[2021-10-09T05:10:25.674823632Z] shim disconnected                             id=7130619b171232e6d798c2f3142048550bbe8875800bd4b31bda3528742074bf
ERRO[2021-10-09T05:10:25.674876910Z] copy shim log                                 error="read /proc/self/fd/14: file already closed"
ERRO[2021-10-09T05:10:26.059520310Z] Error setting up exec command in container kind-control-plane: Container 7130619b171232e6d798c2f3142048550bbe8875800bd4b31bda3528742074bf is not running 

More information:

# this is inside of the pod

jenkins@dind:~$ kind --version
kind version 0.11.1

jenkins@dind:~$ docker version
Client: Docker Engine - Community
 Version:           20.10.9
 API version:       1.41
 Go version:        go1.16.8
 Git commit:        c2ea9bc
 Built:             Mon Oct  4 16:08:29 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.9
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.8
  Git commit:       79ea9d3
  Built:            Mon Oct  4 16:06:37 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.11
  GitCommit:        5b46e404f6b9f661a205e28d59c982d3634148f8
 runc:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
jenkins@dind:~$ docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.6.3)
  compose: Docker Compose (Docker Inc., v2.0.1)
  scan: Docker Scan (Docker Inc., v0.8.0)

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 1
 Server Version: 20.10.9
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runtime.v1.linux runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 5b46e404f6b9f661a205e28d59c982d3634148f8
 runc version: v1.0.2-0-g52b36a2
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.4.0-70-generic
 Operating System: Ubuntu 20.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.34GiB
 Name: dind
 ID: 7FEQ:W3EN:IIFL:RIRP:4UXU:OUMB:KVLK:MOD7:MXEE:4QYK:3OA6:HPM3
 Docker Root Dir: /home/jenkins/agent/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

On the node:

$ sysbox-runc --version
sysbox-runc
	edition: 	Community Edition (CE)
	version: 	0.4.1
	commit: 	f3af483374ba58e9f09d97cfc19bfff0aa9796cc
	built at: 	Sat Oct  9 00:07:07 UTC 2021
	built by: 	Rodny Molina
	oci-specs: 	1.0.2-dev

This is the version built at https://github.com/nestybox/sysbox/issues/406#issuecomment-939190249.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 29 (26 by maintainers)

Most upvoted comments

Hi @felipecrs:

> Reading https://rootlesscontaine.rs/getting-started/common/cgroup2/, I think it would not be such a good idea to try to enable cgroup v2 on my Ubuntu 18.04 nodes. They recommend systemd version 244, while Ubuntu 18.04 ships 237.

This may be fine, let me explain.

The configuration of cgroups for a container (either cgroups v1 or v2) can be done by having the container runtime directly program the cgroup filesystem (e.g., /sys/fs/cgroup), or by having the container runtime request systemd to manage the cgroup filesystem on its behalf. These are known as the “cgroupfs” and “systemd” cgroup drivers respectively.

In general the systemd cgroup driver approach is preferred, because it creates a single entity in the host managing the cgroups (i.e. systemd). But the cgroupfs driver works fine too.

Currently, when you install Sysbox on a Kubernetes cluster with sysbox-deploy-k8s, it also installs CRI-O as the runtime and configures it with the cgroupfs driver. In other words, systemd is not managing cgroups for the containers (though it does still manage cgroups for systemd services).
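
For reference, that driver choice lives in CRI-O’s config file. A minimal illustrative excerpt of the setting described above (the real file has many more options):

# /etc/crio/crio.conf
[crio.runtime]
# program /sys/fs/cgroup directly instead of asking systemd to do it
cgroup_manager = "cgroupfs"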

The fact that systemd is not managing the cgroups for the containers, coupled with the fact that systemd v244 is only needed for cgroup delegation (e.g., to allow containers to manage a cgroup subhierarchy), means that you should be able to configure cgroups v2 on your Ubuntu 18.04 hosts without a problem.

In the near future, we will likely add logic to sysbox-deploy-k8s to determine the version of systemd on the host and, based on that, select the best cgroup driver (i.e., cgroupfs or systemd). For hosts that carry systemd >= v244, we would enable the systemd driver. Otherwise we would keep the cgroupfs approach.
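
Purely as an illustration of that selection logic (this is not actual sysbox-deploy-k8s code), the check could look something like:

# pick the cgroup driver based on the host's systemd version (sketch)
systemd_ver=$(systemctl --version | awk 'NR==1 {print $2}')
if [ "$systemd_ver" -ge 244 ]; then
  cgroup_driver=systemd
else
  cgroup_driver=cgroupfs
fi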

Hope this clarifies.

Awesome. Yes, agreed.

Hi @felipecrs,

FYI, we updated the nestybox/kindestnode images to relax the cgroup v2 requirement. See the Dockerfile here.

This is a temporary work-around while we work to relax the cgroup v2 check on the official kind images (i.e., kindest/node).

Given that the cgroup v2 requirement when running in a user-ns came from KinD, that KinD is about to update the official image to relax this requirement, and that the work-around described above is available, I’ll close this issue.

Please re-open if you disagree.

Got it. Thanks!

Hi @felipecrs:

Those steps to enable cgroup v2 on your host look fine.

> I intentionally skipped “Enabling CPU, CPUSET, and I/O delegation” because I believe Sysbox won’t require it, as Sysbox itself runs as root. I would appreciate your feedback on this decision.

Since you have systemd < v244, it’s best not to enable cpu/cpuset/io delegation. In any case, Sysbox won’t use it right now because it will manage the cgroups v2 directly via /sys/fs/cgroup (i.e., cgroupfs driver).

Once you have a host with systemd >= v244, then you should enable cpu/cpuset/io delegation. This way, Sysbox may use the systemd cgroup driver (the preferred approach going forward).
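
For reference, the delegation step in the linked rootlesscontaine.rs guide boils down to a systemd drop-in along these lines (again, only worth applying once the host is on systemd >= v244):

# delegate the cpu/cpuset/io controllers to user sessions
sudo mkdir -p /etc/systemd/system/user@.service.d
cat <<EOF | sudo tee /etc/systemd/system/user@.service.d/delegate.conf
[Service]
Delegate=cpu cpuset io memory pids
EOF
sudo systemctl daemon-reload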

Hope that helps!

@ctalledo thanks a bunch for the detailed answer. I went ahead and enabled cgroup v2 on one of my nodes for testing purposes. To enable it, I did the following (the resulting configuration and a quick check are shown after the list):

  1. Add systemd.unified_cgroup_hierarchy=1 to /etc/default/grub in GRUB_CMDLINE_LINUX
  2. sudo update-grub
  3. sudo reboot
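
For anyone following along, the resulting GRUB entry looks roughly like this (any existing flags in GRUB_CMDLINE_LINUX on your system may differ):

# /etc/default/grub
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1"

# after update-grub and reboot, this prints "cgroup2fs" when cgroup v2 is active
$ stat -fc %T /sys/fs/cgroup
cgroup2fs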

I intentionally skipped “Enabling CPU, CPUSET, and I/O delegation” because I believe Sysbox won’t require it, as Sysbox itself runs as root. I would appreciate your feedback on this decision.

And now, as expected, kind create cluster is working. I’ll just have to confirm with my IT department that I’m allowed to make this change (because otherwise it would get reverted in the next scheduled system patch, even though I have root access on the node).


@rodnymolina I continued the discussion in the PR suggestion.

@aojea, thanks for joining our conversation, appreciate your feedback …

IMHO, the problem here is that KinD is coupling the semantics of rootless with those of unprivileged containers. A runtime could require full privileges to operate (i.e., run as root) and yet be able to create unprivileged containers through the use of user namespaces. This is the model in which Sysbox operates, similar to how docker behaves when running in userns-remap mode, or how LXC has done it since day one when creating unprivileged containers.

That is to say, KinD’s root-init-userns detection logic seems correct to me; the problem I see is the enforcement of cgroup v2 when a user-ns is active. Sysbox is capable of enforcing cgroup v1 limits, so I don’t see why cgroup v2 must be required when a user-ns is detected.
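
For comparison, the docker userns-remap mode mentioned above is enabled through the daemon configuration; a minimal /etc/docker/daemon.json for it looks like this ("default" tells dockerd to create and use the dockremap user):

{
  "userns-remap": "default"
}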

OK, it works with kind v0.10.0. It must have something to do with their adoption of cgroups v2, as referenced here and here.

$ kind create cluster
Creating cluster "kind" ...
⠈⠁ Ensuring node image (kindest/node:v1.20.2) 🖼 WARN[2021-10-09T05:47:47.292449520Z] reference for unknown type:                   digest="sha256:8f7ea6e7642c0da54f04a7ee10431549c0257315b3a634f6ef2fecaaedb19bab" remote="docker.io/kindest/node@sha256:8f7ea6e7642c0da54f04a7ee10431549c0257315b3a634f6ef2fecaaedb19bab"
 ✓ Ensuring node image (kindest/node:v1.20.2) 🖼 
⢆⡱ Preparing nodes 📦  time="2021-10-09T05:48:21.718270811Z" level=info msg="starting signal loop" namespace=moby path=/run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/7df7dd4c50f1fc346be29aeb4f1a8723a9613848539421d86d83e8374f53ec21 pid=17255
 ✓ Preparing nodes 📦  
 ✓ Writing configuration 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Thanks for using kind! 😊

It has nothing to do with the Kubernetes version, though.

OK, here is a better insight:

jenkins@dind:~$ docker run kindest/node:v1.20.7@sha256:cbeaf907fc78ac97ce7b625e4bf0de16e3ea725daf6b04f930bd14c67c671ff9
time="2021-10-09T05:37:41.783364868Z" level=info msg="starting signal loop" namespace=moby path=/run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/192bbf079c6ad7f6e3be91eb4de17e0ac903b9a7fa5bd1587ac8051c4e303650 pid=9647
INFO: running in a user namespace (experimental)
ERROR: UserNS: cgroup v2 needs to be enabled
INFO[2021-10-09T05:37:42.237079084Z] shim disconnected                             id=192bbf079c6ad7f6e3be91eb4de17e0ac903b9a7fa5bd1587ac8051c4e303650
INFO[2021-10-09T05:37:42.237111730Z] ignoring event                                container=192bbf079c6ad7f6e3be91eb4de17e0ac903b9a7fa5bd1587ac8051c4e303650 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
ERRO[2021-10-09T05:37:42.237135275Z] copy shim log                                 error="read /proc/self/fd/14: file already closed"

The logs are mixed with my dockerd’s output, sorry. The ones that start with ERRO[ or INFO[ come from dockerd.
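
As an aside, two standard ways to confirm which cgroup version the daemon inside the pod sees (the output below matches the docker info shown earlier):

$ docker info --format '{{.CgroupVersion}}'
1
$ stat -fc %T /sys/fs/cgroup
tmpfs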