k3s: k3s container v1.22 and newer fails on docker-desktop and k3d clusters

Environmental Info: K3s Version:

k3s version v1.22.2+k3s2 (3f5774b4)
go version go1.16.8

Node(s) CPU architecture, OS, and Version:

Linux test-0 5.10.76-linuxkit #1 SMP Mon Nov 8 10:21:19 UTC 2021 x86_64 GNU/Linux

Cluster Configuration:

container k3s, single server, no agents

Describe the bug: Hello! Thanks a lot for the great project! I’m one of the maintainers of vcluster, and we use k3s as a minimal control plane for our virtual cluster implementation. Unfortunately, k3s seems to have stopped working for us since v1.22 (essentially every version released after PR #4086), emitting the following error on docker-desktop, kind, and k3d host clusters:

time="2022-01-06T10:30:21Z" level=fatal msg="failed to evacuate root cgroup: mkdir /sys/fs/cgroup/init: read-only file system"

It worked fine with earlier versions and works fine with vanilla k8s or k0s v1.22 containers.

We have a somewhat special setup in which we run k3s without the agent and scheduler, and I’m not sure exactly what is causing this error, since it works on GKE for example. Would it be possible to skip the root cgroup evacuation when the agent is not enabled, in order to restore the behaviour of older versions? If not, could a flag be introduced to disable it?
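For illustration only, the requested behaviour could be expressed as a tiny decision function. This is a hypothetical Python sketch (the real k3s code is Go, and the function name and shape here are my own, not k3s code), combining the two conditions discussed in this thread: evacuation is only attempted when k3s runs as pid 1, and it only exists to serve the embedded kubelet, which --disable-agent turns off.

```python
# Hypothetical sketch of the requested guard (not actual k3s code, which is Go):
# root-cgroup evacuation only matters when k3s runs as pid 1 on a cgroup v2
# host, and it only exists to give the embedded kubelet a writable cgroup,
# so --disable-agent could skip it entirely.

def should_evacuate_root_cgroup(pid: int, agent_enabled: bool) -> bool:
    if not agent_enabled:  # no kubelet -> nothing needs the evacuated cgroups
        return False
    return pid == 1        # evacuation is only attempted when k3s is the init process

print(should_evacuate_root_cgroup(pid=1, agent_enabled=False))  # False: skip, as requested
print(should_evacuate_root_cgroup(pid=1, agent_enabled=True))   # True: current behaviour
```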

Steps To Reproduce:

  • Install docker-desktop, kind or equivalent v1.22 or higher host cluster
  • Create a new k3s container within that host cluster:
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
    - args:
        - server
        - --write-kubeconfig=/data/k3s-config/kube-config.yaml
        - --disable=traefik,servicelb,metrics-server,local-storage,coredns
        - --disable-network-policy
        - --disable-agent
        - --disable-scheduler
        - --disable-cloud-controller
        - --flannel-backend=none
        - --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle
        - --service-cidr=10.96.0.0/12
      command:
        - /bin/k3s
      image: rancher/k3s:v1.22.2-k3s2
      name: k3s

This doesn’t work with v1.22 and newer, while it works with v1.21 (e.g. image rancher/k3s:v1.21.2-k3s1) and lower.

Expected behavior:

k3s container should be running without errors

Actual behavior:

k3s container fails with error:

time="2022-01-06T10:30:21Z" level=fatal msg="failed to evacuate root cgroup: mkdir /sys/fs/cgroup/init: read-only file system"
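For what it’s worth, the failure can be confirmed from inside the container by checking whether /sys/fs/cgroup is mounted read-only (e.g. by inspecting /proc/mounts). Below is a minimal Python sketch of that check; the sample line is illustrative of an unprivileged pod on a cgroup v2 host, not live output.

```python
# Minimal sketch: look up the mount options for a target path in
# /proc/mounts-style text. On an unprivileged pod over cgroup v2, the
# "ro" option on /sys/fs/cgroup is what makes mkdir .../init fail.

def mount_options(mounts_text: str, target: str):
    """Return the option list for the mount at target, or None if absent."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == target:
            return fields[3].split(",")
    return None

# Illustrative sample line (what an affected pod would show):
sample = "cgroup /sys/fs/cgroup cgroup2 ro,nosuid,nodev,noexec,relatime 0 0"
print("ro" in mount_options(sample, "/sys/fs/cgroup"))  # True -> evacuation must fail
```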

Backporting

  • [x] Needs backporting to older releases

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

Not mucking about with cgroups when not running the kubelet seems reasonable; I’ll take a shot at that for the next patch release.

Validated in all of v1.20.15-rc1+k3s1, v1.21.9-rc1+k3s1, v1.22.6-rc1+k3s1, and v1.23.2-rc1+k3s1

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
    - args:
        - server
        - --write-kubeconfig=/data/k3s-config/kube-config.yaml
        - --disable=traefik,servicelb,metrics-server,local-storage,coredns
        - --disable-network-policy
        - --disable-agent
        - --disable-scheduler
        - --disable-cloud-controller
        - --flannel-backend=none
        - --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle
        - --service-cidr=10.96.0.0/12
      command:
        - /bin/k3s
      image: rancher/k3s:v1.22.5-k3s1
      name: k3s
---
apiVersion: v1
kind: Pod
metadata:
  name: test-120
spec:
  containers:
    - args:
        - server
        - --write-kubeconfig=/data/k3s-config/kube-config.yaml
        - --disable=traefik,servicelb,metrics-server,local-storage,coredns
        - --disable-network-policy
        - --disable-agent
        - --disable-scheduler
        - --disable-cloud-controller
        - --flannel-backend=none
        - --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle
        - --service-cidr=10.96.0.0/12
      command:
        - /bin/k3s
      image: rancher/k3s:v1.20.15-rc1-k3s1
      name: k3s
---
apiVersion: v1
kind: Pod
metadata:
  name: test-121
spec:
  containers:
    - args:
        - server
        - --write-kubeconfig=/data/k3s-config/kube-config.yaml
        - --disable=traefik,servicelb,metrics-server,local-storage,coredns
        - --disable-network-policy
        - --disable-agent
        - --disable-scheduler
        - --disable-cloud-controller
        - --flannel-backend=none
        - --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle
        - --service-cidr=10.96.0.0/12
      command:
        - /bin/k3s
      image: rancher/k3s:v1.21.9-rc1-k3s1
      name: k3s
---
apiVersion: v1
kind: Pod
metadata:
  name: test-122
spec:
  containers:
    - args:
        - server
        - --write-kubeconfig=/data/k3s-config/kube-config.yaml
        - --disable=traefik,servicelb,metrics-server,local-storage,coredns
        - --disable-network-policy
        - --disable-agent
        - --disable-scheduler
        - --disable-cloud-controller
        - --flannel-backend=none
        - --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle
        - --service-cidr=10.96.0.0/12
      command:
        - /bin/k3s
      image: rancher/k3s:v1.22.6-rc1-k3s1
      name: k3s
---
apiVersion: v1
kind: Pod
metadata:
  name: test-123
spec:
  containers:
    - args:
        - server
        - --write-kubeconfig=/data/k3s-config/kube-config.yaml
        - --disable=traefik,servicelb,metrics-server,local-storage,coredns
        - --disable-network-policy
        - --disable-agent
        - --disable-scheduler
        - --disable-cloud-controller
        - --flannel-backend=none
        - --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle
        - --service-cidr=10.96.0.0/12
      command:
        - /bin/k3s
      image: rancher/k3s:v1.23.2-rc1-k3s1
      name: k3s
  • All pods, other than the original test pod, are up and running successfully, as expected:
# kubectl get nodes,pods -A -o wide
NAME                             STATUS   ROLES                  AGE   VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION            CONTAINER-RUNTIME
node/k3d-test-cluster-server-0   Ready    control-plane,master   13m   v1.22.5+k3s1   172.18.0.2    <none>        K3s dev    5.11.12-300.fc34.x86_64   containerd://1.5.8-k3s1

NAMESPACE     NAME                                         READY   STATUS             RESTARTS      AGE     IP           NODE                        NOMINATED NODE   READINESS GATES
kube-system   pod/coredns-85cb69466-qcs64                  1/1     Running            0             13m     10.42.0.4    k3d-test-cluster-server-0   <none>           <none>
kube-system   pod/local-path-provisioner-64ffb68fd-vkzzn   1/1     Running            0             13m     10.42.0.2    k3d-test-cluster-server-0   <none>           <none>
kube-system   pod/metrics-server-9cf544f65-w5fbw           1/1     Running            0             13m     10.42.0.3    k3d-test-cluster-server-0   <none>           <none>
kube-system   pod/helm-install-traefik-crd--1-jc8lh        0/1     Completed          0             13m     10.42.0.5    k3d-test-cluster-server-0   <none>           <none>
kube-system   pod/helm-install-traefik--1-kzcbp            0/1     Completed          2             13m     10.42.0.6    k3d-test-cluster-server-0   <none>           <none>
kube-system   pod/svclb-traefik-ft5p8                      2/2     Running            0             12m     10.42.0.7    k3d-test-cluster-server-0   <none>           <none>
kube-system   pod/traefik-786ff64748-fxj6f                 1/1     Running            0             12m     10.42.0.8    k3d-test-cluster-server-0   <none>           <none>
default       pod/test-122                                 1/1     Running            0             6m14s   10.42.0.14   k3d-test-cluster-server-0   <none>           <none>
default       pod/test-120                                 1/1     Running            0             6m14s   10.42.0.11   k3d-test-cluster-server-0   <none>           <none>
default       pod/test-123                                 1/1     Running            0             6m14s   10.42.0.12   k3d-test-cluster-server-0   <none>           <none>
default       pod/test-121                                 1/1     Running            0             6m14s   10.42.0.13   k3d-test-cluster-server-0   <none>           <none>
default       pod/test                                     0/1     CrashLoopBackOff   6 (46s ago)   6m14s   10.42.0.10   k3d-test-cluster-server-0   <none>           <none>
  • The original test pod has the expected error:
# k logs test
time="2022-01-24T18:29:11Z" level=fatal msg="failed to evacuate root cgroup: mkdir /sys/fs/cgroup/init: read-only file system"

@iwilltry42 thanks so much for your reply and investigation! Our use case differs a bit from the default k3d setup: we do not run k3s in docker directly. Instead, we take an already existing k3d, docker desktop, or kind Kubernetes cluster and schedule a new, limited k3s pod in it (basically just the data store, api server, and controller manager, with everything else such as the scheduler and agent disabled). The problem is that this pod fails to start, because k3s tries to evacuate the cgroups on a read-only file system; the pod runs in non-privileged mode, and for our use case the evacuation wouldn’t be necessary at all, I guess. So it’s basically Kubernetes within Kubernetes instead of Kubernetes within docker. To reproduce the problem, set up k3d as you did and then schedule a pod like the one below, which should fail with the above error message. (Mysteriously, on some systems this does work, for example GKE or older docker desktop versions, which might not use cgroups v2.)

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
    - args:
        - server
        - --write-kubeconfig=/data/k3s-config/kube-config.yaml
        - --disable=traefik,servicelb,metrics-server,local-storage,coredns
        - --disable-network-policy
        - --disable-agent
        - --disable-scheduler
        - --disable-cloud-controller
        - --flannel-backend=none
        - --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle
        - --service-cidr=10.96.0.0/12
      command:
        - /bin/k3s
      image: rancher/k3s:v1.22.5-k3s1
      name: k3s

We then have an additional component that syncs pods created in that minimal control plane to the actual Kubernetes cluster, which schedules them on real nodes; the k3s pod itself cannot schedule any pods, as no real nodes are joined. The advantage is that you can essentially split up the control plane and give users access to a fully working Kubernetes cluster with CRDs, webhooks, ClusterRoles, etc., while the actual workloads are synced to the same namespace on the host cluster. This is great for multi-tenancy scenarios where you want to give different people limited access to the host Kubernetes cluster.

It’s a bit of a hack, but since cgroup evacuation only runs if k3s is pid 1, you could try running /bin/k3s from /bin/sh:

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
    - args:
        - -c
        - /bin/k3s server
          --write-kubeconfig=/data/k3s-config/kube-config.yaml
          --disable=traefik,servicelb,metrics-server,local-storage,coredns
          --disable-network-policy
          --disable-agent
          --disable-scheduler
          --disable-cloud-controller
          --flannel-backend=none
          --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle
          --service-cidr=10.96.0.0/12
          && true
      command:
        - /bin/sh
      image: rancher/k3s:v1.22.5-k3s1
      name: k3s

Note that the && true is necessary to prevent /bin/sh from simply exec'ing k3s, which would leave it as pid 1 again.
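To see why the && true matters, here is a small Python demonstration of the principle (my own sketch, not from this thread): when sh -c is given a single simple command, most shells exec() it, so the child inherits the shell's pid (pid 1 in a container); appending && true forces the shell to fork instead.

```python
import subprocess
import sys

def run_under_sh(cmdline):
    """Run `sh -c cmdline`; return (pid of sh, pid the child reports for itself)."""
    p = subprocess.Popen(["sh", "-c", cmdline], stdout=subprocess.PIPE)
    out, _ = p.communicate()
    return p.pid, int(out)

# A child that prints its own pid (a stand-in for /bin/k3s in this demo).
report_pid = sys.executable + ' -c "import os; print(os.getpid())"'

sh_pid, lone_pid = run_under_sh(report_pid)                   # lone command: sh may exec() it
sh_pid2, forked_pid = run_under_sh(report_pid + " && true")   # && true forces a fork

print(sh_pid == lone_pid)     # usually True: the child took over the shell's pid
print(sh_pid2 == forked_pid)  # False: the child runs as a separate process
```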

Not sure if it helps, but let me just drop some info here:

  • k3d cluster create test-cluster --image rancher/k3s:v1.22.5-k3s1 works without problems for me
    • Docker 20.10.12 on Ubuntu 21.10 (kernel 5.15.8) with k3d v5.2.2
  • k3d runs K3s containers in privileged mode by default
  • k3d runs K3s containers with docker-init and a custom entrypoint (e.g. as mentioned for the cgroup evacuation):
    / # ps aux
    PID   USER     COMMAND
      1   0        /sbin/docker-init -- /bin/k3d-entrypoint.sh server --tls-san 0.0.0.0
      7   0        /bin/k3s server
     69   0        containerd
    

UPDATE 1: Just tested with Docker for Desktop on Windows 10 without a problem 🤔

  • Docker v20.10.11 (DfD v4.3.2)
    • Kernel 5.10.76-linuxkit
    • cgroup2/cgroupfs
  • k3d v5.2.2
  • k3s v1.22.5-k3s1