k3s: k3s container v1.22 and newer fails on docker-desktop and k3d clusters
Environmental Info:

K3s Version:

```
k3s version v1.22.2+k3s2 (3f5774b4)
go version go1.16.8
```

Node(s) CPU architecture, OS, and Version:

```
Linux test-0 5.10.76-linuxkit #1 SMP Mon Nov 8 10:21:19 UTC 2021 x86_64 GNU/Linux
```

Cluster Configuration:

container k3s, single server, no agents
Describe the bug:
Hello! Thanks a lot for the great project! I’m one of the maintainers of vcluster, and we use k3s as the minimal control plane for our virtual cluster implementation. Unfortunately, k3s stopped working for us as of v1.22 (essentially every version released after PR #4086), emitting the following error on docker-desktop, kind, and k3d host clusters:
```
time="2022-01-06T10:30:21Z" level=fatal msg="failed to evacuate root cgroup: mkdir /sys/fs/cgroup/init: read-only file system"
```
It worked fine with earlier versions and works fine with vanilla k8s or k0s v1.22 containers.
We have a slightly special setup in which we run k3s without the agent and scheduler, and I’m not sure what exactly causes this error, as it works on GKE for example. Would it be possible to skip the root cgroup evacuation when the agent is not enabled, to match the behaviour of older versions? If not, could a flag be introduced to disable it?
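For what it’s worth, whether the error appears seems to depend on how the unified cgroup hierarchy is mounted inside the pod: if `cgroup2` is mounted `ro`, the `mkdir /sys/fs/cgroup/init` that evacuation performs fails with `EROFS`. Inside the pod you could check `grep cgroup2 /proc/mounts`; the snippet below parses a sample mount line (the line itself is an assumption modeled on the failing pods) the same way:

```shell
# Inside the failing pod you would run: grep cgroup2 /proc/mounts
# Here we parse a sample (assumed) mount line the way such a check would:
line='cgroup2 /sys/fs/cgroup cgroup2 ro,nosuid,nodev,noexec,relatime 0 0'
opts=$(echo "$line" | awk '{print $4}')   # 4th field = mount options
case ",$opts," in
  *,ro,*) verdict=read-only ;;  # mkdir /sys/fs/cgroup/init would fail with EROFS
  *)      verdict=read-write ;;
esac
echo "cgroup2 mount is $verdict"
```

On environments where this reports read-write (or where cgroups v1 is in use, as might be the case on GKE), the evacuation succeeds and k3s starts normally.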
Steps To Reproduce:
- Install docker-desktop, kind or equivalent v1.22 or higher host cluster
- Create a new k3s container within that host cluster:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - args:
    - server
    - --write-kubeconfig=/data/k3s-config/kube-config.yaml
    - --disable=traefik,servicelb,metrics-server,local-storage,coredns
    - --disable-network-policy
    - --disable-agent
    - --disable-scheduler
    - --disable-cloud-controller
    - --flannel-backend=none
    - --kube-controller-manager-arg=controllers=*,-nodeipam,-nodelifecycle,-persistentvolume-binder,-attachdetach,-persistentvolume-expander,-cloud-node-lifecycle
    - --service-cidr=10.96.0.0/12
    command:
    - /bin/k3s
    image: rancher/k3s:v1.22.2-k3s2
    name: k3s
```
This doesn’t work with v1.22 and newer, while it works with v1.21 (e.g. image rancher/k3s:v1.21.2-k3s1) and lower.
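For side-by-side testing, the version comparison can be scripted. The helper below is a sketch (the function name and file names are my own, and the flag list is a shortened subset of the full spec above) that emits the same minimal pod manifest for any image tag:

```shell
# Emit a minimal k3s pod manifest for a given image tag (shortened subset
# of the spec above; only the image line varies between the two versions).
make_manifest() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - name: k3s
    image: rancher/k3s:$1
    command: ["/bin/k3s"]
    args: ["server", "--disable-agent", "--disable-scheduler", "--disable-cloud-controller"]
EOF
}
make_manifest v1.21.2-k3s1 > k3s-old.yaml   # starts fine per the report
make_manifest v1.22.2-k3s2 > k3s-new.yaml   # fails with the cgroup error
# Apply with: kubectl apply -f k3s-new.yaml && kubectl logs -f test
diff k3s-old.yaml k3s-new.yaml || true      # only the image tag differs
```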
Expected behavior:
k3s container should be running without errors
Actual behavior:
k3s container fails with error:
```
time="2022-01-06T10:30:21Z" level=fatal msg="failed to evacuate root cgroup: mkdir /sys/fs/cgroup/init: read-only file system"
```
Backporting
- [x] Needs backporting to older releases
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 17 (9 by maintainers)
Commits related to this issue
- chore: bump k3s image version to pick a bug fix New version contain a fix for https://github.com/k3s-io/k3s/issues/4873 — committed to matskiv/vcluster by matskiv 2 years ago
Not mucking about with cgroups when not running the kubelet seems reasonable; I’ll take a shot at that for the next patch release.
Validated in all of v1.20.15-rc1+k3s1, v1.21.9-rc1+k3s1, v1.22.6-rc1+k3s1, and v1.23.2-rc1+k3s1; the test pod is up and running successfully, as expected.

@iwilltry42 thanks so much for your reply and investigation! Our use case is a little different from the default k3d setup: we do not run k3s in docker directly, but instead use an already existing k3d, docker-desktop, or kind Kubernetes cluster and schedule a stripped-down k3s pod in it (basically just the data store, API server, and controller manager, with everything else such as the scheduler and agent disabled). The problem is that this pod fails to start, because k3s tries to evacuate the cgroups on a read-only file system while running in non-privileged mode (which for our use case shouldn’t be necessary at all, I guess). So it’s basically Kubernetes within Kubernetes instead of Kubernetes within docker. To reproduce the problem, you can set up k3d as you did and then schedule a pod like the one above in it, which should fail with the error message quoted earlier (though mysteriously this works on some systems, for example GKE or older docker-desktop versions, which might not use cgroups v2).
We then have an additional component that syncs pods created in that minimal control plane to the actual Kubernetes cluster, which schedules them on the real nodes; the k3s pod itself cannot schedule any pods, as no real nodes are joined. The advantage is that you can essentially split up the control plane and give users access to a fully working Kubernetes cluster with CRDs, webhooks, ClusterRoles, etc., while the actual workloads are synced to the same namespace on the host cluster. This is great for multi-tenancy scenarios where you want to give different people limited access to the host Kubernetes cluster.
It’s a bit of a hack, but since cgroup evacuation only runs if k3s is PID 1, you could try running /bin/k3s from /bin/sh. Note that the `&& true` is necessary to prevent /bin/sh from just exec’ing k3s, which would leave it as PID 1 again.

Not sure if it helps, but let me just drop some info here:
- `k3d cluster create test-cluster --image rancher/k3s:v1.22.5-k3s1` works without problems for me
- k3d runs its node containers in `privileged` mode by default
- k3d uses `docker-init` and a custom entrypoint (e.g. as mentioned for the cgroup evacuation)

UPDATE 1: Just tested with Docker for Desktop on Windows 10 without a problem 🤔
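The PID-1 detail behind the `/bin/sh` wrapper suggested above can be demonstrated without k3s at all. This is a plain-shell sketch: when `sh -c` is handed a single command it simply exec()s it, so the command inherits the shell’s PID (PID 1 in a container), whereas a trailing `&& true` forces the shell to fork and the command gets a PID of its own:

```shell
# exec: the inner shell replaces the outer one, so both report the same PID.
pids_exec=$(sh -c 'echo $$; exec sh -c "echo \$\$"' | sort -u | wc -l)

# fork: '&& true' means the outer shell needs the command's exit status,
# so it must fork a child; the child gets its own, different PID.
pids_fork=$(sh -c 'echo $$; sh -c "echo \$\$" && true' | sort -u | wc -l)

echo "distinct PIDs seen: exec=$pids_exec fork=$pids_fork"
```

With the fork variant, k3s is no longer PID 1 inside the container, so the cgroup evacuation path is skipped entirely.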