microk8s: Microk8s v1.29 snap installation failed on plain Debian 12.4

Summary

Over the last few days I noticed that the installation of MicroK8s v1.29/stable (6364) fails on a fresh (plain) Debian 12.4 system (tested on AWS EC2 with the default Debian 12 image provided by AWS). After a few tests I can summarize the following behavior:

  • A fresh installation of v1.29 (6364) does not come up:

admin@ip-172-31-16-112:~$ microk8s status
microk8s is not running. Use microk8s inspect for a deeper inspection.
admin@ip-172-31-16-112:~$ microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
sudo: unable to resolve host ip-172-31-16-112: Name or service not known
sudo: unable to resolve host ip-172-31-16-112: Name or service not known
sudo: unable to resolve host ip-172-31-16-112: Name or service not known
sudo: unable to resolve host ip-172-31-16-112: Name or service not known
sudo: unable to resolve host ip-172-31-16-112: Name or service not known
sudo: unable to resolve host ip-172-31-16-112: Name or service not known
Inspecting dqlite
  Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6364/var/kubernetes/backend/localnode.yaml': No such file or directory

Building the report tarball
  Report tarball is at /var/snap/microk8s/6364/inspection-report-20240110_102300.tar.gz

microk8s_1.29_6364-inspection-report-20240110_102300.tar.gz
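As an aside (not the failure itself), the repeated "sudo: unable to resolve host" messages above only mean the EC2 hostname is not locally resolvable; a minimal sketch to silence them, assuming the default hostname shown above:

echo "127.0.1.1 $(hostname)" | sudo tee -a /etc/hosts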

  • Refreshing the v1.28 (6089) instance to v1.29 (6364) works at first glance, but the inspect output does not look right:
admin@ip-172-31-18-155:~$ microk8s kubectl get all -A
NAMESPACE     NAME                                         READY   STATUS    RESTARTS   AGE
kube-system   pod/coredns-864597b5fd-k7hvt                 1/1     Running   0          2m29s
kube-system   pod/calico-kube-controllers-77bd7c5b-fp4zd   1/1     Running   0          2m29s

NAMESPACE     NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
default       service/kubernetes   ClusterIP   10.152.183.1    <none>        443/TCP                  2m35s
kube-system   service/kube-dns     ClusterIP   10.152.183.10   <none>        53/UDP,53/TCP,9153/TCP   2m32s

NAMESPACE     NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   daemonset.apps/calico-node   1         1         1       1            1           kubernetes.io/os=linux   2m34s

NAMESPACE     NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns                   1/1     1            1           2m32s
kube-system   deployment.apps/calico-kube-controllers   1/1     1            1           2m34s

NAMESPACE     NAME                                               DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-864597b5fd                 1         1         1       2m29s
kube-system   replicaset.apps/calico-kube-controllers-77bd7c5b   1         1         1       2m29s

admin@ip-172-31-18-155:~$ microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
Inspecting dqlite
  Inspect dqlite

Building the report tarball
  Report tarball is at /var/snap/microk8s/6364/inspection-report-20240110_103926.tar.gz

microk8s-1.28_6089-refreshed-1.29_6364-inspection-report-20240110_103926.tar.gz

  • The strangest part: when I removed the MicroK8s package via sudo snap remove --purge microk8s and installed v1.29 (6364) again, the (single-node) cluster seemed to work as expected, but the inspect output again did not look right:
admin@ip-172-31-18-155:~$ microk8s kubectl get all -A
NAMESPACE     NAME                                         READY   STATUS    RESTARTS   AGE
kube-system   pod/calico-node-bggsw                        1/1     Running   0          106s
kube-system   pod/coredns-864597b5fd-wzdz9                 1/1     Running   0          105s
kube-system   pod/calico-kube-controllers-77bd7c5b-vlk94   1/1     Running   0          105s

NAMESPACE     NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
default       service/kubernetes   ClusterIP   10.152.183.1    <none>        443/TCP                  111s
kube-system   service/kube-dns     ClusterIP   10.152.183.10   <none>        53/UDP,53/TCP,9153/TCP   109s

NAMESPACE     NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   daemonset.apps/calico-node   1         1         1       1            1           kubernetes.io/os=linux   111s

NAMESPACE     NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns                   1/1     1            1           110s
kube-system   deployment.apps/calico-kube-controllers   1/1     1            1           111s

NAMESPACE     NAME                                               DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-864597b5fd                 1         1         1       106s
kube-system   replicaset.apps/calico-kube-controllers-77bd7c5b   1         1         1       106s

admin@ip-172-31-18-155:~$ microk8s inspect
Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
sudo: unable to resolve host ip-172-31-18-155: Name or service not known
Inspecting dqlite
  Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6364/var/kubernetes/backend/localnode.yaml': No such file or directory

Building the report tarball
  Report tarball is at /var/snap/microk8s/6364/inspection-report-20240110_104641.tar.gz

microk8s-reinstall-1.29_6364-inspection-report-20240110_104641.tar.gz

What Should Happen Instead?

I hope somebody on the development team can find the reason for this behavior. My guess is that the v1.28 installation sets up something on the host system that the v1.29 installation fails to set up, and that this is not removed during the snap remove --purge process (which would explain why the reinstall after a refresh works).
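One way to test this hypothesis would be to look for host-side leftovers after the purge; the commands below are a sketch, and the Calico interface names are illustrative guesses rather than confirmed causes:

sudo snap remove --purge microk8s
ls /var/snap/microk8s 2>/dev/null || echo "snap data removed"
ip link show | grep -E 'cali|vxlan' || echo "no leftover Calico interfaces"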

Reproduction Steps

Explained above (incl. inspection tarballs)

If anything is still unclear, I will try to answer your questions. Thanks!

About this issue

  • State: open
  • Created 6 months ago
  • Reactions: 2
  • Comments: 16 (3 by maintainers)

Most upvoted comments

Let me share some news on this issue regarding the problem we initially reported. Maybe somebody else can explain more about the findings we have made.

It might be an issue with the 6.1 kernel used on Debian 12 (last tried with the latest version, 6.1.69). When we manually upgraded the kernel to 6.5.10, we could install MicroK8s 1.29 latest/edge (6469) without problems, and all expected pods came up properly.

Let me attach the inspect files, just to compare if required:

  • Kernel 6.1.69: debian12.4_kernel6.1.69-1_inspection-report-20240130_125914.tar.gz
  • Kernel 6.5.10: debian12.4_kernel6.5.10-1~bpo12+1_inspection-report-20240130_130629.tar.gz
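For reference, the ~bpo12 suffix in the second report name indicates the Debian 12 (bookworm) backports kernel; a sketch of how such a kernel upgrade can be done, assuming backports is not yet enabled:

# echo 'deb http://deb.debian.org/debian bookworm-backports main' > /etc/apt/sources.list.d/backports.list
# apt update
# apt install -t bookworm-backports linux-image-amd64
# reboot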

Additionally (with a link to @neoaggelos' detailed information), we have also figured out that the reason on kernel 6.1.x might be a delegation issue. If we add the following before installing MicroK8s, the initial problem does not occur:

# mkdir -p /etc/systemd/system/user@.service.d
# cat > /etc/systemd/system/user@.service.d/delegate.conf << EOF
[Service]
Delegate=cpu cpuset io memory pids
EOF
# systemctl daemon-reload
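A quick way to check whether the delegation is in effect (a sketch, assuming cgroup v2 and that uid 1000 is the logged-in user):

# systemctl show user@1000.service | grep -i '^Delegate'
# cat /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.controllers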

(See the opencontainers cgroup v2 documentation on GitHub for background.) Let me also attach the inspect files with these settings: debian12.4_kernel6.1.69-1_inspection-report-20240130_135728.tar.gz

Hi @TecIntelli and other folks who are running into this, sorry for taking so long to check this.

This seems to be related to cgroups. I see the following in the error logs (and I can also reproduce this on Debian 12 systems):

Jan 10 10:16:39 ip-172-31-16-112 microk8s.daemon-kubelite[8441]: E0110 10:16:39.649969    8441 kubelet.go:1542] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"
Jan 10 10:16:39 ip-172-31-16-112 systemd[1]: snap.microk8s.daemon-kubelite.service: Main process exited, code=exited, status=1/FAILURE
Jan 10 10:16:39 ip-172-31-16-112 systemd[1]: snap.microk8s.daemon-kubelite.service: Failed with result 'exit-code'.
Jan 10 10:16:39 ip-172-31-16-112 systemd[1]: snap.microk8s.daemon-kubelite.service: Consumed 6.137s CPU time.
Jan 10 10:16:39 ip-172-31-16-112 systemd[1]: snap.microk8s.daemon-kubelite.service: Scheduled restart job, restart counter is at 1.
Jan 10 10:16:39 ip-172-31-16-112 systemd[1]: Stopped snap.microk8s.daemon-kubelite.service - Service for snap application microk8s.daemon-kubelite.
Jan 10 10:16:39 ip-172-31-16-112 systemd[1]: snap.microk8s.daemon-kubelite.service: Consumed 6.137s CPU time.
Jan 10 10:16:39 ip-172-31-16-112 systemd[1]: Started snap.microk8s.daemon-kubelite.service - Service for snap application microk8s.daemon-kubelite.
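As a side note (not from the logs above), a quick way to check which cgroup hierarchy the host uses, since the Delegate= workaround discussed earlier applies to cgroup v2:

stat -fc %T /sys/fs/cgroup    # prints cgroup2fs on cgroup v2, tmpfs on the legacy layout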

One workaround is to disable per-QOS cgroups (and node allocatable enforcement) on the kubelet with:

echo '
--cgroups-per-qos=false
--enforce-node-allocatable=""
' | sudo tee -a /var/snap/microk8s/current/args/kubelet

sudo snap restart microk8s.daemon-kubelite

Afterwards, MicroK8s should come up. We will take this back to see what the root cause is and what sort of mitigations we could apply to prevent this in out-of-the-box deployments.
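To verify the recovery after the restart (a quick check, not part of the original comment):

microk8s status --wait-ready
sudo journalctl -u snap.microk8s.daemon-kubelite -n 20 --no-pager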

I have the same problem on Ubuntu Server 22.04 after snap install microk8s --classic --channel=1.29/stable.

The file localnode.yaml does not exist:

microk8s inspect

Inspecting dqlite
  Inspect dqlite
cp: cannot stat '/var/snap/microk8s/6364/var/kubernetes/backend/localnode.yaml': No such file or directory
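A quick way to see what the dqlite backend directory actually contains (a sketch; the current symlink points at the active snap revision):

sudo ls -la /var/snap/microk8s/current/var/kubernetes/backend/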

I spontaneously ran into the same issue on an HA cluster running Ubuntu 22.04 LTS (Hetzner cloud servers) and MicroK8s 1.29/stable. First, I spotted weird behavior on one faulty node of the HA cluster (containers stayed in the Terminating state; no deletion was possible). After rebooting, I observed that the microk8s status output was flaky, alternating between proper status reports, “not running” messages, and “random” execution errors. No issues were reported when running microk8s inspect.

At some point I realized that journalctl -f -u snap.microk8s.daemon-kubelite was logging a lot, with some errors hidden in between. It took me a while to understand that microk8s.daemon-kubelite was actually not starting (which was sadly not reflected by microk8s inspect):

Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"
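A way to surface such errors in the otherwise noisy log (a sketch; the pattern matches the kubelet's Exxxx error prefix):

sudo journalctl -u snap.microk8s.daemon-kubelite --no-pager | grep -E ': E[0-9]{4}|Failed'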

After setting up a clean new machine with Ubuntu 22.04.3 LTS and 1.29/stable (single node), I ran into the same non-starting microk8s.daemon-kubelite. On top of that, I got the missing localnode.yaml error reported by @Zvirovyi earlier.

For now, I managed to restore the cluster by downgrading MicroK8s to v1.28.3:

snap refresh microk8s --classic --channel=1.28/stable
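To keep snapd from automatically refreshing back to a broken 1.29 build, the refresh can additionally be held (assuming snapd >= 2.58, which introduced --hold):

sudo snap refresh --hold microk8s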

PS: Adding and removing nodes from the HA cluster was very smooth at every stage, even with the “broken” 1.29/stable. Kudos to the maintainers!

Hi @TecIntelli, thanks a lot for looking deeper and coming up with a path towards a solution. It is still not too clear to me how we could handle this on the MicroK8s side; I do not think it's a good approach to mess with the system like this.

For me:

snap remove --purge microk8s
snap install microk8s --classic --channel=1.29/stable

root@microk8s-master:~# microk8s status
microk8s is not running. Use microk8s inspect for a deeper inspection.

The inspect log looks the same as your first inspect output.

I don't know what's going on.