microk8s: Inconsistent control plane state

Summary

Something has gone wrong with (I think) the api-server or dqlite backend. Cluster state is inconsistent and replication controllers are not working.

I have 4 nodes (kube05-kube08) and earlier today I noticed that some of my workloads stopped. I saw that kube07 had some stuck processes and 100% CPU usage on it, so I attempted to drain it and reboot it. It never drained properly; all the pods that were on kube07 went to Unknown state and never got rescheduled.

I ran microk8s leave on kube07, but all of the kube07 resources remained in the cluster. I force-deleted all of those pods, but they were never rescheduled. I have now re-added kube07, but nothing is being scheduled on it, not even pods from DaemonSets.
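
Roughly, the sequence of commands was the following (the drain flags shown are the usual ones, not necessarily exactly what I typed; <pod>, <ip> and <token> are placeholders):

# drain the misbehaving node
kubectl drain kube07 --ignore-daemonsets --delete-emptydir-data

# on kube07 itself
microk8s leave

# force-delete the pods stuck in Unknown state
kubectl delete pod <pod> --grace-period=0 --force

# re-join: run add-node on an existing node, then the printed join command on kube07
microk8s add-node
microk8s join <ip>:25000/<token>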

For example, in the monitoring namespace there are 2 daemonsets and both are broken in different ways:

[jonathan@poseidon-gazeley-uk ~]$ kubectl get no
NAME     STATUS   ROLES    AGE   VERSION
kube07   Ready    <none>   74m   v1.26.0
kube05   Ready    <none>   14d   v1.26.0
kube08   Ready    <none>   14d   v1.26.0
kube06   Ready    <none>   14d   v1.26.0
[jonathan@poseidon-gazeley-uk ~]$ kubectl get daemonset
NAME                                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
prometheus-stack-prometheus-node-exporter   4         4         4       4            4           <none>          22h
prometheus-smartctl-exporter-0              4         4         2       4            2           <none>          22h
[jonathan@poseidon-gazeley-uk ~]$ kubectl get po
NAME                                                  READY   STATUS    RESTARTS   AGE
prometheus-smartctl-exporter-0-dsvxt                  1/1     Running   0          23h
prometheus-smartctl-exporter-0-nwppn                  1/1     Running   0          23h
prometheus-smartctl-exporter-0-vgxvg                  1/1     Running   0          23h
prometheus-stack-prometheus-node-exporter-2zpns       2/2     Running   0          23h
prometheus-stack-kube-state-metrics-9b97fb746-kc5rw   1/1     Running   0          23h
prometheus-stack-prometheus-node-exporter-275pd       2/2     Running   0          23h
prometheus-stack-prometheus-node-exporter-g5ftd       2/2     Running   0          23h

Only 3 prometheus-smartctl-exporter pods and 3 prometheus-stack-prometheus-node-exporter pods are actually running, even though there should be one per node (4 of each), and the Desired/Current/Ready figures reported by the DaemonSets don't match what is really running. Something is obviously seriously wrong with the Kubernetes control plane, but I can’t figure out what.
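
To confirm which node is missing the DaemonSet pods, something like this helps (the monitoring namespace is assumed from the output above):

# show which node each pod landed on
kubectl -n monitoring get pods -o wide

# check the DaemonSet's events for scheduling errors
kubectl -n monitoring describe daemonset prometheus-stack-prometheus-node-exporter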

What Should Happen Instead?

  • If a node fails, Kubernetes should reschedule pods on other nodes
  • The reported number of replicas should match the actual number of replicas
  • If the desired number of replicas is not the same as the current number, Kubernetes should schedule more

Reproduction Steps

I can’t consistently reproduce.

Introspection Report

microk8s inspect took a long time to run on all nodes, but reported no errors. It was kube07 that was removed and re-added to the cluster.

kube05-inspection-report-20230206_185244.tar.gz

kube06-inspection-report-20230206_185337.tar.gz

kube07-inspection-report-20230206_185248.tar.gz

kube08-inspection-report-20230206_185353.tar.gz

Can you suggest a fix?

It may be related to #2724, which I reported over a year ago and was never resolved.

Are you interested in contributing with a fix?

Yes, if I can. I feel like this is a serious and ongoing problem with MicroK8s.

Most upvoted comments

I still can’t figure out what’s going on with this. I don’t have control of my cluster; the dqlite master is running at 100% CPU. I am reluctant to do anything really invasive because I have OpenEBS/CStor volumes on my nodes, which obviously depend on the api-server for their own quorum.
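
For anyone following along, this is roughly how I’m watching it (assuming the snap’s default process and service names):

# see how much CPU the dqlite process is using
ps aux | grep [k]8s-dqlite

# follow the datastore logs
journalctl -fu snap.microk8s.daemon-k8s-dqlite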

My gut instinct is to destroy the cluster and recreate from scratch to restore service, but I had to do this 3 weeks ago too (https://github.com/canonical/microk8s/issues/3204#issuecomment-1398873620), and it just broke again on its own. I don’t have confidence in recent releases of MicroK8s to be stable and durable, especially with hyperconverged storage.

At the moment I’m not touching anything but I’m thinking of improving my off-cluster storage first, and rebuilding without using OpenEBS/CStor, for the security of my data.

I appreciate that MicroK8s is free and that people put a lot of hard work into it, but I feel that there are some serious problems that aren’t getting enough attention. Several issues about reliability problems, potentially related to dqlite, remain open.

@djjudas21 a MicroK8s cluster starts with a single node, to which you join other nodes to form a multi-node cluster. As soon as the cluster has three control plane nodes (nodes joined without the --worker flag) it becomes HA. HA means that the datastore is replicated across three nodes and will tolerate one node failure without disrupting its operation. In an HA cluster you have to make sure you have at least three control plane nodes running at all times; if you drop below two voting nodes the cluster will freeze because there is no quorum to replicate the datastore.

In this case (an HA cluster left with a single node) you can recover by following the instructions on the Recover from lost quorum page. This should explain why, when the cluster was left with one node (kube05), you were able to read its state but not change it: you could see the kubectl output but you could not do anything to the cluster. In such a frozen cluster the datastore logs (journalctl -fu snap.microk8s.daemon-k8s-dqlite) report errors like context deadline exceeded and database locked.
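
A quick way to check the HA state of a running cluster (recent MicroK8s releases print the high-availability flag and the datastore master/standby nodes in the status output) is:

# shows high-availability: yes/no and the datastore master/standby nodes
microk8s status

# datastore logs on the node in question
journalctl -fu snap.microk8s.daemon-k8s-dqlite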

As we said, the first three nodes in an HA cluster replicate the datastore. Any write to the datastore needs to be acknowledged by a majority of the voting nodes, the quorum. As we scale the control plane beyond three nodes, the next two nodes maintain a replica of the datastore and stand by in case a node from the first three departs. When a node misses some heartbeats it gets replaced, and the rest of the nodes agree on the role each node plays. These roles are stored in /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml. In the case of this cluster we have:

kube05
- Address: 192.168.0.57:19001
  ID: 3297041220608546238
  Role: 0

kube06
- Address: 192.168.0.58:19001
  ID: 796923914728165793
  Role: 0

kube08 
- Address: 192.168.0.56:19001
  ID: 3971249779275177663
  Role: 0

kube07
- Address: 192.168.0.59:19001
  ID: 16754272739278091721
  Role: 1

Node kube07 has role 1, meaning it is a standby (it replicates the datastore but does not participate in the quorum). The rest of the nodes have role 0, meaning they are voters and part of the quorum.
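
To check this on any control plane node (the path is the one mentioned above; roles are 0 = voter, 1 = standby):

sudo cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml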

The nodes cannot change IP. Even if they have crashed, left the cluster, or are misbehaving, they are still considered part of the cluster, because they may be rebooting, have some network connectivity problem, be going through maintenance, etc. If we know a node will not be coming back in its previous state, we must call microk8s remove-node to let the cluster know that this specific node has departed permanently. It is important to point out that if we, for example, “format and reinstall” a node and reuse the hostname and IP, this should be treated as a departed node, so it has to be removed with microk8s remove-node before we join it again.
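
In practice that looks roughly like this (run on any remaining control plane node, using the node name as shown by kubectl get no; --force is for a node that is already gone and cannot run microk8s leave itself):

# tell the cluster the node has departed permanently
microk8s remove-node kube07

# if the node is unreachable or already reinstalled
microk8s remove-node kube07 --force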

Of course, the above does not explain why the cluster initially started to freeze, but hopefully it gives you some insight into what is happening under the hood. I am not sure what happened around “Feb 06 00:43:43”. I will keep digging in the logs. A way to reproduce the issue would be ideal.

One question I have is why I see errors like this in the dmesg logs on a couple of nodes:

[1134653.101082] EXT4-fs error (device sdb): __ext4_find_entry:1658: inode #2: comm navidrome: reading directory lblock 0
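
That EXT4 error points at the underlying disk (sdb) rather than at Kubernetes itself; a quick way to check the disk’s health (assuming smartmontools is installed, and running fsck only while the filesystem is unmounted) would be:

# SMART health summary and full attributes
sudo smartctl -H /dev/sdb
sudo smartctl -a /dev/sdb

# filesystem check, only with the device unmounted
sudo e2fsck -f /dev/sdb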

@MathieuBordere

Thanks. I added the env vars and restarted dqlite at 10:07. Then I let it run for a minute or two and ran microk8s inspect.

inspection-report-20230210_101530.tar.gz