microk8s: Inconsistent control plane state
Summary
Something has gone wrong with (I think) the api-server or dqlite backend. Cluster state is inconsistent and replication controllers are not working.
I have 4 nodes (kube05-kube08) and earlier today I noticed that some of my workloads stopped. I saw that kube07 had some stuck processes and 100% CPU usage on it, so I attempted to drain it and reboot it. It never drained properly; all the pods that were on kube07 went to Unknown state and never got rescheduled.
I ran microk8s leave on kube07, but all of the kube07 resources remained in the cluster. I force-deleted those pods, but they were never rescheduled. I have since re-joined kube07, but nothing is being scheduled on it, not even pods from daemonsets that should run on every node.
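For reference, a minimal sketch of the drain / leave / remove / re-join sequence, assuming the standard MicroK8s workflow (illustrative only, not necessarily the exact commands that were run):

# Evict workloads from the failing node (run from a healthy node)
kubectl drain kube07 --ignore-daemonsets --delete-emptydir-data

# On kube07 itself: leave the cluster
microk8s leave

# On a remaining control plane node: tell the cluster the node is gone for good
microk8s remove-node kube07

# To re-join later: generate a join command on an existing node...
microk8s add-node
# ...then run the printed "microk8s join <ip>:25000/<token>" command on kube07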
For example, in the monitoring namespace there are 2 daemonsets and both are broken in different ways:
[jonathan@poseidon-gazeley-uk ~]$ kubectl get no
NAME     STATUS   ROLES    AGE   VERSION
kube07   Ready    <none>   74m   v1.26.0
kube05   Ready    <none>   14d   v1.26.0
kube08   Ready    <none>   14d   v1.26.0
kube06   Ready    <none>   14d   v1.26.0
[jonathan@poseidon-gazeley-uk ~]$ kubectl get daemonset
NAME                                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
prometheus-stack-prometheus-node-exporter   4         4         4       4            4           <none>          22h
prometheus-smartctl-exporter-0              4         4         2       4            2           <none>          22h
[jonathan@poseidon-gazeley-uk ~]$ kubectl get po
NAME                                                  READY   STATUS    RESTARTS   AGE
prometheus-smartctl-exporter-0-dsvxt                  1/1     Running   0          23h
prometheus-smartctl-exporter-0-nwppn                  1/1     Running   0          23h
prometheus-smartctl-exporter-0-vgxvg                  1/1     Running   0          23h
prometheus-stack-prometheus-node-exporter-2zpns       2/2     Running   0          23h
prometheus-stack-kube-state-metrics-9b97fb746-kc5rw   1/1     Running   0          23h
prometheus-stack-prometheus-node-exporter-275pd       2/2     Running   0          23h
prometheus-stack-prometheus-node-exporter-g5ftd       2/2     Running   0          23h
Neither daemonset actually has a pod on all 4 nodes (only 3 of each are running), yet the Desired/Current/Ready figures reported above don't reflect that. Something is seriously wrong with the Kubernetes control plane, but I can't figure out what.
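A rough sketch of the kubectl checks for cross-referencing what the controllers report against what actually exists (standard kubectl; the monitoring namespace and daemonset names as above):

# Reported desired/current/ready counts for the daemonsets
kubectl -n monitoring get daemonset -o wide

# Which node each pod actually landed on
kubectl -n monitoring get pods -o wide

# Events that might explain missing daemonset pods
kubectl -n monitoring describe daemonset prometheus-smartctl-exporter-0

# Taints or conditions on the re-added node that could block scheduling
kubectl describe node kube07

# Recent scheduler/controller events across the cluster
kubectl get events -A --sort-by=.lastTimestamp | tail -n 30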
What Should Happen Instead?
- If a node fails, Kubernetes should reschedule pods on other nodes
- The reported number of replicas should match the actual number of replicas
- If the desired number of replicas is not the same as the current number, Kubernetes should schedule more
Reproduction Steps
I can’t consistently reproduce.
Introspection Report
microk8s inspect took a long time to run on all nodes, but reported no errors. It was kube07 that was removed and re-added to the cluster.
kube05-inspection-report-20230206_185244.tar.gz
kube06-inspection-report-20230206_185337.tar.gz
kube07-inspection-report-20230206_185248.tar.gz
kube08-inspection-report-20230206_185353.tar.gz
Can you suggest a fix?
It may be related to #2724, which I reported over a year ago and was never resolved.
Are you interested in contributing with a fix?
Yes, if I can. I feel this is a serious and ongoing problem with MicroK8s.
I still can’t figure out what’s going on. I don’t have control of my cluster; the dqlite master is running at 100% CPU. I am reluctant to do anything really invasive because I have OpenEBS/cStor volumes on my nodes, which depend on the api-server for their own quorum.
My gut instinct is to destroy the cluster and recreate it from scratch to restore service, but I had to do that 3 weeks ago too (https://github.com/canonical/microk8s/issues/3204#issuecomment-1398873620), and it just broke again on its own. I don’t have confidence that recent releases of MicroK8s are stable and durable, especially with hyperconverged storage.
At the moment I’m not touching anything, but I’m thinking of improving my off-cluster storage first and rebuilding without OpenEBS/cStor, for the safety of my data.
I appreciate that MicroK8s is free and that people put a lot of hard work into it, but I feel that some serious problems aren’t getting enough attention. Several open issues about reliability problems potentially related to dqlite remain unresolved:
@djjudas21 a MicroK8s cluster starts with a single node to which you join other nodes to form a multi-node cluster. As soon as the cluster has three control plane nodes (nodes joined without the --worker flag) it becomes HA. HA means that the datastore is replicated across three nodes and will tolerate one node failure without disrupting operation. In an HA cluster you have to make sure you have at least three control plane nodes running at all times. If you drop below 2 nodes the cluster will freeze because there is no quorum to replicate the datastore. In this case (an HA cluster left with a single node) you can recover by following the instructions in the Recover from lost quorum page. This should explain why, when the cluster was left with one node (kube05), you were able to read its state but not change it; that is, you could see the kubectl output but you could not do anything on the cluster. In such a frozen cluster the datastore logs (journalctl -fu snap.microk8s.daemon-k8s-dqlite) report errors like "context deadline exceeded" and "database locked".
As we said, the first 3 nodes in an HA cluster replicate the datastore. Any write to the datastore needs to be acked by the majority of those nodes, the quorum. As we scale the control plane beyond 3 nodes, the next two nodes maintain a replica of the datastore and stand by in case one of the first three departs. When a node misses some heartbeats it gets replaced, and the rest of the nodes agree on the role each node plays. These roles are stored in /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml. In the case of this cluster, node kube07 has a role of 1, meaning it is a standby (replicating the datastore but not participating in the quorum). The rest of the nodes have role 0, meaning they are voters, i.e. part of the quorum.
The nodes cannot change IP. Even if they have crashed, left the cluster, or are misbehaving, they are still considered part of the cluster, because they may be rebooting, having network connectivity problems, undergoing maintenance, etc. If we know a node will not be coming back in its previous state, we must call microk8s remove-node to let the cluster know that specific node has departed permanently. It is important to point out that if we, for example, "format and reinstall" a node and reuse the hostname and IP, it should be treated as a departed node, so it has to be removed with microk8s remove-node before we join it again.
Of course the above does not explain why the cluster initially started to freeze, but hopefully it gives you some insight into what is happening under the hood. I am not sure what happened around "Feb 06 00:43:43". Will keep digging in the logs. A way to reproduce the issue would be ideal.
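A short sketch of how to check the roles and datastore logs described above (the path and unit name are those quoted in this thread; role meanings as explained above):

# On a control plane node: list dqlite members and their roles
# (Role 0 = voter / part of the quorum, Role 1 = standby replica)
sudo cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml

# Follow the datastore logs for quorum/locking errors
# (e.g. "context deadline exceeded", "database locked")
sudo journalctl -fu snap.microk8s.daemon-k8s-dqlite

# Permanently remove a node that will not come back with its old hostname/IP
microk8s remove-node kube07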
One question I have is why I see errors like this in the dmesg logs on a couple of nodes:
@MathieuBordere
Thanks. I added the env vars and restarted dqlite at 10:07. Then I let it run for a minute or two and ran microk8s inspect.
inspection-report-20230210_101530.tar.gz
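For completeness, a sketch of how extra dqlite debug environment variables are typically applied in MicroK8s. The comment above does not say which variables were set, so the variable names and the env file path here are assumptions; treat this as illustrative only.

# Hypothetical: enable verbose dqlite/raft tracing
# (variable names and env file path are assumed, not taken from this report)
echo 'LIBDQLITE_TRACE=1' | sudo tee -a /var/snap/microk8s/current/args/k8s-dqlite-env
echo 'LIBRAFT_TRACE=1'   | sudo tee -a /var/snap/microk8s/current/args/k8s-dqlite-env

# Restart the datastore daemon so the new environment takes effect
sudo snap restart microk8s.daemon-k8s-dqlite

# Then collect an inspection report as above
microk8s inspect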