cluster-api: While creating a new cluster, CAPI fails to remediate new machines that aren't functional

What steps did you take and what happened: I’m using EKS Anywhere (EKS-A) to create clusters; EKS-A builds on CAPI. The infrastructure is Apache CloudStack, provisioned via CAPC.

CAPI’s machine health checker (MHC) works well, but only after a cluster is fully created. If a problem during cluster creation makes a machine unusable, CAPI cannot identify or remediate the situation. Such problems include a VM that boots but lacks the network connectivity needed to join the cluster, or a VM that starts but whose required services fail to come up because of a configuration problem.

I started by creating a management cluster. Then I created a workload cluster. To simulate a failure, I ran a script on one of the new workload cluster VMs as soon as it was reachable by SSH. The script disabled the VM’s network adapters.

What did you expect to happen: CAPI should notice that the Machine associated with the failed VM is stuck in the Provisioned phase for far too long, and then replace it. This never happens. EKS-A eventually times out after about 2 hours and leaves the workload cluster in an incomplete state.

Anything else you would like to add: I haven’t done extensive testing with cluster upgrades, but it seems that cluster upgrades can be affected in a similar way to cluster creation.

EKS-A normally adds the machine health checks to the cluster at the very end of the process, after all the machines are created and Cilium and kube-vip are installed. I made a custom build that adds those same health checks before machine creation instead of at the end. The CAPI log showed that MHC was unable to connect to the workload cluster’s endpoint, which makes sense because that cluster hadn’t been created yet.
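For context, the health checks in question are ordinary MachineHealthCheck objects on the management cluster. Below is a minimal sketch of what one looks like when built with the CAPI Go types; the object name, selector, and timeouts are my own illustrative values, not EKS-A’s exact settings.

```go
package example

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// workerHealthCheck builds a MachineHealthCheck roughly like the ones added
// at the end of cluster creation. The object name, selector label value, and
// timeouts are illustrative assumptions, not EKS-A's exact settings.
func workerHealthCheck(namespace, clusterName string) *clusterv1.MachineHealthCheck {
	return &clusterv1.MachineHealthCheck{
		ObjectMeta: metav1.ObjectMeta{
			Name:      clusterName + "-worker-unhealthy", // hypothetical name
			Namespace: namespace,
		},
		Spec: clusterv1.MachineHealthCheckSpec{
			ClusterName: clusterName,
			Selector: metav1.LabelSelector{
				// Standard label CAPI puts on Machines belonging to a cluster.
				MatchLabels: map[string]string{"cluster.x-k8s.io/cluster-name": clusterName},
			},
			// Give a brand-new Machine some time to join as a Node before it
			// is considered unhealthy.
			NodeStartupTimeout: &metav1.Duration{Duration: 10 * time.Minute},
			UnhealthyConditions: []clusterv1.UnhealthyCondition{
				{Type: corev1.NodeReady, Status: corev1.ConditionUnknown, Timeout: metav1.Duration{Duration: 5 * time.Minute}},
				{Type: corev1.NodeReady, Status: corev1.ConditionFalse, Timeout: metav1.Duration{Duration: 5 * time.Minute}},
			},
		},
	}
}
```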

MHC should be able to use the management cluster’s endpoint during cluster creation, because all the objects initially exist on the management cluster. So I modified the MHC code in CAPI to connect to the management cluster endpoint instead of the workload cluster endpoint. That resolved the errors in the CAPI log, but MHC saw all the new VMs as unhealthy and began endlessly deleting and replacing them as fast as they could be provisioned, possibly because the new Machines weren’t associated with Nodes yet.
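To make the idea concrete, here is a rough sketch of what the experiment amounted to, assuming the controller obtains its remote client through the ClusterCacheTracker; the function and its parameters are hypothetical, and this is an illustration of the approach, not the actual CAPI controller code.

```go
package example

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/controllers/remote"
	"sigs.k8s.io/cluster-api/util"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// healthCheckClient sketches the experiment. The real change was made inside
// CAPI's MachineHealthCheck controller, which (as I understand it) obtains a
// client for the target cluster through the ClusterCacheTracker.
func healthCheckClient(ctx context.Context, mgmtClient client.Client,
	tracker *remote.ClusterCacheTracker, cluster *clusterv1.Cluster,
	useManagementCluster bool) (client.Client, error) {

	if useManagementCluster {
		// Experiment: during creation, all the Machine objects live on the
		// management cluster, so use its client. The connection errors went
		// away, but MHC then saw every new VM as unhealthy, presumably
		// because the Machines had no Nodes associated with them yet.
		return mgmtClient, nil
	}

	// Normal behaviour: connect to the workload cluster's API server, which
	// fails while that cluster is still being created.
	return tracker.GetClient(ctx, util.ObjectKey(cluster))
}
```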

I also experimented with solutions that don’t use MHC. The most reliable way to detect that a new Machine needs to be replaced seems to be checking whether it stays in the Provisioned phase for more than a few minutes. When that happens, deleting the Machine object usually results in the machine being replaced.
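A minimal sketch of that detection-and-delete loop, written against the CAPI Go types and a controller-runtime client, is below; the function name and threshold parameter are my own, and the caveats in the next two paragraphs apply to anything like this.

```go
package example

import (
	"context"
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// remediateStuckMachines is a sketch of the non-MHC approach: find Machines
// that have sat in the Provisioned phase longer than maxStuck and delete them
// so that their owner (MachineSet or control plane provider) recreates them.
// The function name and threshold are illustrative, not part of CAPI or EKS-A.
func remediateStuckMachines(ctx context.Context, c client.Client, namespace string, maxStuck time.Duration) error {
	var machines clusterv1.MachineList
	if err := c.List(ctx, &machines, client.InNamespace(namespace)); err != nil {
		return err
	}

	for i := range machines.Items {
		m := &machines.Items[i]
		if m.Status.GetTypedPhase() != clusterv1.MachinePhaseProvisioned {
			continue
		}
		// LastUpdated records the last phase transition, so a stale value
		// means the Machine has been stuck in Provisioned at least that long.
		if m.Status.LastUpdated == nil || time.Since(m.Status.LastUpdated.Time) < maxStuck {
			continue
		}
		if err := c.Delete(ctx, m); err != nil {
			return err
		}
	}
	return nil
}
```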

Deleting a Machine object that is stuck in the Provisioned phase from code inside CAPI seems to be safe in the sense that the Machine is reliably replaced, but it often still causes problems for the cluster, especially if the Machine being deleted is the first control plane machine. The first CP machine is used as the cluster endpoint by kube-vip, and the replacement Machine is somehow initialized differently from the original, which leaves the cluster in an unusable state.

If a workload Machine object in the Provisioned phase is deleted by code outside of CAPI, it sometimes leaves the management cluster in a bad state. This seems to be caused by a race condition, and it happens more frequently when the management cluster has 3 CP nodes than when it has only 1. The result is that the Machine never gets replaced, and the remnants of the workload cluster have to be manually discovered and removed from the management cluster. (Deleting the workload cluster’s clusters.cluster.x-k8s.io object from the management cluster doesn’t fully clean up the workload cluster the way it normally would.) This demonstrates the need to remediate the failure from code inside CAPI rather than from an external process.
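For reference, the discovery part of that cleanup can be approximated by listing objects that still carry the workload cluster’s name label; a sketch (Machines only, with a hypothetical function name) follows. Other leftover resources, such as infrastructure objects and secrets, would need the same treatment.

```go
package example

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listOrphanedMachines sketches the manual discovery step: list Machines on
// the management cluster that still carry the workload cluster's name label
// after the Cluster object has been deleted. The function name is
// illustrative; cluster.x-k8s.io/cluster-name is the standard label CAPI
// applies to objects belonging to a cluster.
func listOrphanedMachines(ctx context.Context, c client.Client, namespace, clusterName string) ([]clusterv1.Machine, error) {
	var machines clusterv1.MachineList
	err := c.List(ctx, &machines,
		client.InNamespace(namespace),
		client.MatchingLabels{"cluster.x-k8s.io/cluster-name": clusterName},
	)
	if err != nil {
		return nil, err
	}
	return machines.Items, nil
}
```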

Environment:

  • Cluster-api version: 1.2.0
  • minikube/kind version: 0.16.0
  • Kubernetes version (kubectl version): 1.24.2
  • OS (e.g. from /etc/os-release):
    • Kind and kubectl are running on macOS Monterey (12.6).
    • The management cluster and workload cluster VMs are running RHEL 8.

/kind bug
/area health
/area machine

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 29 (17 by maintainers)

Most upvoted comments

I discovered a mistake in my experiment 2 days ago. The EKS-A changes I made weren’t always working, so the CNI might not have been installed in time. I fixed my code changes today and tried a couple more times. With the CNI working, MHC was able to detect the health of all the machines, and it remediated the unhealthy ones.

I only made worker machines unhealthy in this test. EKS-A did fail once after MHC replaced a machine, but that might be more of an EKS-A problem than a CAPI problem. I still need to test these changes with unhealthy control plane machines.