cluster-api-provider-aws: Default to spread worker nodes across failure domains

/kind feature

Describe the solution you’d like: Currently, CAPI spreads control plane machines across the reported failure domains (i.e. availability zones). It does not do this for worker nodes, that is, machines in a machine deployment or standalone machines.

Current advice is to create separate machine deployments and manually assign an AZ (via FailureDomain) to each of the machine deployments, to ensure that you end up with worker machines in different AZs.
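For reference, a minimal sketch of that workaround expressed with the CAPI Go types (the helper name, object names, and replica count are illustrative, and the bootstrap and infrastructure template references are omitted for brevity):

package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/pointer"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// machineDeploymentForAZ builds one MachineDeployment pinned to a single
// failure domain; you create one of these per AZ you want worker coverage in.
func machineDeploymentForAZ(clusterName, az string, replicas int32) *clusterv1.MachineDeployment {
	return &clusterv1.MachineDeployment{
		ObjectMeta: metav1.ObjectMeta{
			Name: clusterName + "-md-" + az, // e.g. "my-cluster-md-us-east-2a"
		},
		Spec: clusterv1.MachineDeploymentSpec{
			ClusterName: clusterName,
			Replicas:    pointer.Int32(replicas),
			Template: clusterv1.MachineTemplateSpec{
				Spec: clusterv1.MachineSpec{
					ClusterName:   clusterName,
					FailureDomain: pointer.String(az), // pins every machine in this MD to one AZ
				},
			},
		},
	}
}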

It would be better if, when creating machines and no failure domain is specified on the Machine, we used the failure domains reported on the Cluster and created the machine in the failure domain that already has the fewest machines. CAPI has some functions we could potentially use, something like this:

machines := collections.FromMachineList(machinesList)
failureDomain := failuredomains.PickFewest(m.Cluster.Status.FailureDomains, machines)
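To make that more concrete, here is a rough sketch (not the actual CAPA implementation) of how a reconciler could default the failure domain for a worker Machine that does not specify one. The helper name pickFailureDomain and the listing/filtering details are illustrative, and the PickFewest signature shown matches older CAPI releases, so it may differ in newer ones:

package example

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/collections"
	"sigs.k8s.io/cluster-api/util/failuredomains"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// pickFailureDomain defaults machine.Spec.FailureDomain to the failure domain
// that currently has the fewest worker machines in this cluster.
func pickFailureDomain(ctx context.Context, c client.Client, cluster *clusterv1.Cluster, machine *clusterv1.Machine) error {
	// Respect an explicitly requested failure domain.
	if machine.Spec.FailureDomain != nil {
		return nil
	}

	// List the machines that already belong to this cluster.
	machineList := &clusterv1.MachineList{}
	if err := c.List(ctx, machineList, client.InNamespace(cluster.Namespace),
		client.MatchingLabels{"cluster.x-k8s.io/cluster-name": cluster.Name}); err != nil {
		return err
	}

	// Only worker machines should influence the choice, so filter out the
	// control plane machines before counting per failure domain.
	workers := collections.FromMachineList(machineList).
		Filter(collections.Not(collections.ControlPlaneMachines(cluster.Name)))

	// Pick the failure domain with the fewest machines already in it.
	machine.Spec.FailureDomain = failuredomains.PickFewest(cluster.Status.FailureDomains, workers)
	return nil
}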

Anything else you would like to add: We need to investigate whether this is feasible, or whether it is something that should be done upstream in machine deployments.

Environment:

  • Cluster-api-provider-aws version:
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 16 (11 by maintainers)

Most upvoted comments

As discussed in the CAPA office hours, Indeed had several CAPA workload clusters (self-managed, non-EKS) spanning all AZs in us-east-2 on July 28, 2022 during the outage. Our clusters are configured with a machine deployment per AZ, and the cluster autoscaler is configured to autoscale those machine deployments using the clusterapi provider. We also configure the cluster autoscaler and all of the CAPI/CAPA controllers to use leader election and run 3 replicas of each.

What we observed was that when power to AZ1 was lost, about 10 minutes later (I believe the 10 minutes comes from the 5-minute delay for the nodes to be marked unready due to missing kubelet heartbeats plus the 5-minute pod-eviction-timeout of the kube-controller-manager, but I’m not 100% certain) pods were recreated by Kubernetes without any outside interaction and sat in the Pending state. The cluster autoscaler scaled up the machine deployments, and as soon as the new machines joined the cluster, the workloads were scheduled and continued to perform normally, despite the control plane being in a degraded state. No human intervention was required during the outage or for the cluster to recover after AZ1 was restored.

Below are two sets of graphs from one of those clusters, which show the control plane becoming degraded (2/3 available) and then the pods being scheduled and created. The pods are scheduled in three “waves” as machines join the cluster and allow more pods to schedule. [Screenshots: “Kubernetes Control Plane” Datadog dashboards, 2023-01-09]

I can provide more specific details on how the MDs were configured if that’s useful.

So I wonder if, instead of implementing this feature, documentation on how to correctly configure CAPA clusters to sustain an AZ outage would be more desirable?