cluster-api: When deploying cluster with control-plane-count = 3, api server is not responding during scaling

What steps did you take and what happened:

When creating a cluster with CONTROL_PLANE_MACHINE_COUNT: 3 on AWS or vSphere, I noticed that the API server intermittently stops responding. This happens mainly while the control plane VMs are being scaled from count=1 to count=3, and for some time after the control plane nodes are up.

I noticed this when running clusterctl init once the API server was up but before all the control plane nodes had been provisioned.

During the init workflow, the API server was unresponsive for some time and the command failed with the following errors.

Attempt 1: (AWS)

Error: failed to get cert-manager web-hook: rpc error: code = Unavailable desc = etcdserver: leader changed

Attempt 2: (AWS)

Error: failed to get cert-manager web-hook: etcdserver: request timed out

Attempt 3: (vSphere)

Error: failed to get cert-manager web-hook: Get https://192.168.111.79:6443/apis/apiregistration.k8s.io/v1beta1/apiservices/v1beta1.webhook.cert-manager.io: EOF

Attempt 4: (vSphere)

Error: failed to create cert-manager component: /v1, Kind=ServiceAccount, cert-manager/cert-manager-cainjector: rpc error: code = Unavailable desc = etcdserver: leader changed

All of the failures occurred because the API server was not responding or because of an issue with the etcd server. Note: the init process had started and installed all the CRDs before the failure, so the API server was responsive when the init process began.

What did you expect to happen:

The API server should remain responsive throughout the control plane scaling operation.

Environment:

  • Cluster-api version: commit v0.3.0-rc.2-82-gaa289e08a
  • On vSphere and AWS

/kind bug

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 20 (16 by maintainers)

Most upvoted comments

Going to close this one now that #2563 has been merged and we haven’t heard back

/close

One immediate problem:

https://github.com/kubernetes-sigs/cluster-api/blob/ac1dee8cf441dda7f1a36fd4e149e195ef16db20/cmd/clusterctl/client/cluster/cert_manager.go#L144-L146

which can encounter this error if e.g. the apiserver is having issues: https://github.com/kubernetes-sigs/cluster-api/blob/ac1dee8cf441dda7f1a36fd4e149e195ef16db20/cmd/clusterctl/client/cluster/cert_manager.go#L207

If we return the error, then cm.pollImmediateWaiter is going to return immediately instead of retrying. This is one specific thing we can adjust. cc @fabriziopandini @vincepri