cluster-api: When deploying cluster with control-plane-count = 3, api server is not responding during scaling
What steps did you take and what happened:
When I create a cluster on AWS or vSphere with CONTROL_PLANE_MACHINE_COUNT: 3, the API server intermittently stops responding. This happens mainly while the control plane VMs are being scaled from count=1 to count=3, and for some time after the control plane nodes are up.
I noticed this when running clusterctl init once the API server was up but before all the control plane nodes had been provisioned.
During the init workflow, the API server was unresponsive for some time and the command failed with the following errors.
Attempt 1: (AWS)
Error: failed to get cert-manager web-hook: rpc error: code = Unavailable desc = etcdserver: leader changed
Attempt 2: (AWS)
Error: failed to get cert-manager web-hook: etcdserver: request timed out
Attempt 3: (VSphere)
Error: failed to get cert-manager web-hook: Get https://192.168.111.79:6443/apis/apiregistration.k8s.io/v1beta1/apiservices/v1beta1.webhook.cert-manager.io: EOF
Attempt 4: (VSphere)
Error: failed to create cert-manager component: /v1, Kind=ServiceAccount, cert-manager/cert-manager-cainjector: rpc error: code = Unavailable desc = etcdserver: leader changed
All the failures occurred because the API server was not responding or because of some issue with the etcd server. Note: the cluster init process had started and installed all the CRDs before the failure, so the API server was responsive when the init process started.
What did you expect to happen:
API server should always be responsive during the control plane scaling operation.
Environment:
- Cluster-api version: commit v0.3.0-rc.2-82-gaa289e08a
- On vsphere and aws
/kind bug
About this issue
- State: closed
- Created 4 years ago
- Comments: 20 (16 by maintainers)
Going to close this one now that #2563 has been merged and we haven’t heard back
/close
One immediate problem is this code:
https://github.com/kubernetes-sigs/cluster-api/blob/ac1dee8cf441dda7f1a36fd4e149e195ef16db20/cmd/clusterctl/client/cluster/cert_manager.go#L144-L146
It can encounter this error if, e.g., the apiserver is having issues:
https://github.com/kubernetes-sigs/cluster-api/blob/ac1dee8cf441dda7f1a36fd4e149e195ef16db20/cmd/clusterctl/client/cluster/cert_manager.go#L207
If we return the error, then cm.pollImmediateWaiter is going to return immediately instead of retrying. This is one specific thing we can adjust. cc @fabriziopandini @vincepri
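To illustrate the retry behavior described in this comment, here is a minimal, self-contained sketch. It is not the clusterctl code: `pollImmediate`, `conditionFunc`, `errTransient`, and `runDemo` are hypothetical names, and the poller is a simplified stand-in written against the usual condition-function contract (as in `wait.PollImmediate` from k8s.io/apimachinery), where a non-nil error from the condition ends the poll. The fake condition fails twice with an etcd-style error and then succeeds.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// conditionFunc mirrors the usual polling contract: done reports success,
// and a non-nil error aborts the poll immediately.
type conditionFunc func() (done bool, err error)

// pollImmediate is a simplified, hypothetical stand-in for a polling helper.
// It runs the condition right away, then once per interval until the timeout.
func pollImmediate(interval, timeout time.Duration, condition conditionFunc) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := condition()
		if err != nil {
			return err // any error from the condition ends the poll
		}
		if done {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for the condition")
		}
		time.Sleep(interval)
	}
}

// errTransient simulates an error such as the ones reported in this issue.
var errTransient = errors.New("etcdserver: leader changed")

// runDemo polls a fake API call that fails twice and then succeeds.
// With tolerant=false the condition propagates the error; with tolerant=true
// it reports "not done yet" so the poller tries again.
func runDemo(tolerant bool) (attempts int, err error) {
	err = pollImmediate(time.Millisecond, time.Second, func() (bool, error) {
		attempts++
		if attempts < 3 {
			if tolerant {
				return false, nil // swallow the error so the poll retries
			}
			return false, errTransient // propagate: the poll stops here
		}
		return true, nil
	})
	return attempts, err
}

func main() {
	attempts, err := runDemo(false)
	fmt.Printf("propagating errors: attempts=%d err=%v\n", attempts, err)

	attempts, err = runDemo(true)
	fmt.Printf("tolerating errors:  attempts=%d err=%v\n", attempts, err)
}
```

When the condition propagates the error, the poll gives up after the first attempt; when it returns `false, nil`, the poll continues until the call succeeds or the timeout expires. A real fix would probably want to swallow only errors known to be transient and still return permanent ones, which this sketch does not attempt.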