cluster-api: cert-manager pods not starting because of untolerated control-plane taint
What steps did you take and what happened:
Tried to initialize a management cluster via `clusterctl init --core cluster-api --bootstrap talos --control-plane talos --infrastructure hetzner`. Got the error: `timed out waiting for connection`.
The cert-manager pods report the following scheduling event: `Warning FailedScheduling 13m default-scheduler 0/3 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }.`
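For reference, the taint can be confirmed on the management cluster nodes (a sketch; output abbreviated and illustrative):

```console
$ kubectl describe nodes | grep Taints
Taints:             node-role.kubernetes.io/control-plane:NoSchedule
```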
What did you expect to happen: cert-manager pods starting normally, tolerating the control-plane taint.
Environment:
- Cluster-api version: v1.2.1
- Cert-manager version: v1.8.2
- minikube/kind version: v0.14.0 (but not used in this example)
- Kubernetes version (use `kubectl version`): v1.24.0
- OS (e.g. from /etc/os-release): Talos v1.2.0-alpha.2
/kind bug
Thanks for the info. Then I'm personally +1 to close this issue, given that we cannot force cert-manager to add this toleration and there are already ways to provide a custom cert-manager yaml and/or use pre-installed cert-manager versions.
If in the future we recognize that running a management cluster entirely on CP-tainted nodes is a scenario we want to officially support, we can re-open and plan for it (including proper testing for the use case).
It looks like cert-manager stopped being created from an embedded yaml with https://github.com/kubernetes-sigs/cluster-api/pull/4748, so configuration is now done through a yaml/override pointed to by users, with the default being the upstream cert-manager manifest hard-coded in a CAPI release.
We could encourage users to customize their own yamls here, and this case could be incorporated into clusterctl. This issue does seem like it might be common and could be a gotcha for users. The tolerations could be injected into the objects, similar to how they were previously hardcoded in our yamls.
Possibly this is the right place to do this: https://github.com/kubernetes-sigs/cluster-api/blob/3d06f17bef6c7463e5d22028399b081dd85a3ecf/cmd/clusterctl/client/cluster/cert_manager.go#L177
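A minimal sketch of what that injection could look like, assuming the unstructured objects clusterctl works with there (not the actual CAPI implementation; the function name is hypothetical):

```go
package certmanager

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// addControlPlaneToleration appends a toleration for the
// node-role.kubernetes.io/control-plane taint to a Deployment's pod
// template. Sketch only; the old master taint is omitted.
func addControlPlaneToleration(obj *unstructured.Unstructured) error {
	if obj.GetKind() != "Deployment" {
		return nil
	}
	// Read the existing tolerations (if any) from the pod template.
	tolerations, _, err := unstructured.NestedSlice(obj.Object, "spec", "template", "spec", "tolerations")
	if err != nil {
		return err
	}
	tolerations = append(tolerations, map[string]interface{}{
		"key":      "node-role.kubernetes.io/control-plane",
		"operator": "Exists",
		"effect":   "NoSchedule",
	})
	return unstructured.SetNestedSlice(obj.Object, tolerations, "spec", "template", "spec", "tolerations")
}
```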
(Just FYI: currently the cert-manager pods tolerate neither the old node-role.kubernetes.io/master taint nor the new node-role.kubernetes.io/control-plane taint.)
Not directly through CAPI; however, you can use your own custom cert-manager yaml which includes the tolerations: https://cluster-api.sigs.k8s.io/clusterctl/configuration.html#cert-manager-configuration
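A sketch of the relevant clusterctl configuration, assuming a locally patched manifest (the file paths are examples):

```yaml
# e.g. ~/.cluster-api/clusterctl.yaml
cert-manager:
  url: "/home/user/overrides/cert-manager/v1.8.2/cert-manager.yaml"
  version: "v1.8.2"
```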
You can download the same version as CAPI uses and add the fields with kustomize or whichever tool you're using for config management; a kustomize sketch follows below.
This should solve the problem for now while this issue waits for triage and for someone to pick it up.
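For illustration, a kustomize overlay along these lines could add the toleration to every cert-manager Deployment (file names are examples; cert-manager.yaml is the downloaded upstream manifest):

```yaml
# kustomization.yaml
resources:
- cert-manager.yaml  # upstream manifest, same version as the CAPI default
patches:
- target:
    kind: Deployment
  patch: |-
    - op: add
      path: /spec/template/spec/tolerations
      value:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
```

Render it with `kubectl kustomize .` and point the cert-manager `url` in the clusterctl configuration at the result.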
v1.8.2
I think the behavior is kind of expected. Usually we use kind, or a cluster with worker nodes, as the management cluster.
In both cases there are nodes without a control-plane taint.
What are you using as your management cluster?