cluster-api: KubeadmConfig changes should be reconciled for machine pools, triggering instance recreation
What would you like to be added (User Story)?
As an operator, I want changes to `KubeadmConfig.spec.{files,preKubeadmCommands,...}` to have an effect on MachinePool-created nodes, resulting in server instance recreation.
Detailed Description
Discussed in office hours 2023-06-14 (notes, main points copied into this issue below).
Situation: A MachinePool manifest references an AWSMachinePool and a KubeadmConfig (a very regular machine pool config)
Expectation: Changing `KubeadmConfig.spec.*` should lead to recreating (“rolling”) nodes. With the CAPA infra provider, nothing happens at the moment. Here’s why.
- Problem 1: CAPI’s `KubeadmConfigReconciler` does not immediately update the bootstrap secret once `KubeadmConfig.spec` changes, but only once it rotates the bootstrap token (purpose: new machine pool-created nodes can join the cluster later on). This means several minutes of waiting for reconciliation.
  - Suggestion: Simple bug fix. @AndiDog has a draft implementation that always considers updating the secret, not only if the token must be refreshed. In the meantime, users can work around by creating a new KubeadmConfig object.
- Problem 2: CAPA (and likely all other infra providers) does not watch the bootstrap secret, so it cannot immediately react to `KubeadmConfig.spec` changes either.
  - @AndiDog: Should it even directly watch the secret? What should the CAPI ⇔ CAPx contract be?
  - @fabriziopandini: Watching secrets can blow up memory [of the kubectl client]. Think of the UX and possible solutions first.
  - @CecileRobertMichon: Maybe change `MachinePool.spec.template.spec.bootstrap.dataSecretName` every time, because that triggers reconciliation for the `MachinePool` object (machinepool_controller_phases.go code).
  - @sbueringer: For `MachinePool` support in `ClusterClass` we have to decide what the “ideal” way to roll out BootstrapConfig is.
- Problem 3: The bootstrap secret contains both the “how to set up this server” init data (e.g. cloud-init / Ignition) and the random bootstrap token by which nodes join the cluster. If only the token gets refreshed (`DefaultTokenTTL` is 15 minutes), we don’t want nodes to be recreated, since that would recreate all nodes every few minutes.
  - @AndiDog: CAPA does not trigger a node rollout (AWS ASG instance refresh) right now (https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/4071) because it has no logic to detect when the bootstrap config changed. How can CAPA (or another infra provider) tell that something apart from the bootstrap token has changed? We have a CAPA PR (https://github.com/kubernetes-sigs/cluster-api-provider-aws/pull/4245) that checks “is there a difference apart from only the bootstrap token value”, but that is very hacky and format/provider-specific. Maybe a “checksum without bootstrap token” provided by CAPI could help the infra provider? (See the sketch after this list.)
  - @CecileRobertMichon: Split bootstrap token vs. other init data?
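To make the “checksum without bootstrap token” idea from Problem 3 more concrete, here is a minimal Go sketch of how such a value could be computed. The function name and the placeholder masking are assumptions for illustration only; this is not existing CAPI or CAPA code.

```go
// Hypothetical helper illustrating the "checksum without bootstrap token"
// idea from Problem 3. Names and the masking approach are assumptions for
// this sketch, not existing CAPI or CAPA code.
package bootstrap

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
)

// partialChecksum hashes the rendered bootstrap data (e.g. cloud-init or
// Ignition) with the current bootstrap token value masked out, so that a
// token-only rotation does not change the checksum (and therefore does not
// trigger a node rollout), while any other change does.
func partialChecksum(bootstrapData []byte, bootstrapToken string) string {
	masked := bootstrapData
	if bootstrapToken != "" {
		// Replace every occurrence of the token with a stable placeholder.
		masked = bytes.ReplaceAll(bootstrapData, []byte(bootstrapToken), []byte("<token>"))
	}
	sum := sha256.Sum256(masked)
	return hex.EncodeToString(sum[:])
}
```

A bootstrap provider could expose such a checksum (e.g. in its status) so that infra providers can detect meaningful bootstrap changes without parsing the provider- or format-specific payload.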
Anything else you would like to add?
- CAPA-specific issue: https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/4071. The problem applies to all infra providers, though.
- This issue asks for mutable KubeadmConfig, while another proposal wants to make it immutable even for machine pool (breaking change): https://github.com/kubernetes-sigs/cluster-api/issues/4910.
Label(s) to be applied
/kind feature /area bootstrap /area machinepool
One idea was confirmed as reasonable by several people in the office hours: consider a change of the bootstrap object reference as the desire to roll out. I had the doubt that both the old and new bootstrap config (with different names) could produce the same output, and the user could be surprised that all nodes get replaced. But then again, @CecileRobertMichon and others rightfully commented “Why would you create a new one with the same content?”, since that seems unusual.
Next, I will try this solution and see whether other opinions appear as to how this can be solved. And @vincepri suggested factoring out a new issue for splitting off the random, long-lived bootstrap token from `KubeadmConfig`, since it’s intermixed here but really a separate problem. I’ll create an issue for that.
Thanks for tagging me @CecileRobertMichon. I’ll try to add my thoughts in general as well as in the context of the WIP proposal for addressing #5294.
Terminology disambiguation
First, I think we have to update our terminology to clearly distinguish between two related yet distinct concerns: bootstrap and provisioning.
Note that the bootstrap process is built on top of provisioning: we currently use cloud-init or Ignition in order to convey kubeadm configuration to a machine as well as to execute the shell commands which trigger the kubeadm init/join process. However, provisioning isn’t the same as bootstrap, because provisioning can also contain OS-level customizations which aren’t related to bootstrap (e.g. tweaking a sysctl parameter). Furthermore, even if provisioning contained only bootstrap-related configuration, IMO we’d still need the separate terminology because kubeadm configuration isn’t the same as cloud-init or Ignition configuration, and it’s important to distinguish between the two given the different stages performed in order to get from a declarative kubeadm config to a VM with a running kubelet.
My thoughts
With that said, here are my thoughts:
Machine recreation on bootstrap config change
Following what @fabriziopandini said, I think that before we figure out how to recreate machines following a change to the bootstrap config, we should decide whether we want to allow doing so in the first place. Personally, I don’t see why not but I don’t have a strong opinion here. Regardless of what gets decided, I agree that this should either be supported and working or explicitly unsupported and therefore documented and rejected by a webhook. In any case, I think we can agree that “supported by accident” isn’t something we want.
Multiple checksums
I don’t think we want to maintain two bootstrap config checksums for the following reasons:
I’d have to dig deeper into the codebase to get a clearer understanding of the problem, but I’m guessing the thing which shouldn’t be hashed (the bootstrap token) should have resided elsewhere if it’s not part of { things we watch when deciding whether to replace a machine }. I don’t have an alternative suggestion at the moment, but chances are we’ll have to address this concern as a byproduct of working on #5294 anyway.
CAPI <> CAPx contract
Great point about the CAPI <> CAPx contract @AndiDog. This should be a core piece of the #5294 work and I’ve already started thinking about it. I’ll try to copy you in relevant discussions around the proposal process.
We already have https://cluster-api.sigs.k8s.io/developer/providers/machine-infrastructure which might be relevant but I need to dig deeper and see if we have an entire contract missing or we just need to update an existing one.
High-level description of #5294
The #5294 proposal is still WIP, but very briefly here is what I envision at the moment:
The proposal needs a lot more fleshing out so I don’t know about a timeline. I’m still gathering requirements all over the place since #5294 touches a very central area of the codebase which affects a lot of use cases. My hope is that we can cleanly separate the effort into manageable tasks and focus on a narrow change set at a given time without losing the bigger picture. Maybe we should break #5294 down to a couple of smaller proposals - we’ll see.
Infra providers watching KubeadmConfig
Following from the above, I don’t think infra providers should watch KubeadmConfig because I think it goes against the separation of concerns we want to eventually achieve. Infra providers should watch a location with bootstrapper-agnostic provisioning config (e.g. a cloud-config document) and pass that config to machines.
Any comments on the above are welcome. Sorry about the 📜🙂
Frankly speaking, what I found confusing in the UX is that we are changing an external reference in place, and this is inconsistent with everything else in CAPI. Additionally, the reference to a bootstrap object is defined inside a machine template, while e.g. in MD/MS we have a similar API modeling but we use a bootstrap template instead of a bootstrap object.
However, while designing a solution for problem 1, IMO we should aim for a solution where changes to external objects surface to the parent objects, no matter if they are user-driven (e.g. rotating template references) or machine-driven (by changing the data secret name).
This approach - changing the parent object - is consistent with everything else in CAPI and provides a clean pattern for all the providers, and IMO this could address problem 2.
WRT problem 3, it seems to me that what we are saying is going in the direction of each machine having its own bootstrap token, so machine provisioning is not affected by concurrent changes to the bootstrap object, and this could be somewhat overlapping with @Jont828’s work (see e.g. https://github.com/kubernetes-sigs/cluster-api/pull/8828#discussion_r1228631689)
Last note, what we are deciding
@AndiDog: In #8842, we’re adding MachinePool Machines to DockerMachinePools. This would affect the ClusterClass rollout e2e test, as we wouldn’t be able to roll out changes to the KubeadmConfig to the MachinePool Machines, so I’ve made a change to exclude them for now. But in the future, we’d want to revert this change. Tagging @willie-yao as well.
One additional thought: we might want to think about how to deal with “orphan” configs (for example, avoid rotating the KubeadmConfig bootstrap token if the config isn’t actually in use by a MachinePool)
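As a rough illustration of that thought, here is a hedged Go sketch of how a controller could check whether a KubeadmConfig is still referenced by any MachinePool before rotating its token. The helper is hypothetical; it assumes the v1beta1 MachinePool API from sigs.k8s.io/cluster-api/exp/api/v1beta1 and a controller-runtime client.

```go
// Hypothetical helper for the "orphan config" thought above: skip token
// rotation if no MachinePool references the KubeadmConfig anymore.
package bootstrap

import (
	"context"

	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// kubeadmConfigInUse returns true if any MachinePool in the given namespace
// references the named KubeadmConfig via spec.template.spec.bootstrap.configRef.
func kubeadmConfigInUse(ctx context.Context, c client.Client, namespace, configName string) (bool, error) {
	machinePools := &expv1.MachinePoolList{}
	if err := c.List(ctx, machinePools, client.InNamespace(namespace)); err != nil {
		return false, err
	}
	for _, mp := range machinePools.Items {
		ref := mp.Spec.Template.Spec.Bootstrap.ConfigRef
		if ref != nil && ref.Kind == "KubeadmConfig" && ref.Name == configName {
			return true, nil
		}
	}
	return false, nil
}
```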
We discussed this in a meeting (@CecileRobertMichon, @fabriziopandini, @vincepri) and decided to try and support the reference rotation (a changed `MachinePool.spec.template.spec.bootstrap.configRef.name`) as described in https://github.com/kubernetes-sigs/cluster-api/issues/8858#issuecomment-1737930898. Other solutions didn’t seem as natural, or are more complex. I’ll give it a shot and report back.
https://github.com/kubernetes-sigs/cluster-api/issues/5294 is a discussion way bigger than this specific issue.
That means that if we put this in the critical path, it will most probably delay the solution to this specific problem and take off the table the more tactical alternatives we were considering until now, like https://github.com/kubernetes-sigs/cluster-api/issues/8858#issuecomment-1719109202 or the idea I was mulling over in https://github.com/kubernetes-sigs/cluster-api/issues/8858#issuecomment-1719700381 about leveraging object rotation.
I just want to make sure everyone agrees on the implications, starting from @AndiDog who reported this issue.
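For illustration, here is a minimal Go sketch of the reference-rotation idea discussed in the comments above: treat a change of `MachinePool.spec.template.spec.bootstrap.configRef.name` as the request to roll out new instances. The annotation key and helper functions are hypothetical, not existing CAPI or CAPA code.

```go
// Minimal sketch of the "reference rotation" approach: a controller records
// the bootstrap config reference it last acted on in an annotation and treats
// any change as a request to roll out new instances. The annotation key and
// helpers are hypothetical, not existing CAPI or CAPA code.
package machinepool

import (
	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
)

const lastAppliedBootstrapRefAnnotation = "example.cluster.x-k8s.io/last-applied-bootstrap-ref"

// needsRollout returns true when the referenced bootstrap object changed since
// the last reconciliation, i.e. the user rotated configRef to a new object.
func needsRollout(mp *expv1.MachinePool) bool {
	ref := mp.Spec.Template.Spec.Bootstrap.ConfigRef
	if ref == nil {
		return false
	}
	last := mp.GetAnnotations()[lastAppliedBootstrapRefAnnotation]
	return last != "" && last != ref.Name
}

// markApplied records the currently referenced bootstrap object after a
// successful rollout (or on first reconciliation).
func markApplied(mp *expv1.MachinePool) {
	ref := mp.Spec.Template.Spec.Bootstrap.ConfigRef
	if ref == nil {
		return
	}
	annotations := mp.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[lastAppliedBootstrapRefAnnotation] = ref.Name
	mp.SetAnnotations(annotations)
}
```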
Here is the current bootstrap provider contract: https://cluster-api.sigs.k8s.io/developer/providers/bootstrap
If we add the field to MachinePoolStatus based on a bootstrap config status field, we would need to either make that status field part of the bootstrap provider status contract or make it opt-in/optional so it doesn’t break other bootstrap providers that don’t implement the field.
cc @johananl @richardcase who are working on https://github.com/kubernetes-sigs/cluster-api/issues/5294 which might help simplify this
yes, that’s correct
This is a dealbreaker if we’re talking about watching specific Kubeadm* types. Infra providers should be bootstrap-provider agnostic and work with all bootstrap and control plane providers, as kubeadm is not the only one out there. If we’re trying to watch the bootstrap reference, that would be fine, as we can do that without referencing KubeadmConfig specifically. In that case, `status.dataSecretPartialChecksum` would need to become part of the bootstrap provider contract and every bootstrap provider would need to implement it.
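To visualize what that could mean for the contract, here is a hedged sketch of an optional status field on a generic bootstrap provider type, reusing the `dataSecretPartialChecksum` name from this thread. The `FooConfigStatus` type and its JSON tags are illustrative assumptions, not the actual KubeadmConfig API.

```go
// Sketch of an optional contract field on a bootstrap provider's status,
// using the field name from the discussion above. FooConfigStatus is a
// stand-in for any bootstrap provider's status type, not a real CAPI type.
package v1beta1

type FooConfigStatus struct {
	// Ready denotes that the bootstrap data secret is fully generated.
	Ready bool `json:"ready,omitempty"`

	// DataSecretName is the name of the secret that stores the bootstrap data.
	DataSecretName *string `json:"dataSecretName,omitempty"`

	// DataSecretPartialChecksum is a checksum of the bootstrap data with the
	// short-lived bootstrap token excluded. Optional: providers that don't
	// implement it leave it empty, and infra providers then simply don't
	// roll out machines on bootstrap changes.
	// +optional
	DataSecretPartialChecksum string `json:"dataSecretPartialChecksum,omitempty"`
}
```

Making the field optional would keep the contract backwards compatible, as suggested above: bootstrap providers that don’t set it effectively opt out, and infra providers would only act on it when present.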