karpenter: Custom Deprovisioning Trigger Mechanism
Description
Allow users to trigger node Drift
What problem are you trying to solve?
In Karpenter pre-v0.28, I had started using the drift annotation karpenter.sh/voluntary-disruption=drifted
as a way to force nodes to be replaced in an orderly fashion when I changed configuration that Karpenter's drift detection did not cover.
In v0.28 this behavior was removed; Karpenter now simply strips the annotation.
I found the ability to trigger drift useful for testing and for filling the gaps in drift support. I'd also assume that, long term, there will be corner cases where users want to trigger replacement that drift cannot detect, or cannot detect easily.
Perhaps just another annotation indicating user-requested drift, so that Karpenter can replace nodes in an orderly manner while respecting deprovisioning controls.
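For reference, the pre-v0.28 workflow was just annotating the target node and letting the drift deprovisioner handle the orderly replacement; a re-introduced user-facing annotation could work the same way. A rough sketch (the node name and the second annotation key are purely illustrative, not an existing Karpenter API):

# Pre-v0.28: manually mark a node as drifted so Karpenter replaces it gracefully
kubectl annotate node ip-10-0-12-34.ec2.internal karpenter.sh/voluntary-disruption=drifted

# Hypothetical user-requested variant of the same idea (this annotation key is made up)
kubectl annotate node ip-10-0-12-34.ec2.internal karpenter.sh/user-requested-disruption=true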
How important is this feature to you?
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave “+1” or “me too” comments; they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
About this issue
- State: open
- Created a year ago
- Reactions: 13
- Comments: 16 (6 by maintainers)
Hey folks, sorry for the delay. I am starting work on this this week. Will keep you posted here.
One workflow that has come up that might be applicable here is how to eventually force all pods that are using
karpenter.sh/do-not-evict: "true"
to get consolidated/drifted in a timely manner. We've found that the nodes end up getting left around, since Karpenter also does not cordon the nodes these pods are on to allow them to eventually drain out of the cluster.
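For anyone debugging the same situation, a quick way to see which pods are pinning nodes is to list everything carrying that annotation. A rough sketch, assuming jq is available:

# List pods that opt out of eviction (and the nodes they keep alive)
kubectl get pods -A -o json \
  | jq -r '.items[]
      | select(.metadata.annotations["karpenter.sh/do-not-evict"] == "true")
      | "\(.metadata.namespace)/\(.metadata.name) on \(.spec.nodeName)"'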
Another thought would be to modify how this annotation is treated if maintenance-window support were added to Karpenter (ref: aws/karpenter-core#753). Operationally, at least, allowing a maintenance window to force node replacements is likely desirable for a lot of organizations.
I mean, right now there are a ton of gaps. It was quite useful that I could just annotate all the nodes at once and Karpenter would replace them serially to minimize disruption in the cluster. It generally hasn't been clear to me how disruptive it'd be to just delete a pile of nodes wholesale.
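Roughly, that bulk annotate was a one-liner against all nodes of a provisioner (the selector shown assumes the standard karpenter.sh/provisioner-name node label; adjust for your setup):

# Pre-v0.28: mark all nodes from the "default" provisioner as drifted;
# Karpenter then replaced them serially while respecting deprovisioning controls
kubectl annotate node -l karpenter.sh/provisioner-name=default karpenter.sh/voluntary-disruption=drifted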
Long term, I was sorta expecting there are always going to be gaps (is Karpenter going to be able to drift on userdata changes, for example?).
My current migration is dockerd -> containerd, which drift does not detect.
I don't think I'll have slack this quarter to get into this.