autoscaler: [cluster-autoscaler] CriticalAddonsOnly taint ignored
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version: 1.20.0
What k8s version are you using (kubectl version)?:
1.20.5
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:28:42Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"54684493f8139456e5d2f963b23cb5003c4d8055", GitTreeState:"clean", BuildDate:"2021-03-22T23:02:59Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
What environment is this in?:
Azure, with VMSS scaling configured via auto-discovery
What did you expect to happen?:
When I apply taint CriticalAddonsOnly=true:NoSchedule to a node group, it should be respected.
What happened instead?:
This particular taint is not respected during scaling calculations.
How to reproduce it (as minimally and precisely as possible):
Create one node group with this taint and one without, scale up a deployment whose pods have no matching toleration, and notice that the tainted node group occasionally gets scaled.
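To make “without a matching toleration” concrete, here is a small standalone Go sketch (a hypothetical program, not autoscaler code, using only the upstream k8s.io/api types) that checks whether such a pod tolerates the taint:
```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// The taint applied to the tainted node group in the reproduction above.
	taint := v1.Taint{Key: "CriticalAddonsOnly", Value: "true", Effect: v1.TaintEffectNoSchedule}

	// The scaled deployment's pods carry no tolerations at all.
	var podTolerations []v1.Toleration

	tolerated := false
	for i := range podTolerations {
		if podTolerations[i].ToleratesTaint(&taint) {
			tolerated = true
			break
		}
	}

	// Prints "false": the pod cannot run on the tainted group, yet the
	// autoscaler still scales that group up occasionally.
	fmt.Println("pod tolerates the taint:", tolerated)
}
```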
Anything else we need to know?:
This appears to be a result of this taint-filtering logic: https://github.com/kubernetes/autoscaler/blob/d33cc1bc400f6d5cccaa0ca95696fa0e1780df29/cluster-autoscaler/utils/taints/taints.go#L70
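For readers who don’t want to follow the link, here is a paraphrased Go sketch of that filter (not the exact upstream source; the lowercased constant names only approximate the ones in utils/taints/taints.go). When the autoscaler builds the node template used for scale-up simulation, any taint with this key is silently dropped:
```go
package taints

import apiv1 "k8s.io/api/core/v1"

const (
	toBeDeletedTaint       = "ToBeDeletedByClusterAutoscaler"
	deletionCandidateTaint = "DeletionCandidateOfClusterAutoscaler"
	reschedulerTaintKey    = "CriticalAddonsOnly" // legacy rescheduler taint
)

// sanitizeTaints approximates how taints are cleaned when building the node
// template used for scale-up simulation.
func sanitizeTaints(taints []apiv1.Taint, ignoredTaintKeys map[string]bool) []apiv1.Taint {
	var kept []apiv1.Taint
	for _, t := range taints {
		switch {
		case t.Key == toBeDeletedTaint, t.Key == deletionCandidateTaint:
			// Autoscaler-internal taints are always stripped.
			continue
		case t.Key == reschedulerTaintKey:
			// The surprising part: CriticalAddonsOnly is hard-coded here, so
			// the taint never reaches the scheduling simulation and the
			// tainted group looks like a valid target for any pending pod.
			continue
		case ignoredTaintKeys[t.Key]:
			// Taints the operator asked to ignore are also dropped.
			continue
		}
		kept = append(kept, t)
	}
	return kept
}
```
If that reading is right, a node group carrying only this taint is simulated as if it were untainted, which matches the behavior in the reproduction above.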
I believe this is the problem in this old issue that went unresolved: https://github.com/kubernetes/autoscaler/issues/2434
After searching the repository, I could not find any references to this value in the core logic. While I may be missing something, I am not sure what the intended purpose of this filter is. The only thing I can find is a reference to a “rescheduler”. I think it may be related to this proposal, but I’m not sure if it’s still relevant: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/rescheduling.md
It is hindering my use case on Azure/AKS because that is the only taint that can be added to the default system node pool. Scaling is frequently very delayed, as the autoscaler will scale this node pool incorrectly and repeatedly. My hacky workaround is to have the autoscaler ignore this node pool entirely, but I am afraid this unexpected behavior will continue to surprise people. Moreover, I’d like to be able to stop ignoring this pool in the future.
(1) Can it be removed? (2) If not, can a warning and explanation be added to the FAQ?
About this issue
- State: closed
- Created 3 years ago
- Reactions: 3
- Comments: 58 (37 by maintainers)
Commits related to this issue
- Remove `CriticalAddonsOnly` toleration This taint is not documented to be involved in any special handling according to the official Kubernetes documentation, see: - https://github.com/kubernetes/aut... — committed to timuthy/gardener by timuthy a year ago
- Drop `nodeSelector` and `tolerations` from Shoot System Components (#7304) * Remove system components node selector * Remove `CriticalAddonsOnly` toleration This taint is not documented to be invol... — committed to gardener/gardener by timuthy a year ago
- Clean up tolerations `CriticalAddonsOnly` taint is not documented to be involved in any special handling according to the official Kubernetes documentation, see: - https://github.com/kubernetes/autos... — committed to timuthy/gardener-extension-networking-calico by timuthy a year ago
I’m guessing CriticalAddonsOnly became a private taint when we introduced the registry for well-known ones (Apr 20, 2017 according to Git history). We didn’t go around and look for legacy uses across the project at that point. As it’s been around so long, I strongly agree on the value of some kind of deprecation process.
Contrary to what comments above say - it’s not a private extension; CriticalAddonsOnly was an official Kubernetes taint (to the extent anything was official in those days - it was used by core k8s code, which I think was how we defined “official” back in 2017 😃). It’s not in the well-known labels and it doesn’t have the kubernetes.io/ prefix because it predates organizing well-known labels or standardizing on prefixes (the same reason CA’s own to-be-deleted taint doesn’t use a prefix).
This taint was used by the ‘rescheduler’, which very much existed and wasn’t just a proposal. I don’t remember the details after all these years, but back then pod priority didn’t exist and daemonset pods were scheduled directly by the DaemonSet controller, not by the scheduler. IIRC the Rescheduler was basically a separate controller doing the equivalent of today’s pod preemption for system pods. Since there was no pod priority back then, it had to temporarily taint the node while performing the preemption; otherwise any pod could schedule using the capacity freed by the preemption. The only way to guarantee the intended pod would use the capacity was to temporarily taint the node so that non-system pods couldn’t schedule there. Once the preemption was done, the taint would be removed. I’m sure I’m getting some details wrong, but I think that was the general idea. The Rescheduler seems to have been removed in 1.11: https://github.com/kubernetes/kubernetes/pull/67687.
So yeah - this is very much legacy. I don’t think CA should be required to maintain support for a feature removed from kubernetes back in 2018 and I’d be happy to approve a PR removing it.
Now - the issue brought to sig-autoscaling seems to have been triggered by the idea of using this particular taint to mark control-plane nodes. I don’t think that’s consistent with how the CriticalAddonsOnly taint was originally intended to be used, and I guess that could be considered a private extension (using an “official” taint for a different purpose). I have no idea what platforms out there may be relying on it or whether they will be impacted by the change, and it will technically be a backward-incompatible change, so I’d ask anyone making this change to include a release note that clearly calls out the backward-incompatible change in behavior.
Given that the CA already has flags for ignoring certain taints, it seems to me implicit + immutable ignore behavior should be removed. It might generate some very minor work for cloud providers that have relied on the implicit, undocumented behavior, but as long as this is coupled with a major version bump that seems completely reasonable (in my personal opinion).
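As a rough illustration of that suggestion (a hypothetical sketch continuing the one above, same imports; this is not the actual change), removing the hard-coded case would leave only the operator-configured ignore list in play:
```go
// Hypothetical sketch: with the hard-coded CriticalAddonsOnly case removed,
// only taint keys the operator explicitly configures (e.g. via the existing
// ignore-taints style flags) are stripped from the simulated node template.
func sanitizeTaintsConfigurable(taints []apiv1.Taint, ignoredTaintKeys map[string]bool) []apiv1.Taint {
	var kept []apiv1.Taint
	for _, t := range taints {
		if ignoredTaintKeys[t.Key] {
			continue
		}
		kept = append(kept, t)
	}
	return kept
}
```
Anyone relying on the old implicit behavior could then opt back in by listing the taint explicitly in that configuration.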
Doing the simplest thing first based on my limited understanding: https://github.com/kubernetes/autoscaler/pull/5838 (WIP)
BTW, CriticalAddonsOnly is not an official Kubernetes taint; if it were, you’d see it listed in the Well-Known Labels, Annotations and Taints documentation and it would be prefixed, e.g. with kubernetes.io/.
For example, nodes can come online tainted as node.cloudprovider.kubernetes.io/uninitialized. That’s a real, registered taint, and the prefix makes it public. Other organizations can register their own public taints with a prefix they control.
Using a private taint, CriticalAddonsOnly, is unhelpful, and cluster operators should try not to do so. On the other hand, cluster-autoscaler is generally available. I think that cluster-autoscaler’s use of this private taint represents an architecture bug.
@jclangst I’d suggest first implementing option 1 for v1.23.0 as that should be the correct behaviour moving forwards. Then if possible implementing option 2 to be backported as a patch to v1.21 & v1.22 would give a solution for Azure.
Link to agenda for upcoming sig meeting: https://docs.google.com/document/d/1RvhQAEIrVLHbyNnuaT99-6u9ZUMp7BfkPupT2LAZK7w/edit#bookmark=id.e9v2tiuubvsq
/assign vadasambar
Came to know about this issue today because of https://kubernetes.slack.com/archives/C09R1LV8S/p1685972225498489
I am not sure if I have fully grasped the details here. I have added it as an item on the agenda for the upcoming sig meeting to decide what to do about this. Once we reach a decision, I would be happy to raise a PR (I am working on 2 other PRs so some delay should be expected).
I am blocked by the same issue, and have to fork cluster-autoscaler to remove this special case to get my clusters to work the way I expect them to.
I would really appreciate at least an option to toggle this behavior.
@pierluigilenoci Well I had to try. Ultimately, we just want our configurations to work regardless of where our cluster is hosted!
@stevehipwell here: https://github.com/Azure/AKS/issues/2513
@stevehipwell You are making the claim that there is an “existing special taint.” Kubernetes has specific documentation for all of the common taints, which you can find here; this taint is not listed in the spec. I have also done the due diligence of searching the source code of ALL of the core Kubernetes components (including kubeadm and kube-scheduler), and there is no indication that this taint is used. I highly encourage you to do a thorough investigation yourself. I am really not sure what more I can provide that would change your mind.
That said, I do appreciate the effort you have put into searching for the deployment patterns. This is a holdover from old (5+ years) versions of Kubernetes that DID use this taint (for the rescheduler proposal); I have linked to that proposal in a previous comment, and I encourage you to review the sequence of events yourself.
I definitely agree with your assessment that it’s possible the CA has never worked on these AKS nodes. But in the context of all of the other evidence, I do not feel this is a bug with AKS but rather a bug with the CA.
I think this is a cogent consideration. Fortunately, no documentation would need to be changed, as the behavior is completely undocumented, which is why my teams ran into unexpected behavior in the first place. Given that current versions of Kubernetes do not use this taint (again, please feel free to review the source), it is not in the official Kubernetes spec, and it is undocumented behavior of the CA, I don’t see how this could be anything other than a bug. I do not see an issue with putting the bugfix into the current version (or even backporting it to previous versions).
That said, if you can provide an example of how this may be disruptive, I definitely want to make sure your use cases are considered. It could be that I am simply not understanding how this would affect your setup. What I can say is that the current behavior is negatively impacting the setups of most AKS customers, myself included.