descheduler: HighNodeUtilization does nothing when all nodes are underutilized
What version of descheduler are you using?
descheduler version: 0.23
Does this issue reproduce with the latest release? Yes
Which descheduler CLI options are you using? Using helm chart 0.23.1 with these overrides:
```yaml
deschedulerPolicy:
  strategies:
    HighNodeUtilization:
      enabled: true
      params:
        nodeResourceUtilizationThresholds:
          thresholds:
            memory: 20
          numberOfNodes: 0
    LowNodeUtilization:
      enabled: false
schedule: "5 10 * * *"
```
What k8s version are you using (`kubectl version`)?
```
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
```
What did you do?
Installed descheduler. Ran the cron job manually.
What did you expect to see?
The descheduler evicting pods from some of the underutilized instances.
What did you see instead?
Descheduler didn’t evict any pods. Saw this in the log:
```
I0214 21:57:50.046412 1 highnodeutilization.go:90] "Criteria for a node below target utilization" CPU=100 Mem=20 Pods=100
I0214 21:57:50.046417 1 highnodeutilization.go:91] "Number of underutilized nodes" totalNumber=4
I0214 21:57:50.046421 1 highnodeutilization.go:102] "All nodes are underutilized, nothing to do here"
```
These lines of code https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/descheduler/strategies/nodeutilization/highnodeutilization.go#L101-L103 seem to be very similar to code in LowNodeUtilization https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/descheduler/strategies/nodeutilization/lownodeutilization.go#L128-L130, perhaps a stray copy-paste?
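For reference, here is a simplified, hypothetical restatement of what that guard does (the identifiers below are illustrative, not the actual descheduler code):

```go
package main

import "fmt"

// skipStrategy restates the shape of the guard linked above. For
// LowNodeUtilization an early return like this is sound: if no node is
// overutilized, there is nothing to relieve. For HighNodeUtilization the same
// guard short-circuits exactly the situation that bin-packing should act on.
func skipStrategy(underutilizedNodes, totalNodes int) bool {
	return underutilizedNodes == totalNodes
}

func main() {
	// Matches the log above: 4 of 4 nodes are under the memory threshold,
	// so the strategy exits without evicting anything.
	fmt.Println(skipStrategy(4, 4)) // true
}
```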
About this issue
- State: open
- Created 2 years ago
- Comments: 29 (16 by maintainers)
This seems like a bug to me; this line was probably carried over from LowNodeUtilization, where a check like this makes more sense. But the point of HighNodeUtilization is to achieve bin-packing, so when all nodes are underutilized we should definitely be taking some action. The question is what that action should be.
I think we could take a more opinionated approach than this. Since HighNodeUtilization specifically tries to evict pods from the least-utilized nodes, that gives us a goal to work toward. Maybe we could sort the nodes by utilization and evict pods from the least-utilized nodes until the more utilized nodes become full (or at least until we can assume they'll be full), similar to the pod topology spread balancing strategy.
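A minimal sketch of that idea, assuming a single scalar utilization per identically sized node (the type and the numbers below are made up for illustration):

```go
package main

import (
	"fmt"
	"sort"
)

// nodeInfo is a deliberately simplified view of a node: a name plus a single
// scalar utilization (fraction of allocatable). The real strategy works with
// full resource lists (CPU, memory, pods, extended resources).
type nodeInfo struct {
	name string
	util float64
}

// pickEvictionSources sorts nodes ascending by utilization and keeps draining
// the emptiest ones for as long as the remaining nodes have enough spare
// capacity to absorb everything evicted so far.
func pickEvictionSources(nodes []nodeInfo) []nodeInfo {
	sort.Slice(nodes, func(i, j int) bool { return nodes[i].util < nodes[j].util })

	var sources []nodeInfo
	pending := 0.0 // utilization already marked for eviction
	for i := 0; i < len(nodes)-1; i++ {
		spare := 0.0
		for _, n := range nodes[i+1:] {
			spare += 1.0 - n.util
		}
		if pending+nodes[i].util > spare {
			break // the remaining nodes could not absorb this node's pods
		}
		pending += nodes[i].util
		sources = append(sources, nodes[i])
	}
	return sources
}

func main() {
	nodes := []nodeInfo{{"a", 0.40}, {"b", 0.10}, {"c", 0.30}, {"d", 0.15}}
	// Drains b, d and c; node a would end up around 95% utilized once their
	// pods are rescheduled onto it.
	fmt.Println(pickEvictionSources(nodes))
}
```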
If the goal of HighNodeUtilization is to bin-pack, then imo the answer is “all the nodes”. For example if you have 4 nodes with these utilizations:
Then evicting from the lowest onto the highest would follow these steps:
Since the scheduler doesn’t have any configurable threshold for “fully-packed” (i.e., when configured this way it always tries for 100% utilization), staying in sync with the assumed re-scheduling strategy means working from the assumption that we want as few nodes as possible at 100% and as many as possible at 0%.
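To make the “as few nodes as possible at 100%” target concrete, here is a tiny worked illustration (my own numbers, assuming identically sized nodes and a single scalar resource):

```go
package main

import (
	"fmt"
	"math"
)

// minFullNodes computes the lower bound implied by the assumption above: with
// a 100% packing target, the smallest number of nodes that can hold the
// aggregate load is the ceiling of the summed per-node utilizations. This
// assumes identically sized nodes and one scalar resource, which is a big
// simplification of what the descheduler actually has to reason about.
func minFullNodes(utils []float64) int {
	total := 0.0
	for _, u := range utils {
		total += u
	}
	return int(math.Ceil(total))
}

func main() {
	// Four underutilized nodes whose combined load fits on one node.
	fmt.Println(minFullNodes([]float64{0.10, 0.15, 0.30, 0.25})) // 1
	// A heavier cluster where at least two nodes must stay "full".
	fmt.Println(minFullNodes([]float64{0.60, 0.55, 0.30})) // 2
}
```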
I wonder what the percentage of fully evictable nodes compared to “fully-packed” nodes is in the wild. Given that the descheduler cannot simulate what the optimal number of nodes to leave intact is, we still need some artificial threshold, i.e. a way of saying “this is where we stop evicting so we don’t over-flood the remaining nodes”. The current algorithm is too simple, and presumably too sub-optimal, to take that assumption into account.
I wonder whether it is worth extending the current implementation toward that assumption, or re-implementing it (maybe as a new strategy replacing this one) to evict pods from the lowest- to the highest-utilized nodes. I have not explored this path yet. Open to discussion.
Questions to think about:
- If all nodes are underutilized, how can one decide which nodes should be completely drained and which should stay as potential target nodes for attracting evicted pods?
- How many of the least-utilized nodes should we evict from, and how many of the more utilized nodes should we target (taking into account not just the case where all the nodes are underutilized)?
The threshold was introduced as a trivial way of dividing nodes into two groups: one group to evict pods from, the other as potential targets for collocating pods to achieve the bin-packing (it is up to the kube-scheduler to decide which nodes actually get targeted). In this case the threshold is too high to evict pods from any node, so you need an adaptive approach which lowers the threshold. Something like: “if all nodes are underutilized, find the first X% of nodes which have the lowest utilization (wrt. native and extended resources) and dynamically adjust the threshold”. The X can be set to 50% or higher based on customer use cases. The strategy would then “just” adapt the threshold and run one more time with the new threshold, with the expectation that next time there will be at least one node that is minimally utilized (on the scale [underutilized, utilized, overutilized]).
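A rough sketch of how such an adaptive threshold could be computed (the handling of X and the utilization numbers below are made up for illustration):

```go
package main

import (
	"fmt"
	"sort"
)

// adaptThreshold sketches the adaptive idea above: when every node is below
// the configured threshold, pick the X% least-utilized nodes and return a new,
// lower threshold that keeps only those nodes in the "evict from" group.
// Utilizations here are hypothetical single-resource percentages; the real
// strategy evaluates CPU, memory, pods, and extended resources per node.
func adaptThreshold(utils []float64, x float64) float64 {
	sorted := append([]float64(nil), utils...)
	sort.Float64s(sorted)

	// Number of nodes to keep below the adapted threshold (at least one,
	// and never all of them).
	k := int(float64(len(sorted)) * x / 100.0)
	if k < 1 {
		k = 1
	}
	if k >= len(sorted) {
		k = len(sorted) - 1
	}
	// The midpoint between the k-th and (k+1)-th lowest utilization separates
	// the eviction sources from the potential target nodes.
	return (sorted[k-1] + sorted[k]) / 2.0
}

func main() {
	// All four nodes sit below the configured memory threshold of 20%.
	utils := []float64{5, 8, 12, 18}
	// With X=50%, the adapted threshold drops to 10: the nodes at 5% and 8%
	// become eviction sources, the nodes at 12% and 18% become targets.
	fmt.Println(adaptThreshold(utils, 50)) // 10
}
```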
@slobo I wonder if dynamically adapting the threshold would be something applicable in your use case(s)?
If all nodes are underutilized, sort the nodes by utilization and evict pods from the least-utilized nodes. Very good idea! Could I have a try at this enhancement?
I agree, I’d expect the least-utilized node to get evicted.
This is somewhat tangential, but perhaps we could have an option for the descheduler to aggressively force “optimal” packing by cordoning off the nodes before evicting pods from them? That way there would be no chance of pods ending up back on low-utilization nodes, regardless of the scheduler configuration - i.e. you wouldn’t need to set up NodeResourcesFit = MostAllocated to get at least some effect from descheduling.
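A minimal client-go sketch of what “cordon first, then evict” could look like (my own sketch of the idea, not existing descheduler behavior; the node name and in-cluster config are placeholders):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// cordonNode marks a node unschedulable before its pods are evicted, so the
// scheduler cannot place the evicted pods right back onto it.
func cordonNode(ctx context.Context, client kubernetes.Interface, name string) error {
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	_, err := client.CoreV1().Nodes().Patch(ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "ip-10-0-0-1.example" is a placeholder node name.
	if err := cordonNode(context.Background(), client, "ip-10-0-0-1.example"); err != nil {
		panic(err)
	}
	fmt.Println("node cordoned; evicted pods cannot land back on it")
}
```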