kubernetes: HPA refuses to scale if any custom metric is missing
/kind bug
What happened:
I set up custom-metrics-based autoscaling with multiple metrics. One of those metrics was not available. The HPA reports all metrics as "unknown" (even the ones that are still available) and refuses to operate.
Metrics: ( current / target )
"sockjs_sessions_current" on pods: <unknown> / 500
"ddp_method_calls" on pods: <unknown> / 25
"http_requests" on pods: <unknown> / 25
resource cpu on pods (as a percentage of request): <unknown> / 60%
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False FailedGetPodsMetric the HPA was unable to compute the replica count: unable to get metric ddp_method_calls: no metrics returned from custom metrics API
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedGetPodsMetric 1m (x1052 over 17h) horizontal-pod-autoscaler unable to get metric ddp_method_calls: no metrics returned from custom metrics API
What you expected to happen:
Pods would be automatically scaled based on metrics that were available, ignoring missing metrics (treating them as zero).
A partial failure in the metrics system should not prevent autoscaling from proceeding with the data it does have. For example, the CPU % metric comes from the k8s metrics server whereas the other metrics come from the Prometheus adapter. If the Prometheus adapter goes away, we can still use the CPU metric as a lower bound on the number of replicas.
Looking at, e.g., https://github.com/kubernetes/kubernetes/blob/e99ec245958f82acb2404f8597844d62b8f459c9/pkg/controller/podautoscaler/horizontal.go#L242 : instead of aborting the whole metric-gathering loop there, the controller could store a placeholder or zero value for the metric it failed to fetch and carry on with the remaining metrics.
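To make that concrete, here is a rough sketch (with made-up names like `fetchProposal`, not the real controller code) of how that loop could tolerate a single failing metric: compute a proposal for every metric that can still be fetched, scale on the maximum, and only give up when nothing at all could be evaluated.

```go
package main

import (
	"errors"
	"fmt"
)

// MetricSpec is a stand-in for the real autoscaling/v2beta1.MetricSpec.
type MetricSpec struct {
	Name string
}

// fetchProposal is a placeholder for the per-metric replica calculation the
// controller does today around horizontal.go#L242; it is not a real function.
func fetchProposal(spec MetricSpec, currentReplicas int32) (int32, error) {
	if spec.Name == "ddp_method_calls" {
		return 0, errors.New("no metrics returned from custom metrics API")
	}
	// Pretend every reachable metric asks for one more replica.
	return currentReplicas + 1, nil
}

// computeReplicasForMetrics keeps going past failed metrics instead of
// returning on the first error: the missing metric is recorded, but the
// remaining metrics still produce a scaling decision.
func computeReplicasForMetrics(specs []MetricSpec, currentReplicas int32) (int32, error) {
	var proposal int32
	var failures []error
	anySucceeded := false

	for _, spec := range specs {
		replicas, err := fetchProposal(spec, currentReplicas)
		if err != nil {
			// Record the failure (the real controller would also emit an event
			// and set a condition) and move on to the next metric.
			failures = append(failures, fmt.Errorf("%s: %v", spec.Name, err))
			continue
		}
		if !anySucceeded || replicas > proposal {
			proposal = replicas
		}
		anySucceeded = true
	}

	if !anySucceeded {
		// Only refuse to scale when no metric at all could be evaluated.
		return 0, fmt.Errorf("all metrics failed: %v", failures)
	}
	return proposal, nil
}

func main() {
	specs := []MetricSpec{{"cpu"}, {"http_requests"}, {"ddp_method_calls"}}
	replicas, err := computeReplicasForMetrics(specs, 3)
	fmt.Println(replicas, err) // 4 <nil>: the missing metric no longer blocks scaling
}
```

Treating a failed fetch as "skip this metric" rather than "abort everything" is what keeps the CPU metric usable as a lower bound when the Prometheus adapter is down.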
How to reproduce it (as minimally and precisely as possible):
Set up a custom metrics server and tell the HPA to scale based on several metrics, one of which is not actually available. A sketch of such an HPA is below.
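For illustration only (all names, namespaces and targets here are made up, this is not the manifest from my cluster), something along these lines builds an autoscaling/v2beta1 HPA with a CPU target plus two pod metrics, one of which the adapter never serves:

```go
package main

import (
	"encoding/json"
	"fmt"

	autoscaling "k8s.io/api/autoscaling/v2beta1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	minReplicas := int32(2)
	cpuTarget := int32(60)

	hpa := autoscaling.HorizontalPodAutoscaler{
		TypeMeta:   metav1.TypeMeta{APIVersion: "autoscaling/v2beta1", Kind: "HorizontalPodAutoscaler"},
		ObjectMeta: metav1.ObjectMeta{Name: "myapp", Namespace: "default"},
		Spec: autoscaling.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscaling.CrossVersionObjectReference{
				APIVersion: "apps/v1", Kind: "Deployment", Name: "myapp",
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 20,
			Metrics: []autoscaling.MetricSpec{
				{
					Type:     autoscaling.ResourceMetricSourceType,
					Resource: &autoscaling.ResourceMetricSource{Name: corev1.ResourceCPU, TargetAverageUtilization: &cpuTarget},
				},
				{
					Type: autoscaling.PodsMetricSourceType,
					Pods: &autoscaling.PodsMetricSource{MetricName: "http_requests", TargetAverageValue: resource.MustParse("25")},
				},
				{
					// This metric is never exported by the adapter, which is enough
					// to put the whole HPA into the FailedGetPodsMetric state above.
					Type: autoscaling.PodsMetricSourceType,
					Pods: &autoscaling.PodsMetricSource{MetricName: "does_not_exist", TargetAverageValue: resource.MustParse("25")},
				},
			},
		},
	}

	out, _ := json.MarshalIndent(hpa, "", "  ")
	fmt.Println(string(out)) // kubectl apply -f accepts this JSON directly
}
```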
Environment:
kops 1.8 on AWS
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T10:09:24Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.8", GitCommit:"2f73858c9e6ede659d6828fe5a1862a48034a0fd", GitTreeState:"clean", BuildDate:"2018-02-09T21:23:25Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
About this issue
- State: closed
- Created 6 years ago
- Reactions: 2
- Comments: 24 (15 by maintainers)
This issue was discussed at SIG Autoscaling yesterday: https://docs.google.com/document/d/1RvhQAEIrVLHbyNnuaT99-6u9ZUMp7BfkPupT2LAZK7w/edit#heading=h.oh2koj9sbr3x
I’m going to raise a new PR later this week given the code’s changed so much since the original PR was raised by @bskiba that it doesn’t make sense to just rebase it.
@yastij: Reopened this issue.
@DirectXMan12 as soon as fetching any one of the metrics fails, computeReplicasForMetrics returns, so scaling stops working entirely; that is not highly available. When fetching a metric fails, should we just record an event and not return? :)
/reopen
/remove-lifecycle rotten
/lifecycle frozen
/priority backlog
/reopen