kubernetes: HPA refuses to scale if any custom metric is missing
/kind bug
What happened:
I set up custom-metrics-based autoscaling with multiple metrics. One of those metrics was not available. The HPA reports all metrics as "unknown" (even the ones that are still available) and refuses to operate.
Metrics: ( current / target )
"sockjs_sessions_current" on pods: <unknown> / 500
"ddp_method_calls" on pods: <unknown> / 25
"http_requests" on pods: <unknown> / 25
resource cpu on pods (as a percentage of request): <unknown> / 60%
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False FailedGetPodsMetric the HPA was unable to compute the replica count: unable to get metric ddp_method_calls: no metrics returned from custom metrics API
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedGetPodsMetric 1m (x1052 over 17h) horizontal-pod-autoscaler unable to get metric ddp_method_calls: no metrics returned from custom metrics API
What you expected to happen:
Pods would be automatically scaled based on metrics that were available, ignoring missing metrics (treating them as zero).
A partial failure in the metrics system should not prevent autoscaling from proceeding with the data it does have. For example, the CPU % metric comes from the k8s metrics server whereas the other metrics come from the Prometheus adapter. If the Prometheus adapter goes away, we can still use the CPU metric as a lower bound on the number of replicas.
Looking at, e.g., https://github.com/kubernetes/kubernetes/blob/e99ec245958f82acb2404f8597844d62b8f459c9/pkg/controller/podautoscaler/horizontal.go#L242 : instead of aborting the whole metric-gathering loop there, the controller could store a placeholder or zero value for the metric it failed to fetch and carry on with the remaining metrics.
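To make that concrete, here is a rough sketch (with made-up names like `fetchProposal`, not the real controller code) of how that loop could tolerate a single failing metric: compute a proposal for every metric that can still be fetched, scale on the maximum, and only give up when nothing at all could be evaluated.

```go
package main

import (
	"errors"
	"fmt"
)

// MetricSpec is a stand-in for the real autoscaling/v2beta1.MetricSpec.
type MetricSpec struct {
	Name string
}

// fetchProposal is a placeholder for the per-metric replica calculation the
// controller does today around horizontal.go#L242; it is not a real function.
func fetchProposal(spec MetricSpec, currentReplicas int32) (int32, error) {
	if spec.Name == "ddp_method_calls" {
		return 0, errors.New("no metrics returned from custom metrics API")
	}
	// Pretend every reachable metric asks for one more replica.
	return currentReplicas + 1, nil
}

// computeReplicasForMetrics keeps going past failed metrics instead of
// returning on the first error: the missing metric is recorded, but the
// remaining metrics still produce a scaling decision.
func computeReplicasForMetrics(specs []MetricSpec, currentReplicas int32) (int32, error) {
	var proposal int32
	var failures []error
	anySucceeded := false

	for _, spec := range specs {
		replicas, err := fetchProposal(spec, currentReplicas)
		if err != nil {
			// Record the failure (the real controller would also emit an event
			// and set a condition) and move on to the next metric.
			failures = append(failures, fmt.Errorf("%s: %v", spec.Name, err))
			continue
		}
		if !anySucceeded || replicas > proposal {
			proposal = replicas
		}
		anySucceeded = true
	}

	if !anySucceeded {
		// Only refuse to scale when no metric at all could be evaluated.
		return 0, fmt.Errorf("all metrics failed: %v", failures)
	}
	return proposal, nil
}

func main() {
	specs := []MetricSpec{{"cpu"}, {"http_requests"}, {"ddp_method_calls"}}
	replicas, err := computeReplicasForMetrics(specs, 3)
	fmt.Println(replicas, err) // 4 <nil>: the missing metric no longer blocks scaling
}
```

Treating a failed fetch as "skip this metric" rather than "abort everything" is what keeps the CPU metric usable as a lower bound when the Prometheus adapter is down.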
How to reproduce it (as minimally and precisely as possible):
Set up a custom metrics server and tell the HPA to scale based on several metrics, one of which is not actually available. A sketch of such an HPA is below.
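For illustration only (all names, namespaces and targets here are made up, this is not the manifest from my cluster), something along these lines builds an autoscaling/v2beta1 HPA with a CPU target plus two pod metrics, one of which the adapter never serves:

```go
package main

import (
	"encoding/json"
	"fmt"

	autoscaling "k8s.io/api/autoscaling/v2beta1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	minReplicas := int32(2)
	cpuTarget := int32(60)

	hpa := autoscaling.HorizontalPodAutoscaler{
		TypeMeta:   metav1.TypeMeta{APIVersion: "autoscaling/v2beta1", Kind: "HorizontalPodAutoscaler"},
		ObjectMeta: metav1.ObjectMeta{Name: "myapp", Namespace: "default"},
		Spec: autoscaling.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscaling.CrossVersionObjectReference{
				APIVersion: "apps/v1", Kind: "Deployment", Name: "myapp",
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 20,
			Metrics: []autoscaling.MetricSpec{
				{
					Type:     autoscaling.ResourceMetricSourceType,
					Resource: &autoscaling.ResourceMetricSource{Name: corev1.ResourceCPU, TargetAverageUtilization: &cpuTarget},
				},
				{
					Type: autoscaling.PodsMetricSourceType,
					Pods: &autoscaling.PodsMetricSource{MetricName: "http_requests", TargetAverageValue: resource.MustParse("25")},
				},
				{
					// This metric is never exported by the adapter, which is enough
					// to put the whole HPA into the FailedGetPodsMetric state above.
					Type: autoscaling.PodsMetricSourceType,
					Pods: &autoscaling.PodsMetricSource{MetricName: "does_not_exist", TargetAverageValue: resource.MustParse("25")},
				},
			},
		},
	}

	out, _ := json.MarshalIndent(hpa, "", "  ")
	fmt.Println(string(out)) // kubectl apply -f accepts this JSON directly
}
```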
Environment:
kops 1.8 on AWS
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T10:09:24Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.8", GitCommit:"2f73858c9e6ede659d6828fe5a1862a48034a0fd", GitTreeState:"clean", BuildDate:"2018-02-09T21:23:25Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
About this issue
- State: closed
- Created 6 years ago
- Reactions: 2
- Comments: 24 (15 by maintainers)
This issue was discussed at SIG Autoscaling yesterday: https://docs.google.com/document/d/1RvhQAEIrVLHbyNnuaT99-6u9ZUMp7BfkPupT2LAZK7w/edit#heading=h.oh2koj9sbr3x
I’m going to raise a new PR later this week given the code’s changed so much since the original PR was raised by @bskiba that it doesn’t make sense to just rebase it.
@yastij: Reopened this issue.
@DirectXMan12 as soon as fetching any one of the metrics fails, computeReplicasForMetrics returns, so scaling stops working entirely; that is not highly available. When fetching a metric fails, should we just record an event and not return? :)
/reopen
/remove-lifecycle rotten
/lifecycle frozen
/priority backlog
/reopen