keda: KEDA 2.5 does not cleanly upgrade from 2.4

Report

There appears to be a bug that prevents a clean and safe upgrade from KEDA 2.4 to KEDA 2.5, possibly related to this PR, which changed metric names, or this one. This affects pre-existing ScaledObjects that are present at the time of the upgrade from 2.4.

The symptom is that the HPA loop attempts to evaluate a metric which does not actually exist within the Kubernetes external metrics API. Below is a snippet of the output of a kubectl describe hpa where the new KEDA 2.5 metric-name format is queried, followed by the result of querying the external metrics API directly:

Metrics:                                               ( current / target )
  "s1-prometheus-burrow_lag" (target average value):   <unknown> / 120M
  resource cpu on pods  (as a percentage of request):  108% (3267m) / 100%
  Warning  FailedGetExternalMetric  3m6s (x791 over 3h22m)  horizontal-pod-autoscaler  unable to get external metric s001/s1-prometheus-burrow_lag/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name:,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for s1-prometheus-burrow_lag

kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1/namespaces/s001/s1-prometheus-burrow_lag?labelSelector=app=myApp' | jq
Error from server: No matching metrics found for s1-prometheus-burrow_lag

Reverting to KEDA 2.4 immediately fixes the issue and resumes use of the old metric names. When in the errored state, remediation was also possible by deleting all ScaledObjects and recreating them; this appears to trigger a reconciliation of the recreated ScaledObject, at which point the new-style metric becomes available. A sketch of that remediation is shown below.
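
A minimal sketch of the delete-and-recreate remediation, assuming the ScaledObject manifests are still available locally; the namespace s001 matches the logs above, but the file name scaledobjects.yaml is a placeholder:

# Remove the existing ScaledObjects so KEDA tears down the stale HPAs and metric registrations
kubectl delete scaledobjects --all -n s001

# Re-apply the original definitions; KEDA reconciles them and re-registers the metrics under the 2.5 naming scheme
kubectl apply -n s001 -f scaledobjects.yaml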

Expected Behavior

KEDA 2.5 is expected to work immediately and reliably out of the box for existing ScaledObject definitions.

Actual Behavior

Upgrading from KEDA 2.4 to KEDA 2.5 is disruptive for pre-existing ScaledObject-managed HPAs: the new-style metrics are inaccessible from the KEDA metrics API server.

Steps to Reproduce the Problem

  1. Deploy Keda 2.4
  2. Create ScaledObjects using the Prometheus scaler (a minimal example is sketched after this list)
  3. Upgrade to Keda 2.5
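
A minimal sketch of step 2, assuming a Prometheus trigger roughly matching the metric names seen above; the namespace, target Deployment, query, and threshold are placeholders, and the serverAddress and metricName are only inferred from the generated metric names in this report:

kubectl apply -n s001 -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: myApp
spec:
  scaleTargetRef:
    name: myApp                                    # placeholder Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: https://thanos.example.com  # inferred from the enumerated metric names
        metricName: burrow_lag
        query: 'sum(burrow_lag{app="myApp"})'      # placeholder query
        threshold: "120000000"                     # placeholder threshold
EOF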

This issue may be difficult to reproduce: it only occurred in 2 out of my 30 Kubernetes clusters, but it happened consistently within those 2. I am entirely unclear as to why those 2 clusters persistently had the issue; they should be configured identically to the rest.

Logs from KEDA operator

 Warning  FailedGetExternalMetric  3m6s (x791 over 3h22m)  horizontal-pod-autoscaler  unable to get external metric s001/s1-prometheus-burrow_lag/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name=myApp:,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for s1-prometheus-burrow_lag

KEDA Version

2.5.0

Kubernetes Version

1.20

Platform

Other

Scaler Details

Prometheus

Anything else?

No response

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 19 (16 by maintainers)

Most upvoted comments

Hi there.

Has there been any update on this issue?

We’re also seeing a similar issue. We have 3 clusters where we have upgraded from 2.4.0 to 2.5.0, and two of them are producing errors. We are using Azure Service Bus for the events and getting the output below; as you can see, the metric “s1-azure-servicebus-st-xxx” is showing as <unknown>.

kubectl describe hpa keda-hpa-file-xxx

Name:                                                              keda-hpa-file-xxx
Namespace:                                                         default
Labels:                                                            app.kubernetes.io/managed-by=Helm
                                                                   scaledobject.keda.sh/name=file-xxx
Annotations:                                                       <none>
CreationTimestamp:                                                 Fri, 10 Dec 2021 11:51:54 +0000
Reference:                                                         Deployment/file-xxx
Metrics:                                                           ( current / target )
  "s1-azure-servicebus-st-xxx" (target average value):   <unknown> / 5
  "s1-azure-servicebus-mi-xxx" (target average value):  0 / 5
Min replicas:                                                      1
Max replicas:                                                      15
Deployment pods:                                                   1 current / 1 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from external metric s1-azure-servicebus-mi-xxx(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: file-xxx,},MatchExpressions:[]LabelSelectorRequirement{},})
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
Events:
  Type     Reason                   Age                    From                       Message
  ----     ------                   ----                   ----                       -------
  Warning  FailedGetExternalMetric  34s (x782 over 3h18m)  horizontal-pod-autoscaler  unable to get external metric default/s1-azure-servicebus-st-xxx/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: file-xxx,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for s1-azure-servicebus-st-xxx

This issue resurfaced on me again, after a delay of several hours. This is the second time the issue has surfaced on me, each time failing several hours after the initial rollout. I will note that following the first incident, I deleted and recreated all ScaledObjects and HPA objects while already deployed on KEDA 2.5, to ensure that there wouldn’t be any potentially stale references left over. As I have had the issue a second time in multiple environments, this has not helped.

I support many environments, and across them I see two sets of behaviour:

  1. One where KEDA 2.5 works without issue
  2. One where KEDA 2.5 works for approximately 9 hours before the new-style metrics begin failing to resolve. This has happened 5 times now across 3 days and 3 environments.

For scenario 1) I have the following example, where I DO see a mismatch between the enumeration of the available resources and what’s actually queryable. This behaviour is consistent and reproducible across these environments.

(⎈)➜  ~ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq

{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "prometheus-https---thanos-example-com-burrow_lag",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    },
}

(⎈)➜  ~ kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1/namespaces/pool/s0-prometheus-burrow_lag?labelSelector=scaledobject.keda.sh/name=myApp' | jq
{
  "kind": "ExternalMetricValueList",
  "apiVersion": "external.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "metricName": "s0-prometheus-burrow_lag",
      "metricLabels": null,
      "timestamp": "2021-12-05T23:30:10Z",
      "value": "0"
    }
  ]
}

Just as part of writing this up, I note that if I restart the (already 2.5) KEDA metrics API server, it begins returning the correctly named metrics when enumerating, and the data resolves properly just the same (the restart is sketched after the output below):

(⎈)➜  ~ kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "s0-prometheus-burrow_lag",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    }
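
For reference, a sketch of the restart in question, assuming the default install names of the keda-operator-metrics-apiserver Deployment in the keda namespace:

# Restart the KEDA metrics API server so it re-registers the external metrics under the new names
kubectl rollout restart deployment/keda-operator-metrics-apiserver -n keda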

In scenario 2), where KEDA actually fails, I receive the following error messages and am unable to query the metrics:

apiVersion="autoscaling/v2beta2" type="Warning" reason="FailedGetExternalMetric" message="unable to get external metric s001/s2-prometheus-burrow_lag_sensor/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for s2-prometheus-burrow_lag_sensor"

Unfortunately I don’t have a live example of the API output from today, but when I was investigating this for the first time I saw the following unusual output. I now wonder if there is a cache expiring, or maybe a leader election change of some sort, that is causing the metric names to revert. I do believe the two metrics below in the new format were fixed as a result of deleting and re-creating the ScaledObject definitions. I wonder if restarting the metrics API server would again have refreshed the metrics to the point they resolve correctly, but needing to restart KEDA every few hours remains undesirable behaviour 😃

Friday morning example (broken):

kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1' | jq '.resources[].name'
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag_sensor"
"prometheus-https---thanos-example-com-burrow_lag"
"s0-prometheus-burrow_lag_sensor"
"s1-prometheus-burrow_lag"
"prometheus-https---thanos-example-com-burrow_lag"