keda: Keda 2.5 does not cleanly update from 2.4
Report
There appears to be a bug that prevents a clean and safe upgrade from KEDA 2.4 to KEDA 2.5, possibly related to this PR, which changed metric names, or this one. It affects pre-existing ScaledObjects that are present at the time of the upgrade from 2.4.
The symptom is that the HPA loop attempts to evaluate a metric that does not actually exist in the Kubernetes external metrics API. Below is a snippet of the output of kubectl describe hpa, in which the new KEDA 2.5 metric name format is queried, followed by the output of a direct query against the external metrics API:
Metrics: ( current / target )
"s1-prometheus-burrow_lag" (target average value): <unknown> / 120M
resource cpu on pods (as a percentage of request): 108% (3267m) / 100%
Warning FailedGetExternalMetric 3m6s (x791 over 3h22m) horizontal-pod-autoscaler unable to get external metric s001/s1-prometheus-burrow_lag/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name:,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for s1-prometheus-burrow_lag
kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1/namespaces/s001/s1-prometheus-burrow_lag?labelSelector=app=myApp' | jq
Error from server: No matching metrics found for s1-prometheus-burrow_lag
Reverting to KEDA 2.4 immediately fixed the issue and resumed use of the old metric names. While in the errored state, remediation was possible by deleting all ScaledObjects and recreating them. This appeared to trigger a reconciliation of each recreated ScaledObject, after which the new-style metric became available.
Expected Behavior
KEDA 2.5 should work immediately and reliably out of the box with existing ScaledObject definitions.
Actual Behavior
Upgrading from KEDA 2.4 to KEDA 2.5 is disruptive for pre-existing ScaledObject-managed HPAs. New-style metrics are inaccessible from the KEDA metrics API server.
Steps to Reproduce the Problem
- Deploy Keda 2.4
- Create ScaledObjects that use the Prometheus scaler
- Upgrade to Keda 2.5
This issue may be difficult to reproduce. It occurred in only 2 of my 30 Kubernetes clusters, but it happened consistently within those 2. I am entirely unclear as to why those 2 clusters persistently had the issue: they should be configured identically to the rest.
Logs from KEDA operator
Warning FailedGetExternalMetric 3m6s (x791 over 3h22m) horizontal-pod-autoscaler unable to get external metric s001/s1-prometheus-burrow_lag/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name=myApp:,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: No matching metrics found for s1-prometheus-burrow_lag
KEDA Version
2.5.0
Kubernetes Version
1.20
Platform
Other
Scaler Details
Prometheus
Anything else?
No response
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 19 (16 by maintainers)
Hi there.
Has there been any update on this issue?
We’re also seeing a similar issue. We have 3 clusters that we upgraded from 2.4.0 to 2.5.0, and two of them are producing errors. We are using Azure Service Bus for the events, and the metric “s1-azure-servicebus-st-xxx” is showing as ‘unknown’.
This issue resurfaced for me after a delay of several hours. This is the second time it has surfaced, each time failing several hours after the initial release. After the first incident, I deleted and recreated all ScaledObjects and HPA objects while already running KEDA 2.5, to ensure that no stale references were left over. Since the issue has now occurred a second time in multiple environments, this has not helped.
I support many environments, and across them I see two sets of behaviours:
For scenario 1) I have the following example where I DO see a mismatch between the enumeration of the available resources and what’s actually queryable. This behaviour is consistent and reproducible across
While writing this up, I noticed that if I restart the (already 2.5) KEDA metrics API server, it begins returning the correctly named metrics when enumerating; querying the data works properly either way.
In scenario 2), where KEDA actually fails, I receive the following error messages and am unable to query the metrics.
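A minimal sketch of how the mismatch between enumerated and queried metric names could be checked from the command line. It assumes jq is installed and that the HPAs can be read through the autoscaling/v2beta2 API (served by Kubernetes 1.20; newer clusters would need a different API version). The function names are hypothetical and not part of KEDA or kubectl.

```shell
# Metric names the external metrics API currently advertises
# (taken from the API group's discovery document).
advertised_metrics() {
  kubectl get --raw '/apis/external.metrics.k8s.io/v1beta1' \
    | jq -r '.resources[].name' | sort -u
}

# External metric names referenced by the HPAs in namespace $1.
# Assumption: HPAs are readable as autoscaling/v2beta2 objects.
hpa_external_metrics() {
  kubectl get hpa.v2beta2.autoscaling -n "$1" -o json \
    | jq -r '.items[].spec.metrics[]? | select(.type == "External") | .external.metric.name' \
    | sort -u
}

# Print every metric name that an HPA in namespace $1 requests
# but that the external metrics API does not advertise.
missing_metrics() {
  mm_advertised=$(advertised_metrics)
  hpa_external_metrics "$1" | while read -r mm_name; do
    printf '%s\n' "$mm_advertised" | grep -qxF -- "$mm_name" \
      || printf '%s\n' "$mm_name"
  done
}
```

Running missing_metrics s001 prints one name per line for each external metric that an HPA in the s001 namespace asks for but the enumeration does not list; empty output means the two views agree. The comparison only covers enumeration, so a name that is listed may still fail when queried directly.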
Unfortunately I don’t have a live example of the API output from today, but when I was investigating this for the first time I had the following unusual output. I now wonder if there is a cache expiring, or maybe a leader election changing of some sort that is causing a revert of the metric names. I do believe the two metrics below of the new format were fixed as the result of deleting and re-creating the ScaledObject definition. I wonder if restarting the metrics API server would have again refreshed the metrics to the point they resolve correctly. But needing to restart keda every few hours remains undesirable behaviour 😃