keda: Datadog scaler is not able to find matching metrics
Report
I have the Datadog scaler configured on an AWS EKS cluster with keda-2.6.1, using the Nginx requests-per-second metric for scaling. The setup works as expected for a few minutes, then starts throwing errors about not being able to find matching metrics. It auto-recovers after a few minutes, but the errors keep coming back, so it stays unstable continuously.
Error events on HPA
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False FailedGetExternalMetric the HPA was unable to compute the replica count: unable to get external metric proxy-demo/s1-datadog-max-nginx-net-request_per_s/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: datadog-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: no matching metrics found for s1-datadog-max-nginx-net-request_per_s
ScalingLimited False DesiredWithinRange the desired count is within the acceptable range
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedGetExternalMetric 59s (x1494 over 6h15m) horizontal-pod-autoscaler unable to get external metric proxy-demo/s1-datadog-max-nginx-net-request_per_s/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: datadog-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: no matching metrics found for s1-datadog-max-nginx-net-request_per_s
Expected Behavior
Once KEDA is able to fetch metrics from Datadog, it should keep working in a steady state.
Actual Behavior
It intermittently throws errors about not being able to fetch metrics, then auto-recovers, and the cycle repeats.
Steps to Reproduce the Problem
- Deploy the nginx proxy app
- Deploy a KEDA ScaledObject with the nginx RPS metric
- Generate traffic
- Wait 10 to 15 minutes
- Describe the HPA object; it will show the error events
Logs from KEDA operator
Error logs on keda-operator-metrics-api
E0217 16:35:40.580125 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"no matching metrics found for s1-datadog-max-nginx-net-request_per_s"}: no matching metrics found for s1-datadog-max-nginx-net-request_per_s
E0217 16:35:55.656813 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"no matching metrics found for s1-datadog-max-nginx-net-request_per_s"}: no matching metrics found for s1-datadog-max-nginx-net-request_per_s
E0217 16:36:10.733747 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"no matching metrics found for s1-datadog-max-nginx-net-request_per_s"}: no matching metrics found for s1-datadog-max-nginx-net-request_per_s
KEDA Version
2.6.1
Kubernetes Version
1.21
Platform
Amazon Web Services
Scaler Details
Datadog
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: datadog-scaledobject
spec:
  scaleTargetRef:
    name: nginx
  minReplicaCount: 1
  maxReplicaCount: 3
  pollingInterval: 15
  cooldownPeriod: 10
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 10
  triggers:
    - type: datadog
      metadata:
        query: "avg:nginx.net.request_per_s{cluster:cluster1}.rollup(15)"
        queryValue: "6"
        # Optional: (Global or Average). Whether the target value is global or average per pod. Default: Average
        type: "average"
        # Optional: The time window (in seconds) to retrieve metrics from Datadog. Default: 90
        age: "15"
      authenticationRef:
        name: keda-trigger-auth-datadog-secret
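For reference, here is a minimal sketch of the TriggerAuthentication and Secret that keda-trigger-auth-datadog-secret refers to. The parameter names (apiKey, appKey) follow the KEDA Datadog scaler documentation; the Secret name and the placeholder values are assumptions for illustration only.

apiVersion: v1
kind: Secret
metadata:
  name: datadog-secrets   # assumed name; must be in the same namespace as the ScaledObject
type: Opaque
data:
  apiKey: <base64-encoded Datadog API key>
  appKey: <base64-encoded Datadog application key>
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-trigger-auth-datadog-secret
spec:
  secretTargetRef:
    - parameter: apiKey   # parameter names as documented for the Datadog scaler
      name: datadog-secrets
      key: apiKey
    - parameter: appKey
      name: datadog-secrets
      key: appKey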
Anything else?
cc : @arapulido
About this issue
- State: closed
- Created 2 years ago
- Comments: 19 (14 by maintainers)
Yes, I am already looking into this, and I am working on some other improvements as well. This is not related to rate limiting, though; if it were, it wouldn't recover that fast. It happens because the scaler sometimes doesn't get a metric back, and KEDA then cancels the context (which is why the HPA logs the warning).
I will work on a patch that makes this more resilient, and that also makes the error clearer when the user hits rate limiting.
Sounds good to me, but I would call it metricUnavailableValue.
Yeah, I agree 0 could be misleading. Perhaps a default value could be provided as an argument? In my scenario, I could safely set it to 0. Others may want a different default value in the event that the metric is null.
One thing is for sure: the existing behavior is undesirable under most (all?) circumstances. Logging a warning that the metric is null seems appropriate, but breaking the trigger is not ideal. In my case, the HPA scaled pods up because of another trigger and then never scaled them down, because the null metric broke the comparison.
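If the proposed parameter lands as metricUnavailableValue in the trigger metadata, usage might look like the sketch below. The parameter name and the value of 0 come from this discussion, not from a released scaler version, so treat it as illustrative only.

triggers:
  - type: datadog
    metadata:
      query: "avg:nginx.net.request_per_s{cluster:cluster1}.rollup(15)"
      queryValue: "6"
      type: "average"
      age: "15"
      # Value reported when the Datadog query returns no data points,
      # instead of failing the trigger (name proposed in this thread)
      metricUnavailableValue: "0"
    authenticationRef:
      name: keda-trigger-auth-datadog-secret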