keda: Datadog scaler is not able to find matching metrics

Report

I have the Datadog scaler configured on an AWS EKS cluster with KEDA 2.6.1, using the Nginx requests-per-second metric for scaling. The setup works as expected for a few minutes, then starts throwing errors about not being able to find matching metrics. It auto-recovers within a few minutes, but the cycle keeps repeating, so it never stays stable.

Error events on HPA

AbleToScale     True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive   False   FailedGetExternalMetric  the HPA was unable to compute the replica count: unable to get external metric proxy-demo/s1-datadog-max-nginx-net-request_per_s/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: datadog-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: no matching metrics found for s1-datadog-max-nginx-net-request_per_s
  ScalingLimited  False   DesiredWithinRange       the desired count is within the acceptable range
Events:
  Type     Reason                   Age                     From                       Message
  ----     ------                   ----                    ----                       -------
  Warning  FailedGetExternalMetric  59s (x1494 over 6h15m)  horizontal-pod-autoscaler  unable to get external metric proxy-demo/s1-datadog-max-nginx-net-request_per_s/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: datadog-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: no matching metrics found for s1-datadog-max-nginx-net-request_per_s

Expected Behavior

Once KEDA is able to fetch the metric from Datadog, it should keep working in a steady state.

Actual Behavior

It intermittently throws errors about not being able to fetch metrics, then auto-recovers, and the cycle repeats.

Steps to Reproduce the Problem

  1. Deploy an nginx proxy app (a minimal Deployment is sketched after this list).
  2. Deploy a KEDA ScaledObject with an nginx RPS trigger.
  3. Generate traffic.
  4. Wait 10 to 15 minutes.
  5. Describe the HPA object; it will show the error events above.
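
For reference, step 1 can be satisfied with a minimal nginx Deployment along these lines (the names, labels, and image tag here are illustrative, not the exact manifests from this cluster):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx        # matched by scaleTargetRef.name in the ScaledObject below
  namespace: proxy-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
        ports:
        - containerPort: 80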

Logs from KEDA operator

Error logs from keda-operator-metrics-apiserver

E0217 16:35:40.580125       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"no matching metrics found for s1-datadog-max-nginx-net-request_per_s"}: no matching metrics found for s1-datadog-max-nginx-net-request_per_s
E0217 16:35:55.656813       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"no matching metrics found for s1-datadog-max-nginx-net-request_per_s"}: no matching metrics found for s1-datadog-max-nginx-net-request_per_s
E0217 16:36:10.733747       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"no matching metrics found for s1-datadog-max-nginx-net-request_per_s"}: no matching metrics found for s1-datadog-max-nginx-net-request_per_s

KEDA Version

2.6.1

Kubernetes Version

1.21

Platform

Amazon Web Services

Scaler Details

Datadog

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: datadog-scaledobject
spec:
  scaleTargetRef:
    name: nginx
  minReplicaCount: 1
  maxReplicaCount: 3
  pollingInterval: 15
  cooldownPeriod: 10
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 10
  triggers:
  - type: datadog
    metadata:
      query: "avg:nginx.net.request_per_s{cluster:cluster1}.rollup(15)"
      queryValue: "6"
      # Optional: (Global or Average). Whether the target value is global or average per pod. Default: Average
      type: "average"
      # Optional: The time window (in seconds) to retrieve metrics from Datadog. Default: 90
      age: "15"
    authenticationRef:
      name: keda-trigger-auth-datadog-secret
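
The ScaledObject above references keda-trigger-auth-datadog-secret. For completeness, this is roughly what that TriggerAuthentication and its backing Secret look like for the Datadog scaler (the parameter names apiKey/appKey follow the Datadog scaler documentation; the Secret name and values here are placeholders):

apiVersion: v1
kind: Secret
metadata:
  name: datadog-secrets
  namespace: proxy-demo
type: Opaque
data:
  apiKey: <base64-encoded Datadog API key>
  appKey: <base64-encoded Datadog application key>
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-trigger-auth-datadog-secret
  namespace: proxy-demo
spec:
  secretTargetRef:
  - parameter: apiKey
    name: datadog-secrets
    key: apiKey
  - parameter: appKey
    name: datadog-secrets
    key: appKey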

Anything else?

cc: @arapulido

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 19 (14 by maintainers)

Most upvoted comments

Yes, I am already looking into this and I am working on some other improvements. This is not related to rate-limiting, though. If it was, it wouldn’t recover that fast. This is due to sometimes not getting a metric, and KEDA cancelling the context (and thus the HPA logs the warning).

I will work on a patch that makes this more resilient, and also to make it clearer in the error when the user hits rate-limiting.

Sounds good to me, but I would call it metricUnavailableValue.

Yeah, I agree 0 could be misleading. Perhaps a default value could be provided as an argument? In my scenario, I could safely set it to 0. Others may want a different default value in the event that the metric is null.

One thing is for sure, the existing behavior is undesirable under most (all?) circumstances. Logging a warning that the metric is null seems appropriate, but breaking the trigger is not ideal. In my case, the HPA scaled pods up because of another trigger, and then never scaled them down because the null metric broke the comparison.
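
For anyone hitting the same problem later: assuming the proposed parameter lands under the name suggested above, the trigger could opt into a fallback value roughly like this (whether metricUnavailableValue exists, and under which name, depends on the KEDA version you are running, so check the Datadog scaler docs for your release):

triggers:
- type: datadog
  metadata:
    query: "avg:nginx.net.request_per_s{cluster:cluster1}.rollup(15)"
    queryValue: "6"
    type: "average"
    age: "15"
    # Proposed fallback: the value to report when Datadog returns no data points,
    # instead of failing the lookup and breaking the HPA comparison.
    metricUnavailableValue: "0"
  authenticationRef:
    name: keda-trigger-auth-datadog-secret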