datadog-agent: [cluster-agent] metrics don't appear in external.metrics.k8s.io

Describe what happened: I have set up the Cluster Agent using the stable/datadog Helm chart. When I query the external metrics endpoint, I get an empty list of resources.

$ kubectl get --raw /apis/external.metrics.k8s.io/v1beta1
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"external.metrics.k8s.io/v1beta1","resources":[]}

And the HPA is stuck at the <unknown> value.

$ kubectl get hpa 
NAME          REFERENCE                TARGETS              MINPODS   MAXPODS   REPLICAS   AGE
statsd-demo   Deployment/statsd-demo   <unknown>/10 (avg)   1         10        1          50m

Output of status:

root@ddog-cluster-agent-84486db86-qbwrw:/# datadog-cluster-agent status
Getting the status from the agent.
==============================
Datadog Cluster Agent (v1.0.0)
==============================

  Status date: 2018-10-30 18:28:19.562472 UTC
  Pid: 1
  Check Runners: 4
  Log Level: WARNING

  Paths
  =====
    Config File: /etc/datadog-agent/datadog-cluster.yaml
    conf.d: /etc/datadog-agent/conf.d

  Clocks
  ======
    System UTC time: 2018-10-30 18:28:19.562472 UTC

  Hostnames
  =========
    ec2-hostname: ip-1xx-1xx-2xx-190.us-west-2.compute.internal
    hostname: i-0c1580d88cbec55c0
    instance-id: i-0c1580d88cbec55c0
    socket-fqdn: ddog-cluster-agent-84486db86-qbwrw
    socket-hostname: ddog-cluster-agent-84486db86-qbwrw
    hostname provider: aws
    unused hostname providers:
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

  Leader Election
  ===============
    Leader Election Status:  Failing
    Error: entity not found
    

  Custom Metrics Server
  =====================
    ConfigMap name: default/datadog-custom-metrics
    
    External Metrics
    ----------------
      Total: 0
      Valid: 0
      

=========
Collector
=========

  Running Checks
  ==============
    
    kubernetes_apiserver
    --------------------
        Instance ID: kubernetes_apiserver [WARNING]
        Total Runs: 15
        Metric Samples: 0, Total: 0
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s
        
        Warning: [Leader Election not enabled. Not running Kubernetes API Server check or collecting Kubernetes Events.]
          
    
  
=========
Forwarder
=========

  CheckRunsV1: 14
  Dropped: 0
  DroppedOnInput: 0
  Events: 0
  HostMetadata: 0
  IntakeV1: 1
  Metadata: 0
  Requeued: 0
  Retried: 0
  RetryQueueSize: 0
  Series: 0
  ServiceChecks: 0
  SketchSeries: 0
  Success: 29
  TimeseriesV1: 14

  API Keys status
  ===============
    API key ending with xxxxx on endpoint https://app.datadoghq.com: API Key valid

Describe what you expected: The metrics should appear in the API endpoint, and the HPA should pick up the target value from them.

Steps to reproduce the issue: Install the Cluster Agent either via the Helm chart or using manifests.
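For reference, a sketch of the install with the values below (assumes Helm v2 syntax, consistent with the 2018 timeframe; the release name is illustrative):

```shell
# Illustrative only: installs the stable/datadog chart with the values.yaml shown below.
# "datadog" as the release name is an assumption, not from the original report.
helm install stable/datadog --name datadog -f values.yaml
```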

Additional environment details (Operating System, Cloud provider, etc):

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T18:02:47Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.3-eks", GitCommit:"58c199a59046dbf0a13a387d3491a39213be53df", GitTreeState:"clean", BuildDate:"2018-09-21T21:00:04Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Running platform version of EKS: eks.2

values.yaml used for helm install:

daemonset:
  useHostNetwork: true
  useHostPort: true
datadog:
  env:
    - name: DD_USE_DOGSTATSD
      value: "true"
    - name: DD_DOGSTATSD_PORT
      value: "8125"
    - name: DD_DOGSTATSD_NON_LOCAL_TRAFFIC
      value: "true"
  apiKey: "********************************"
  appKey: "****************************************"
clusterAgent:
  enabled: true
  token: "*****************************************"
  metricsProvider:
    enabled: true

HPA specification:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: statsd-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: statsd-demo
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metricName: demoInGo.request.count_total
      metricSelector:
        matchLabels:
          appname: statsd-demo
      targetAverageValue: 10
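Once the Cluster Agent serves the metric, it should be queryable directly through the external metrics API. A sketch of such a check, assuming the HPA above lives in the default namespace (the path format is the standard external.metrics.k8s.io one; the label selector is URL-encoded):

```shell
# Illustrative check against the external metrics API for the metric/labels in the HPA above.
# An empty or error response here matches the symptom described in this issue.
kubectl get --raw \
  "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/demoInGo.request.count_total?labelSelector=appname%3Dstatsd-demo"
```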

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 15 (8 by maintainers)

Most upvoted comments

@Chili-Man thank you for sharing!

tl;dr: To answer your question: as long as there is a config without labels, 1.0.0 will not support it. As soon as the HPA manifest is updated to include labels, it will be handled properly. The fix was merged earlier today, and I sincerely apologize for the trouble. We are starting QA on the release that will include it.

To note: with the fix in the new version, there will be an error message in the logs; however, we still do not support autoscaling on metrics without labels.

Could you try using datadog/cluster-agent-dev:charlyf-hpa-labels to confirm that it does not process “bad configs” ?

More details: The Cluster Agent runs a leader election process in order to process the autoscalers using informers. When processing the autoscalers, we extract the metric name and the labels, and we query Datadog to get the timestamp/value. We then store the results in a ConfigMap so that other Cluster Agents (if running several replicas) can access the values to serve to Kubernetes; this also helps reduce the number of calls to Datadog. When a Cluster Agent is not the leader, it only reads values from this ConfigMap when asked (by Kubernetes).

Lastly, in order to avoid keeping deleted/outdated autoscaler configs in the ConfigMap, a garbage collection process runs every 5 minutes: it lists values from the informer's cache and compares them with the content of the ConfigMap used to store the processed ones.
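You can inspect that ConfigMap directly; its namespace/name appears in the status output above (default/datadog-custom-metrics):

```shell
# Shows the processed external-metrics values the Cluster Agent stores for the HPA controller.
kubectl get configmap datadog-custom-metrics -n default -o yaml
```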

Hence, if a “bad” config is ever created or updated while the Cluster Agent is not the leader, it will be processed during GC and crash, as we were trying to access a nil pointer (the labels, which are missing). If the Cluster Agent is the leader and a bad config is made, it crashes almost immediately as it tries to digest the config (and access the missing labels).
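For illustration, a minimal sketch of such a “bad” config, reusing the metric from the HPA spec above: the External metric carries no metricSelector, so the labels the agent dereferences are nil.

```yaml
# Hypothetical "bad" HPA metrics block: metricSelector/labels omitted.
# v1.0.0 crashes on this (nil-pointer on the missing labels); with the fix,
# it is rejected with an error in the logs instead, but remains unsupported.
metrics:
- type: External
  external:
    metricName: demoInGo.request.count_total
    # metricSelector intentionally absent
    targetAverageValue: 10
```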