integrations-core: Unable to detect the kubelet URL automatically / cannot validate certificate

Output of the info page

Getting the status from the agent.

==============
Agent (v6.6.0)
==============

  Status date: 2018-11-13 23:10:34.603102 UTC
  Pid: 342
  Python Version: 2.7.15
  Logs:
  Check Runners: 4
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: 1.461ms
    System UTC time: 2018-11-13 23:10:34.603102 UTC

  Host Info
  =========
    bootTime: 2018-11-08 08:50:28.000000 UTC
    kernelVersion: 4.9.0-7-amd64
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: buster/sid
    procs: 70
    uptime: 133h51m42s
    virtualizationRole: host
    virtualizationSystem: kvm

  Hostnames
  =========
    hostname: reverent-kapitsa-1us
    socket-fqdn: datadog-agent-pxkhm
    socket-hostname: datadog-agent-pxkhm
    hostname provider: container
    unused hostname providers:
      aws: not retrieving hostname from AWS: the host is not an ECS instance, and other providers already retrieve non-default hostnames
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

=========
Collector
=========

  Running Checks
  ==============

    cpu
    ---
        Instance ID: cpu [OK]
        Total Runs: 114
        Metric Samples: 6, Total: 678
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    disk (1.4.0)
    ------------
        Instance ID: disk:e5dffb8bef24336f [OK]
        Total Runs: 114
        Metric Samples: 190, Total: 21,660
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 197ms


    docker
    ------
        Instance ID: docker [OK]
        Total Runs: 113
        Metric Samples: 216, Total: 23,850
        Events: 0, Total: 6
        Service Checks: 1, Total: 113
        Average Execution Time : 203ms


    file_handle
    -----------
        Instance ID: file_handle [OK]
        Total Runs: 114
        Metric Samples: 5, Total: 570
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    io
    --
        Instance ID: io [OK]
        Total Runs: 113
        Metric Samples: 39, Total: 4,380
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    kubelet (2.2.0)
    ---------------
        Instance ID: kubelet:d884b5186b651429 [ERROR]
        Total Runs: 114
        Metric Samples: 0, Total: 0
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 8ms
        Error: Unable to detect the kubelet URL automatically.
        Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/base/checks/base.py", line 366, in run
          self.check(copy.deepcopy(self.instances[0]))
        File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubelet/kubelet.py", line 113, in check
          raise CheckException("Unable to detect the kubelet URL automatically.")
      CheckException: Unable to detect the kubelet URL automatically.

    kubernetes_apiserver
    --------------------
        Instance ID: kubernetes_apiserver [OK]
        Total Runs: 113
        Metric Samples: 0, Total: 0
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 11ms


    load
    ----
        Instance ID: load [OK]
        Total Runs: 114
        Metric Samples: 6, Total: 684
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 2ms


    memory
    ------
        Instance ID: memory [OK]
        Total Runs: 113
        Metric Samples: 17, Total: 1,921
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 0s


    network (1.7.0)
    ---------------
        Instance ID: network:2a218184ebe03606 [OK]
        Total Runs: 114
        Metric Samples: 74, Total: 8,754
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 9ms


    ntp
    ---
        Instance ID: ntp:b4579e02d1981c12 [OK]
        Total Runs: 113
        Metric Samples: 1, Total: 113
        Events: 0, Total: 0
        Service Checks: 1, Total: 113
        Average Execution Time : 2ms


    uptime
    ------
        Instance ID: uptime [OK]
        Total Runs: 114
        Metric Samples: 1, Total: 114
        Events: 0, Total: 0
        Service Checks: 0, Total: 0
        Average Execution Time : 2ms

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  CheckRunsV1: 113
  Dropped: 0
  DroppedOnInput: 0
  Events: 0
  HostMetadata: 0
  IntakeV1: 11
  Metadata: 0
  Requeued: 0
  Retried: 0
  RetryQueueSize: 0
  Series: 0
  ServiceChecks: 0
  SketchSeries: 0
  Success: 237
  TimeseriesV1: 113

  API Keys status
  ===============
    API key ending with 1ed66 on endpoint https://app.datadoghq.com: API Key valid

==========
Logs Agent
==========

  container_collect_all
  ---------------------
    Type: docker
    Status: Pending

=========
DogStatsD
=========

  Checks Metric Sample: 65,227
  Event: 7
  Events Flushed: 7
  Number Of Flushes: 113
  Series Flushed: 53,494
  Service Check: 1,478
  Service Checks Flushed: 1,578
  Dogstatsd Metric Sample: 11,877

Additional environment details (Operating System, Cloud provider, etc):

Kubernetes 1.12 cluster on DigitalOcean.

Steps to reproduce the issue:

  1. Deploy the Datadog agent using the provided Kubernetes resources.
  2. View logs

Describe the results you received:

[ AGENT ] 2018-11-13 22:42:31 UTC | ERROR | (kubeutil.go:50 in GetKubeletConnectionInfo) | connection to kubelet failed: temporary failure in kubeutil, will retry later: try delay not elapsed yet
[ AGENT ] 2018-11-13 22:42:31 UTC | ERROR | (runner.go:289 in work) | Error running check kubelet: [{"message": "Unable to detect the kubelet URL automatically.", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/base/checks/base.py\", line 366, in run\n    self.check(copy.deepcopy(self.instances[0]))\n  File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubelet/kubelet.py\", line 113, in check\n    raise CheckException(\"Unable to detect the kubelet URL automatically.\")\nCheckException: Unable to detect the kubelet URL automatically.\n"}]
[...]
[ AGENT ] 2018-11-13 22:42:39 UTC | ERROR | (autoconfig.go:608 in collect) | Unable to collect configurations from provider Kubernetes: temporary failure in kubeutil, will retry later: cannot connect: https: "Get https://10.133.78.180:10250/pods: x509: cannot validate certificate for 10.133.78.180 because it doesn't contain any IP SANs", http: "Get http://10.133.78.180:10255/pods: dial tcp 10.133.78.180:10255: connect: connection refused"
[ AGENT ] 2018-11-13 22:42:39 UTC | INFO | (autoconfig.go:362 in initListenerCandidates) | kubelet listener cannot start, will retry: temporary failure in kubeutil, will retry later: cannot connect: https: "Get https://10.133.78.180:10250/pods: x509: cannot validate certificate for 10.133.78.180 because it doesn't contain any IP SANs", http: "Get http://10.133.78.180:10255/pods: dial tcp 10.133.78.180:10255: connect: connection refused"
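The x509 failure in these logs can be reproduced locally with openssl: a certificate whose only identity is a hostname Common Name (as the kubelet's self-signed cert here) cannot validate a connection made to an IP address. A minimal sketch; the CN and IP are illustrative, not taken from any real node:

```shell
# Create a self-signed cert identified only by a hostname CN, with no
# subjectAltName entries (mimicking the kubelet serving cert above).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=worker-node" \
  -keyout /tmp/kubelet.key -out /tmp/kubelet.crt 2>/dev/null

# Checking the cert against an IP fails: there is no IP SAN to match,
# which is exactly the "doesn't contain any IP SANs" error in the logs.
openssl x509 -in /tmp/kubelet.crt -noout -checkip 10.133.78.180
```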

Many dashboard entries remain empty.

Describe the results you expected:

No errors, access to kubelet, functional Kubernetes dashboard.

Additional information you deem important (e.g. issue happens only occasionally):

Seems to be the same problem as #1829; however, that issue is closed. Hosted Kubernetes services like DigitalOcean do not allow editing the kubelet configuration, as far as I know.

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 32
  • Comments: 73 (10 by maintainers)

Most upvoted comments

@mjhuber I opened a ticket on the Datadog issue tracker. Advice was to set DD_KUBELET_TLS_VERIFY=false for now. Hopefully DO will start using real certificates for the Kubelet API.

For anyone having this issue, I have been working through this with Datadog's support team.

It appears that AKS has changed the location of the kubelet client CA cert, at least between AKS 1.16.7 and 1.16.9.

The certificate used by AKS is now located on the node at /etc/kubernetes/certs/kubeletserver.crt.

If you are using the Helm charts, you can set the following values and the new certificate should be loaded correctly.

agents:
  volumes:
    - name: k8s-certs
      hostPath:
        path: /etc/kubernetes/certs
        type: ''
  volumeMounts:
    - name: k8s-certs
      readOnly: true
      mountPath: /etc/kubernetes/certs
datadog:
  env:
    - name: DD_KUBELET_CLIENT_CA
      value: /etc/kubernetes/certs/kubeletserver.crt

After adding and deploying this config, my Datadog agents (Helm chart 2.3.18, Docker image 7.20.2) on AKS 1.16.9 are now working correctly. The agent throws warnings that the certificate has no subjectAltName, but metrics are sent to Datadog successfully.

Hopefully a more permanent fix will follow, but this is a good enough fix for now to work again, without having to set DD_KUBELET_TLS_VERIFY=false.

This issue has been automatically marked as stale because it has not had activity in the last 30 days. Note that the issue will not be automatically closed, but this notification will remind us to investigate why there’s been inactivity. Thank you for participating in the Datadog open source community.

For those of you deploying Datadog with Helm (i.e. helm upgrade datadog -f values.yaml datadog/datadog):

Here’s a sample you can copy into your values.yaml:

  containers:
    agent:
      ## @param env - list - required
      ## Additional environment variables for the agent container.
      #
      env:
      - name: DD_KUBELET_TLS_VERIFY
        value: "false"

I’m running into this same issue on AWS EKS, using the EKS-optimized AMI for a worker node.

Using DD_KUBELET_TLS_VERIFY=false is not a solution for us: the problem seems to be that the read-only port on the kubelet is deprecated in recent versions of Kubernetes (I’m on 1.11).

I suppose the Datadog agent should get stats from kubelet using a different method.

We’re using a custom DNS server with private DNS zones on our vnet and ran into a similar issue. To fix it we:

  • applied the fix mentioned above
  • added a dnsConfig mapping to add a search suffix for our private DNS zone:

        dnsConfig:
          searches:
            - foo.bar.com

  • finally, updated DD_KUBERNETES_KUBELET_HOST to fieldPath: spec.nodeName

Just wanted to relate my experience using EKS 1.17/eks.3:

Experienced this issue deploying using the instructions here: https://docs.datadoghq.com/agent/cluster_agent/setup/?tab=secret

I basically did this:

  • Converted all manifests from yaml to hcl (I’m deploying using Terraform, yeah I know)
  • installed kube-state-metrics
  • spun wheels on this error for awhile

Eventually I noticed that the pods for both cluster-agent and node-agents weren’t mounting anything at /var/run/secrets/kubernetes.io/serviceaccount - resulting in a failure to auth to kubelet. The unable to detect kubelet URL error was actually a symptom of this problem.

This for me turned out to be a quirk of the Terraform kubernetes provider - the fix was to specify automount_service_account_token = true for both the cluster-agent deployment and the node-agent daemonset. Agents could then successfully auth to kubelets to get metrics and the spice began to flow.

Note that I did not have to disable DD_KUBELET_TLS_VERIFY 🎉

Hi everyone,

There seem to be several problems here.

For @jcassee: As you mentioned, we previously suggested setting kubelet_tls_verify to false. We understand it’s not a great solution, security wise. As we can see from the logs, it seems that the issue is from the certificate. If we take a deeper look, we can see on this log: [ AGENT ] 2018-11-13 22:42:39 UTC | INFO | (autoconfig.go:362 in initListenerCandidates) | kubelet listener cannot start, will retry: temporary failure in kubeutil, will retry later: cannot connect: https: "Get https://10.133.78.180:10250/pods: x509: cannot validate certificate for 10.133.78.180 because it doesn't contain any IP SANs", http: "Get http://10.133.78.180:10255/pods: dial tcp 10.133.78.180:10255: connect: connection refused"

The certificate cannot be validated because there is no SAN for the IP address of the node. The certificate most likely uses the hostname of the node as its Common Name. What you can do is configure our agent to use the node name instead of the IP to connect to the kubelet, by modifying the daemonset. Replace:

- name: DD_KUBERNETES_KUBELET_HOST
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP

with:

- name: DD_KUBERNETES_KUBELET_HOST
  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName

We are also in touch with DigitalOcean to suggest adding the node IP as a SAN in the certificate.

For @mjhuber Could you try the same work-around? Thanks!

For @PHameete and @praseodym: We do try querying the kubelet on the (read-only) port 10255 to get Kubernetes metrics: https://docs.datadoghq.com/agent/kubernetes/metrics/#kubelet But only after trying the HTTPS port (10250), which should always be open. I suspect the message you’re seeing happens after the HTTPS call fails.

We also have the kubernetes_state integration, which queries the KSM pod and gets these metrics: https://docs.datadoghq.com/agent/kubernetes/metrics/#kube-state-metrics These are different.

Disabling the TLS verification should not be needed if the correct certificates are used. If it doesn’t work, we would be happy to investigate. Please reach out to our support team if needed: support@datadoghq.com

For @bendrucker: Indeed, if your kubelet configuration doesn’t use the certificate /etc/kubernetes/ca.crt, our integration won’t work out of the box. However, if you have access to the certificate and the key, you can mount them in the agent pod and use these env vars to specify the new paths:

  • DD_KUBELET_CLIENT_CA: path of the CA certificate (ca.crt)
  • DD_KUBELET_CLIENT_CRT: path of the client certificate
  • DD_KUBELET_CLIENT_KEY: path of the client key

Please reach out to our support team if you need further details: support@datadoghq.com

For @sridhar81: If DD_KUBELET_TLS_VERIFY=false doesn’t work, it might not be a certificate issue. Did you use the RBACs provided in our documentation? https://docs.datadoghq.com/agent/kubernetes/daemonset_setup/#configure-rbac-permissions

Please reach out to our support team if more troubleshooting is needed: support@datadoghq.com

We’re seeing the same thing on EKS running 1.12. Setting DD_KUBELET_TLS_VERIFY=false does not work.

Anybody got a workaround for this?

I’m seeing this issue in kubernetes v1.12 on digital ocean as well

If someone is looking for how to deploy on AKS with DD_KUBELET_TLS_VERIFY using Helm, here’s a handy command line:

helm upgrade --install dd datadog/datadog --set datadog.apiKey=<apikey> \
  --set agents.containers.agent.env[0].name=DD_KUBELET_TLS_VERIFY \
  --set-string agents.containers.agent.env[0].value="false"

I can see my Kubernetes metrics in DD now!

For anyone who finds this issue while trying to set up Datadog on EKS, here are the changes I made to make it work:

  1. As @Simwar noted, for DD_KUBERNETES_KUBELET_HOST use spec.nodeName instead of status.hostIP
  2. Set automountServiceAccountToken to true on each container spec

For anyone using Terraform, you can do this as long as you create a dd_api_key variable:

resource "kubernetes_cluster_role" "datadog-cluster-agent" {
  metadata {
    name = "datadog-cluster-agent"
  }
  rule {
    api_groups = [""]
    resources  = ["services", "events", "endpoints", "pods", "nodes", "componentstatuses"]
    verbs      = ["get", "list", "watch"]
  }
  rule {
    api_groups = ["autoscaling"]
    resources  = ["horizontalpodautoscalers"]
    verbs      = ["list", "watch"]
  }
  rule {
    api_groups = [""]
    resources  = ["configmaps"]
    resource_names = ["datadogtoken", "datadog-leader-election"]
    verbs      = ["get", "update"]
  }
  rule {
    api_groups = [""]
    resources  = ["configmaps"]
    verbs      = ["create", "get", "update"]
  }
  rule {
    non_resource_urls = ["/version", "/healthz"]
    verbs = ["get"]
  }
}

resource "kubernetes_cluster_role_binding" "datadog-cluster-agent" {
  metadata {
    name = "datadog-cluster-agent"
  }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind = "ClusterRole"
    name = "datadog-cluster-agent"
  }
  subject {
    kind = "ServiceAccount"
    name = "datadog-cluster-agent"
  }
}

resource "kubernetes_service_account" "datadog-cluster-agent" {
  metadata {
    name = "datadog-cluster-agent"
  }
}

resource "kubernetes_cluster_role" "datadog-agent" {
  metadata {
    name = "datadog-agent"
  }
  rule {
    api_groups = [""]
    resources  = ["nodes/metrics", "nodes/spec", "nodes/proxy"]
    verbs      = ["get"]
  }
}

resource "kubernetes_cluster_role_binding" "datadog-agent" {
  metadata {
    name = "datadog-agent"
  }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind = "ClusterRole"
    name = "datadog-agent"
  }
  subject {
    kind = "ServiceAccount"
    name = "datadog-agent"
  }
}

resource "kubernetes_service_account" "datadog-agent" {
  metadata {
    name = "datadog-agent"
  }
}

resource "random_string" "datadog-auth-token" {
  keepers = {
    version = "1"
  }
  length = 32
  special = true
}

resource "kubernetes_secret" "datadog-auth-token" {
  metadata {
    name = "datadog-auth-token"
  }
  type = "Opaque"
  data = {
    token = base64encode(random_string.datadog-auth-token.result)
  }
}

resource "kubernetes_deployment" "datadog-cluster-agent" {
  metadata {
    name = "datadog-cluster-agent"
  }
  spec {
    selector {
      match_labels = {
        app = "datadog-cluster-agent"
      }
    }
    template {
      metadata {
        name = "datadog-cluster-agent"
        labels = {
          app = "datadog-cluster-agent"
        }
      }
      spec {
        service_account_name = kubernetes_service_account.datadog-cluster-agent.metadata[0].name
        automount_service_account_token = true
        container {
          name = "datadog-cluster-agent"
          image = "datadog/cluster-agent:latest"
          image_pull_policy = "Always"
          env {
            name = "DD_API_KEY"
            value = var.dd_api_key
          }
          env {
            name = "DD_COLLECT_KUBERNETES_EVENTS"
            value = "true"
          }
          env {
            name = "DD_LEADER_ELECTION"
            value = "true"
          }
          env {
            name = "DD_EXTERNAL_METRICS_PROVIDER_ENABLED"
            value = "true"
          }
          env {
            name = "DD_CLUSTER_AGENT_AUTH_TOKEN"
            value_from {
              secret_key_ref {
                name = kubernetes_secret.datadog-auth-token.metadata[0].name
                key = "token"
              }
            }
          }
        }
      }
    }
  }
}

resource "kubernetes_service" "datadog-cluster-agent" {
  metadata {
    name = "datadog-cluster-agent"
    labels = {
      app = "datadog-cluster-agent"
    }
  }
  spec {
    port {
      port = 5005
      protocol = "TCP"
    }
    selector = {
      app = "datadog-cluster-agent"
    }
  }
}

resource "kubernetes_daemonset" "datadog-agent" {
  metadata {
    name = "datadog-agent"
  }
  spec {
    selector {
      match_labels = {
        app = "datadog-agent"
      }
    }
    template {
      metadata {
        name = "datadog-agent"
        labels = {
          app = "datadog-agent"
        }
      }
      spec {
        service_account_name = kubernetes_service_account.datadog-agent.metadata[0].name
        automount_service_account_token = true
        container {
          name = "datadog-agent"
          image = "datadog/agent:latest"
          image_pull_policy = "Always"
          port {
            container_port = 8125
            name = "dogstatsdport"
            protocol = "UDP"
          }
          port {
            container_port = 8126
            name = "traceport"
            protocol = "TCP"
          }
          env {
            name = "DD_API_KEY"
            value = var.dd_api_key
          }
          env {
            name = "DD_COLLECT_KUBERNETES_EVENTS"
            value = "true"
          }
          env {
            name = "DD_LEADER_ELECTION"
            value = "true"
          }
          env {
            name = "KUBERNETES"
            value = "true"
          }
          env {
            name = "DD_KUBERNETES_KUBELET_HOST"
            value_from {
              field_ref {
                field_path = "spec.nodeName"
              }
            }
          }
          env {
            name = "DD_CLUSTER_AGENT_ENABLED"
            value = "true"
          }
          env {
            name = "DD_CLUSTER_AGENT_AUTH_TOKEN"
            value_from {
              secret_key_ref {
                name = kubernetes_secret.datadog-auth-token.metadata[0].name
                key = "token"
              }
            }
          }
          resources {
            requests {
              memory = "256Mi"
              cpu = "200m"
            }
            limits {
              memory = "256Mi"
              cpu = "200m"
            }
          }
          volume_mount {
            mount_path = "/var/run/docker.sock"
            name = "dockersocket"
          }
          volume_mount {
            mount_path = "/host/proc"
            name = "procdir"
            read_only = true
          }
          volume_mount {
            mount_path = "/host/sys/fs/cgroup"
            name = "cgroups"
            read_only = true
          }
          liveness_probe {
            exec {
              command = ["./probe.sh"]
            }
            initial_delay_seconds = 15
            period_seconds = 5
          }
        }
        volume {
          host_path {
            path = "/var/run/docker.sock"
          }
          name = "dockersocket"
        }
        volume {
          host_path {
            path = "/proc"
          }
          name = "procdir"
        }
        volume {
          host_path {
            path = "/sys/fs/cgroup"
          }
          name = "cgroups"
        }
      }
    }
  }
}

Can confirm adding DD_KUBELET_TLS_VERIFY=false does indeed work.

For posterity: https://github.com/chris-short/wingedblade/blob/master/datadog-agent.yaml#L35

@praseodym oh hi Mark 😉 Where did you find this recommendation? Both the integrations page in Datadog and the documentation pointed me towards a ‘standard’ Kubernetes deployment that uses the kubelet read-only port.

Someone then pointed me to the Helm chart for deploying Datadog, which uses the method you suggest, and that works for me.

The solution mentioned by @mopalinski does not seem to work for AKS, as the kubelet uses a self-signed certificate. These Helm release values worked for me with AKS 1.17.7:

agents:
  image:
    tag: 7.21.1
  volumes:
    - name: k8s-certs
      hostPath:
        path: /etc/kubernetes/certs
        type: ''
  volumeMounts:
    - name: k8s-certs
      readOnly: true
      mountPath: /etc/kubernetes/certs
  containers:
    agent:
      env:
        - name: DD_KUBELET_CLIENT_CA
          value: /etc/kubernetes/certs/kubeletserver.crt

Since this is the issue that pops up when I search for the following error: Unable to find metrics_endpoint in config file or detect the kubelet URL automatically, I figured I’d post the solution I found most straightforward.

This issue still occurs on the current version of the agent (7.21.1) when following the documentation while deploying on EKS (K8s 1.16). In my case the issue occurs simply because of the deprecation of the kubelet’s read-only (unencrypted) endpoint (port 10255), which is what the agent falls back to when a proper config for the HTTPS endpoint is not provided.

The solution is quite simple. You need to mount the service account token and certificate and tell the agent where they are by doing the following:

  1. setting automountServiceAccountToken parameter to true in your daemonset spec
  2. setting the following env vars for the agent container:
  • DD_KUBELET_CLIENT_CA to /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  • DD_BEARER_TOKEN_PATH to /var/run/secrets/kubernetes.io/serviceaccount/token

It seems to me that this should be the default at this point, considering the non-TLS endpoint is deprecated in more recent versions of K8s and TLS is preferable in general.
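Put together, the two settings above look roughly like this as a daemonset fragment. A sketch, not a full manifest: the container name is illustrative, and the paths are the standard in-pod service-account mount locations.

```yaml
spec:
  template:
    spec:
      automountServiceAccountToken: true
      containers:
        - name: agent   # illustrative container name
          env:
            # CA used to verify the kubelet's serving certificate
            - name: DD_KUBELET_CLIENT_CA
              value: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            # bearer token used to authenticate to the kubelet HTTPS endpoint
            - name: DD_BEARER_TOKEN_PATH
              value: /var/run/secrets/kubernetes.io/serviceaccount/token
```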

Hello,

Multiple issues were reported over time in this issue, we’ve added a documentation dedicated to Kubernetes distribution specificities here (including AKS spec.nodeName): https://docs.datadoghq.com/agent/kubernetes/distributions/?tab=helm

One note about automountServiceAccountToken: it’s true by default since Kubernetes 1.5, which is why it’s not included in our Helm chart. We’ll explicitly add it in case some hardened setups change this to false by default.

Feel free to open more dedicated issues or contact our support if your issue is not solved.

This is the case for my AKS clusters as well. Changing to:

  - name: DD_KUBERNETES_KUBELET_HOST
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName

resolved the issue.

I dug into the cause that @mopalinski suggested, and was able to verify that it is the actual cause of the breakage on AKS. All the discussion about moving CA files and self-signed certificates is incorrect and misleading.

The truth is that Datadog has never worked correctly on AKS, and has been silently relying on the insecure kubelet fallback port this whole time. The removal of this insecure port has only revealed that Datadog has been broken all along.

It appears this insecure fallback port was removed at some point in the AKS 1.16 line, which is what ultimately revealed the problem with the Datadog agent. It’s trivially easy to verify that this is the actual cause:

1.14:

# nc -z -v -w 1 "$DD_KUBERNETES_KUBELET_HOST" 10255
Connection to 10.240.0.9 10255 port [tcp/*] succeeded!

1.16:

# nc -z -v -w 1 "$DD_KUBERNETES_KUBELET_HOST" 10255
nc: connect to 10.240.0.5 port 10255 (tcp) failed: Connection refused

Datadog should hopefully prioritize a fix for this problem on their end, since it actually affects all versions of AKS.

Until that time, it seems like the simplest workaround is to set DD_KUBELET_TLS_VERIFY=false, or to use the self-signed certificate as its own trusted CA.

I am experiencing the same issue on AKS 1.16.9 as well, and my DD_KUBERNETES_KUBELET_HOST is set to status.hostIP.

@jdamata doesn’t work for me on AKS 1.16.9 for some reason.

For the people on AKS, I needed DD_KUBERNETES_KUBELET_HOST=status.hostIP

Just worked through this and wanted to share what I understand it would take to avoid setting DD_KUBELET_TLS_VERIFY=false.

We use typhoon, which runs the kubelet via systemd. It disables the read-only port and passes --authentication-token-webhook --authorization-mode=Webhook to enable bearer-token auth with the kubelet API. We install the Datadog agent via Helm and found that disabling TLS verification was all we needed to do in order to collect metrics without the read-only port.

I hopped onto a worker and tried curl https://localhost:10250 --cacert /etc/kubernetes/ca.crt, thinking the kubelet API’s certs were signed with the same CA used to sign the apiserver’s certs. Turns out that’s not the case.

https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/#client-and-serving-certificates

By default, the kubelet creates a self-signed key/cert for its server on start. If you specify --tls-{private-key,cert}-file and provide the CA cert used to sign them to the client (i.e. curl or the Datadog agent), it should work.

Here’s some discussion about addressing this issue in kubeadm:

https://github.com/kubernetes/kubeadm/issues/1223

Given that the datadog role is effectively read-only, we felt the risks of unverified TLS were acceptable until we have an opportunity to look at ways to sign kubelet API certs with a known CA or have the kubelet write its CA cert out to disk.
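The flip side of the missing-SAN failure can also be sketched locally with openssl: a cert that carries the node IP as a subjectAltName verifies cleanly against that IP, which is what signing kubelet serving certs properly (or adding the IP as a SAN) achieves. The CN and IP below are illustrative; -addext requires OpenSSL 1.1.1+.

```shell
# Create a self-signed cert that includes an IP subjectAltName.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=worker-node" -addext "subjectAltName=IP:10.240.0.9" \
  -keyout /tmp/kubelet-san.key -out /tmp/kubelet-san.crt 2>/dev/null

# With the IP SAN present, the identity check against the IP succeeds.
openssl x509 -in /tmp/kubelet-san.crt -noout -checkip 10.240.0.9
```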

This is still an issue, and the root problem is still the same. The method used for TLS verification by the Datadog image is still completely broken, and the most viable workaround at the moment is to just disable TLS verification.

For those using the Datadog Helm chart, you can fix it by setting

datadog:
  kubelet:
    tlsVerify: false

As a workaround, disabling TLS verification is OK, but it’s hardly a production-ready recommendation; otherwise, why would this TLS verification exist at all?

Some more info for you all on this, direct from MS:

Even with AKS version 1.16.x, the kubelet is accessible over HTTP on port 10255 if the cluster is upgraded from a previous version. If you launch a new NodePool in this version, kubelet is not accessible over port 10255. This is mentioned on the following Github Issues page:

The plan to discontinue this has been rolled out and is planned to take effect in upcoming versions: 1.18.x.

I did a repro in my lab environment and found that the new version of AKS does not allow access to the kubelet over plain HTTP, and that port 10255 is discontinued.

I launched a cluster with version 1.17.5:

PS C:\Users\rissing> k get nodes -o wide
NAME                                STATUS   ROLES   AGE   VERSION   INTERNAL-IP
aks-nodepool1-20955998-vmss000000   Ready    agent   17m   v1.17.5   10.240.0.4

Then I tried to access the plain HTTP port 10255 for the kubelet:

root@aks-ssh:/# curl http://10.240.0.4:10255/pods
curl: (7) Failed to connect to 10.240.0.4 port 10255: Connection refused


I can confirm that on my node pools upgraded to 1.16.x the kubelet checks do work if I have DD_KUBELET_TLS_VERIFY=false set. However, on brand-new node pools I can’t get any access to the kubelet via Datadog.

I have just upgraded our AKS from 1.16.7, which was working fine with Datadog, to 1.16.9, and my Datadog agents have now stopped working with this same error.

I have also just tried the latest Datadog Helm charts (2.3.18) with the latest image (7.19.0), which hasn’t fixed the issue.

Me too; I’m trying Datadog on a bare-metal k8s cluster.

 Datadog's kubelet integration is reporting:
Instance #kubelet:d884b5186b651429[ERROR]:[{"message": "Unable to detect the kubelet URL automatically.", "traceback": "Traceback (most recent call last):\n File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/base/checks/base.py\", line 556, in run\n self.check(instance)\n File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubelet/kubelet.py\", line 179, in check\n raise CheckException(\"Unable to detect the kubelet URL automatically.\")\nCheckException: Unable to detect the kubelet URL automatically.\n"}]

Baremetal k8s cluster here also. Installed via kubespray.

As mentioned by someone else previously, I managed to get around this error message by specifying the following environment variable in my daemonset manifest.

        env:
          - name: DD_KUBELET_TLS_VERIFY
            value: "false"

