integrations-core: Unable to detect the kubelet URL automatically / cannot validate certificate
Output of the info page
Getting the status from the agent.
==============
Agent (v6.6.0)
==============
Status date: 2018-11-13 23:10:34.603102 UTC
Pid: 342
Python Version: 2.7.15
Logs:
Check Runners: 4
Log Level: info
Paths
=====
Config File: /etc/datadog-agent/datadog.yaml
conf.d: /etc/datadog-agent/conf.d
checks.d: /etc/datadog-agent/checks.d
Clocks
======
NTP offset: 1.461ms
System UTC time: 2018-11-13 23:10:34.603102 UTC
Host Info
=========
bootTime: 2018-11-08 08:50:28.000000 UTC
kernelVersion: 4.9.0-7-amd64
os: linux
platform: debian
platformFamily: debian
platformVersion: buster/sid
procs: 70
uptime: 133h51m42s
virtualizationRole: host
virtualizationSystem: kvm
Hostnames
=========
hostname: reverent-kapitsa-1us
socket-fqdn: datadog-agent-pxkhm
socket-hostname: datadog-agent-pxkhm
hostname provider: container
unused hostname providers:
aws: not retrieving hostname from AWS: the host is not an ECS instance, and other providers already retrieve non-default hostnames
configuration/environment: hostname is empty
gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname
=========
Collector
=========
Running Checks
==============
cpu
---
Instance ID: cpu [OK]
Total Runs: 114
Metric Samples: 6, Total: 678
Events: 0, Total: 0
Service Checks: 0, Total: 0
Average Execution Time : 0s
disk (1.4.0)
------------
Instance ID: disk:e5dffb8bef24336f [OK]
Total Runs: 114
Metric Samples: 190, Total: 21,660
Events: 0, Total: 0
Service Checks: 0, Total: 0
Average Execution Time : 197ms
docker
------
Instance ID: docker [OK]
Total Runs: 113
Metric Samples: 216, Total: 23,850
Events: 0, Total: 6
Service Checks: 1, Total: 113
Average Execution Time : 203ms
file_handle
-----------
Instance ID: file_handle [OK]
Total Runs: 114
Metric Samples: 5, Total: 570
Events: 0, Total: 0
Service Checks: 0, Total: 0
Average Execution Time : 0s
io
--
Instance ID: io [OK]
Total Runs: 113
Metric Samples: 39, Total: 4,380
Events: 0, Total: 0
Service Checks: 0, Total: 0
Average Execution Time : 0s
kubelet (2.2.0)
---------------
Instance ID: kubelet:d884b5186b651429 [ERROR]
Total Runs: 114
Metric Samples: 0, Total: 0
Events: 0, Total: 0
Service Checks: 0, Total: 0
Average Execution Time : 8ms
Error: Unable to detect the kubelet URL automatically.
Traceback (most recent call last):
File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/base/checks/base.py", line 366, in run
self.check(copy.deepcopy(self.instances[0]))
File "/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubelet/kubelet.py", line 113, in check
raise CheckException("Unable to detect the kubelet URL automatically.")
CheckException: Unable to detect the kubelet URL automatically.
kubernetes_apiserver
--------------------
Instance ID: kubernetes_apiserver [OK]
Total Runs: 113
Metric Samples: 0, Total: 0
Events: 0, Total: 0
Service Checks: 0, Total: 0
Average Execution Time : 11ms
load
----
Instance ID: load [OK]
Total Runs: 114
Metric Samples: 6, Total: 684
Events: 0, Total: 0
Service Checks: 0, Total: 0
Average Execution Time : 2ms
memory
------
Instance ID: memory [OK]
Total Runs: 113
Metric Samples: 17, Total: 1,921
Events: 0, Total: 0
Service Checks: 0, Total: 0
Average Execution Time : 0s
network (1.7.0)
---------------
Instance ID: network:2a218184ebe03606 [OK]
Total Runs: 114
Metric Samples: 74, Total: 8,754
Events: 0, Total: 0
Service Checks: 0, Total: 0
Average Execution Time : 9ms
ntp
---
Instance ID: ntp:b4579e02d1981c12 [OK]
Total Runs: 113
Metric Samples: 1, Total: 113
Events: 0, Total: 0
Service Checks: 1, Total: 113
Average Execution Time : 2ms
uptime
------
Instance ID: uptime [OK]
Total Runs: 114
Metric Samples: 1, Total: 114
Events: 0, Total: 0
Service Checks: 0, Total: 0
Average Execution Time : 2ms
========
JMXFetch
========
Initialized checks
==================
no checks
Failed checks
=============
no checks
=========
Forwarder
=========
CheckRunsV1: 113
Dropped: 0
DroppedOnInput: 0
Events: 0
HostMetadata: 0
IntakeV1: 11
Metadata: 0
Requeued: 0
Retried: 0
RetryQueueSize: 0
Series: 0
ServiceChecks: 0
SketchSeries: 0
Success: 237
TimeseriesV1: 113
API Keys status
===============
API key ending with 1ed66 on endpoint https://app.datadoghq.com: API Key valid
==========
Logs Agent
==========
container_collect_all
---------------------
Type: docker
Status: Pending
=========
DogStatsD
=========
Checks Metric Sample: 65,227
Event: 7
Events Flushed: 7
Number Of Flushes: 113
Series Flushed: 53,494
Service Check: 1,478
Service Checks Flushed: 1,578
Dogstatsd Metric Sample: 11,877
Additional environment details (Operating System, Cloud provider, etc):
Kubernetes 1.12 cluster on DigitalOcean.
Steps to reproduce the issue:
- Deploy the Datadog agent using the provided Kubernetes resources.
- View logs
Describe the results you received:
[ AGENT ] 2018-11-13 22:42:31 UTC | ERROR | (kubeutil.go:50 in GetKubeletConnectionInfo) | connection to kubelet failed: temporary failure in kubeutil, will retry later: try delay not elapsed yet
[ AGENT ] 2018-11-13 22:42:31 UTC | ERROR | (runner.go:289 in work) | Error running check kubelet: [{"message": "Unable to detect the kubelet URL automatically.", "traceback": "Traceback (most recent call last):\n File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/base/checks/base.py\", line 366, in run\n self.check(copy.deepcopy(self.instances[0]))\n File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubelet/kubelet.py\", line 113, in check\n raise CheckException(\"Unable to detect the kubelet URL automatically.\")\nCheckException: Unable to detect the kubelet URL automatically.\n"}]
[...]
[ AGENT ] 2018-11-13 22:42:39 UTC | ERROR | (autoconfig.go:608 in collect) | Unable to collect configurations from provider Kubernetes: temporary failure in kubeutil, will retry later: cannot connect: https: "Get https://10.133.78.180:10250/pods: x509: cannot validate certificate for 10.133.78.180 because it doesn't contain any IP SANs", http: "Get http://10.133.78.180:10255/pods: dial tcp 10.133.78.180:10255: connect: connection refused"
[ AGENT ] 2018-11-13 22:42:39 UTC | INFO | (autoconfig.go:362 in initListenerCandidates) | kubelet listener cannot start, will retry: temporary failure in kubeutil, will retry later: cannot connect: https: "Get https://10.133.78.180:10250/pods: x509: cannot validate certificate for 10.133.78.180 because it doesn't contain any IP SANs", http: "Get http://10.133.78.180:10255/pods: dial tcp 10.133.78.180:10255: connect: connection refused"
Many dashboard entries remain empty.
Describe the results you expected:
No errors, access to kubelet, functional Kubernetes dashboard.
Additional information you deem important (e.g. issue happens only occasionally):
Seems to be the same problem as #1829; however, that issue is closed. Hosted Kubernetes services like DigitalOcean do not allow editing the kubelet configuration, as far as I know.
Commits related to this issue
- Update EKS doc to reflect https://github.com/DataDog/integrations-core/issues/2582#issuecomment-470860868 — committed to jjshanks/the-monitor by deleted user 5 years ago
@mjhuber I opened a ticket on the Datadog issue tracker. Advice was to set DD_KUBELET_TLS_VERIFY=false for now. Hopefully DO will start using real certificates for the Kubelet API.
For anyone having this issue, I have been working through this with Datadog's support team.
It appears that AKS has changed the location of the Kubelet client CA cert, at least between AKS 1.16.7 and 1.16.9.
The certificate used by AKS is now located on the node at /etc/kubernetes/certs/kubeletserver.crt. If you use the Helm charts, you can set the following values and the new certificate should get loaded correctly.
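The values themselves aren't preserved in this copy of the comment; a minimal sketch of what they likely amount to, assuming a recent datadog/datadog chart that exposes the kubelet CA path as datadog.kubelet.hostCAPath:

# Hedged sketch: point the agent at the AKS kubelet serving certificate on the host.
datadog:
  kubelet:
    hostCAPath: /etc/kubernetes/certs/kubeletserver.crt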
After adding and deploying this config, my Datadog agents (Helm chart 2.3.18 and Docker image 7.20.2) on AKS 1.16.9 are now working correctly. The agent throws warnings that the certificate has no subjectAltName, but metrics are sent to Datadog successfully.
Hopefully a more permanent fix will follow, but this is a good enough fix for now to get things working again without having to set DD_KUBELET_TLS_VERIFY=false.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. Note that the issue will not be automatically closed, but this notification will remind us to investigate why there's been inactivity. Thank you for participating in the Datadog open source community.
Those of you deploying DD with Helm, aka helm upgrade datadog -f values.yaml datadog/datadog: here's something you can copy into your values.yaml (sample below).
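The sample itself isn't preserved here; as an illustration, a values.yaml fragment for the spec.nodeName fix discussed elsewhere in this thread might look like the following (the datadog.kubelet.host key assumes the datadog/datadog chart v2; treat the exact keys as an assumption):

# Hedged sketch of a values.yaml fragment for connecting to the kubelet by node name.
datadog:
  kubelet:
    host:
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName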
I'm running into this same issue using AWS EKS, using the EKS Optimized AMI image for a worker node.
Using DD_KUBELET_TLS_VERIFY=false is not a solution for us: the problem seems to be that the read-only port on the kubelet is deprecated in recent versions of Kubernetes (I'm on 1.11). I suppose the Datadog agent should get stats from the kubelet using a different method.
We're using a custom DNS server with private DNS zones on our vnet and have run into a similar issue. To fix it we:
- added a dnsConfig mapping to add a search suffix of our private DNS zone
- set DD_KUBERNETES_KUBELET_HOST to fieldPath: spec.nodeName (see the sketch below)
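A sketch of what those two changes might look like in the agent daemonset pod spec; the search suffix is a placeholder, not from the original comment:

# Illustrative fragment; "internal.example.com" stands in for the private DNS zone suffix.
spec:
  template:
    spec:
      dnsConfig:
        searches:
          - internal.example.com
      containers:
        - name: agent
          env:
            - name: DD_KUBERNETES_KUBELET_HOST
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName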
Just wanted to relate my experience using EKS 1.17/eks.3:
Experienced this issue deploying using the instructions here: https://docs.datadoghq.com/agent/cluster_agent/setup/?tab=secret
I basically did this:
Eventually I noticed that the pods for both cluster-agent and node-agents weren’t mounting anything at /var/run/secrets/kubernetes.io/serviceaccount - resulting in a failure to auth to kubelet. The unable to detect kubelet URL error was actually a symptom of this problem.
This for me turned out to be a quirk of the Terraform Kubernetes provider - the fix was to specify automount_service_account_token = true for both the cluster-agent deployment and the node-agent daemonset. Agents could then successfully auth to kubelets to get metrics and the spice began to flow.
Note that I did not have to disable DD_KUBELET_TLS_VERIFY 🎉
Hi everyone,
There seem to be several problems here.
For @jcassee: As you mentioned, we previously suggested setting kubelet_tls_verify to false. We understand it's not a great solution, security-wise. As we can see from the logs, it seems that the issue comes from the certificate. If we take a deeper look, we can see in this log:
[ AGENT ] 2018-11-13 22:42:39 UTC | INFO | (autoconfig.go:362 in initListenerCandidates) | kubelet listener cannot start, will retry: temporary failure in kubeutil, will retry later: cannot connect: https: "Get https://10.133.78.180:10250/pods: x509: cannot validate certificate for 10.133.78.180 because it doesn't contain any IP SANs", http: "Get http://10.133.78.180:10255/pods: dial tcp 10.133.78.180:10255: connect: connection refused"
The certificate cannot be validated because there is no SAN for the IP address of the node. The certificate most likely uses the hostname of the node as its Common Name. What you could do is deploy our agent to use the node name instead of the IP to connect to the kubelet, by modifying the daemonset. Replace:
By:
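The before/after snippets aren't preserved in this copy; based on the spec.nodeName change echoed later in the thread, the swap was presumably along these lines (a reconstruction, not the original comment's exact text):

# presumed original entry
- name: DD_KUBERNETES_KUBELET_HOST
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP

# presumed replacement
- name: DD_KUBERNETES_KUBELET_HOST
  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName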
We are also in touch with DigitalOcean to suggest adding the node IP as a SAN in the certificate.
For @mjhuber Could you try the same workaround? Thanks!
For @PHameete and @praseodym: We do try querying the kubelet on the (read-only) port 10255 to get Kubernetes metrics (https://docs.datadoghq.com/agent/kubernetes/metrics/#kubelet), but only after trying the HTTPS port (10250), which should always be open. I suspect the message you're seeing happens after the HTTPS call fails.
We also have the kubernetes_state integration, which queries the KSM pod and gets these metrics: https://docs.datadoghq.com/agent/kubernetes/metrics/#kube-state-metrics These are different.
Disabling the TLS verification should not be needed if the correct certificates are used. If it doesn’t work, we would be happy to investigate. Please reach out to our support team if needed: support@datadoghq.com
For @bendrucker Indeed, if your kubelet configuration doesn't use this certificate (/etc/kubernetes/ca.crt), our integration won't work out of the box. However, if you have access to the certificate and the key, you can mount them in the agent pod and use these env vars to specify the new paths:
- DD_KUBELET_CLIENT_CA: the path of the ca.crt
- DD_KUBELET_CLIENT_CRT: the path of the crt
- DD_KUBELET_CLIENT_KEY: the path of the key
Please reach out to our support team if you need further details: support@datadoghq.com
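As an illustration only (the secret name and mount path below are hypothetical, not from the comment), mounting the certificates and setting those variables in the agent daemonset might look like this:

# Hypothetical example: client cert, key, and CA stored in a secret and mounted into the agent.
spec:
  template:
    spec:
      volumes:
        - name: kubelet-client-certs
          secret:
            secretName: kubelet-client-certs   # hypothetical secret holding ca.crt, kubelet.crt, kubelet.key
      containers:
        - name: agent
          env:
            - name: DD_KUBELET_CLIENT_CA
              value: /etc/kubelet-client-certs/ca.crt
            - name: DD_KUBELET_CLIENT_CRT
              value: /etc/kubelet-client-certs/kubelet.crt
            - name: DD_KUBELET_CLIENT_KEY
              value: /etc/kubelet-client-certs/kubelet.key
          volumeMounts:
            - name: kubelet-client-certs
              mountPath: /etc/kubelet-client-certs
              readOnly: true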
For @sridhar81 If DD_KUBELET_TLS_VERIFY=false doesn't work, it might not be a certificate issue. Did you use the RBACs provided in our documentation? https://docs.datadoghq.com/agent/kubernetes/daemonset_setup/#configure-rbac-permissions
Please reach out to our support team if more troubleshooting is needed: support@datadoghq.com
We're seeing the same thing on EKS running 1.12. Setting DD_KUBELET_TLS_VERIFY=false does not work.
Anybody got a workaround for this?
I'm seeing this issue in Kubernetes v1.12 on DigitalOcean as well.
If someone is looking for how to deploy on AKS with DD_KUBELET_TLS_VERIFY using Helm, here's a handy command line:
I can see my Kubernetes metrics in DD now!
For anyone who finds this issue while trying to set up Datadog on EKS, here are the changes I made to make it work:
For anyone using Terraform, you can do this as long as you create a dd_api_key var.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. Note that the issue will not be automatically closed, but this notification will remind us to investigate why there's been inactivity. Thank you for participating in the Datadog open source community.
Can confirm adding DD_KUBELET_TLS_VERIFY=false does indeed work.
For posterity: https://github.com/chris-short/wingedblade/blob/master/datadog-agent.yaml#L35
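In daemonset form, that workaround is typically just an env entry on the agent container, along these lines:

# Disables kubelet TLS verification for the agent (a workaround, not a fix).
- name: DD_KUBELET_TLS_VERIFY
  value: "false"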
@praseodym oh hi Mark 😉 Where did you find this recommendation? Because both the integrations page in Datadog and the documentation pointed me towards a 'standard' Kubernetes deployment that uses the kubelet read-only port.
Someone then pointed me to the Helm chart for deploying Datadog, which uses the method you suggest, and that works for me.
The solution mentioned by @mopalinski does not seem to work for AKS as the kubelet uses a self-signed certificate. These Helm release values worked for me with AKS 1.17.7:
Since this is the issue that pops up when I search for the following error:
Unable to find metrics_endpoint in config file or detect the kubelet URL automatically
I figured I'd post the solution I found most straightforward.
This issue still occurs on the current version of the agent (7.21.1) when following the documentation while deploying on EKS (K8s 1.16). In my case this issue simply occurs because of the deprecation of the kubelet's read-only (unencrypted) endpoint (port 10255), which is what the agent defaults to when proper config for the HTTPS endpoint is not provided.
The solution is quite simple. You need to mount the service account token and certificate and tell the agent where they are by doing the following:
- Set the automountServiceAccountToken parameter to true in your daemonset spec.
- In the agent container, set DD_KUBELET_CLIENT_CA to /var/run/secrets/kubernetes.io/serviceaccount/ca.crt and DD_BEARER_TOKEN_PATH to /var/run/secrets/kubernetes.io/serviceaccount/token (see the sketch below).
Seems to me that this should be the default at this point, considering the non-SSL endpoint is deprecated in more recent versions of K8s and SSL is really preferable in general.
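A sketch of the relevant daemonset pieces, using the paths and variable names given in this comment:

# Mount the service account token automatically and tell the agent where to find the CA and token.
spec:
  template:
    spec:
      automountServiceAccountToken: true
      containers:
        - name: agent
          env:
            - name: DD_KUBELET_CLIENT_CA
              value: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            - name: DD_BEARER_TOKEN_PATH
              value: /var/run/secrets/kubernetes.io/serviceaccount/token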
Hello,
Multiple issues were reported over time in this issue; we've added documentation dedicated to Kubernetes distribution specificities (including AKS spec.nodeName) here: https://docs.datadoghq.com/agent/kubernetes/distributions/?tab=helm
One note about automountServiceAccountToken: it's true by default since Kubernetes 1.5, which is why it's not included in our Helm chart. We'll explicitly add it in case some hardening/setups change this to false by default.
Feel free to open more dedicated issues or contact our support if your issue is not solved.
This is the case for my AKS clusters as well:
Changing to
- name: DD_KUBERNETES_KUBELET_HOST
  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName
resolved the issue.
I dug into the cause that @mopalinski suggested, and was able to verify that it is the actual cause of the breakage on AKS. All this discussion about moving CA files and self-signed certificates is incorrect and misleading.
The truth is that Datadog has never worked correctly on AKS and has been silently relying on the unsecured kubelet fallback port this whole time. The removal of this insecure port has only exposed that Datadog was broken all along.
It appears this unsecured fallback port was removed at some point in the AKS 1.16 line, which is what ultimately revealed the problem with the Datadog agent. It’s trivially easy to verify this is the actual cause:
1.14:
1.16:
Datadog should hopefully prioritize a fix for this problem on their end, since it actually affects all versions of AKS.
Until that time, it seems like the simplest workaround is to set DD_KUBELET_TLS_VERIFY=false, or to use the self-signed certificate as its own trusted CA.
I am experiencing the same issue on AKS 1.16.9 as well, and my DD_KUBERNETES_KUBELET_HOST is set to status.hostIP.
@jdamata doesn't work for me on AKS 1.16.9 for some reason.
For the people on AKS, I needed DD_KUBERNETES_KUBELET_HOST=status.hostIP
Just worked through this and wanted to share what I understand it would take to not set DD_KUBELET_TLS_VERIFY=false.
We use Typhoon, which runs the kubelet via systemd. It disables the read-only port and passes --authentication-token-webhook --authorization-mode=Webhook to enable bearer token auth with the kubelet API. We install the Datadog agent via Helm and found that disabling TLS verification was all we needed to do in order to collect metrics without the read-only port.
I hopped onto a worker and tried
curl https://localhost:10250 --cacert /etc/kubernetes/ca.crt, thinking the kubelet API's certs were signed with the same CA used to sign the apiserver's certs. Turns out that's not the case: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/#client-and-serving-certificates
By default, the kubelet creates a self-signed key/cert for its server on start. If you specify --tls-{private-key,cert}-file and provide the CA cert used to sign them to the client (i.e. curl or the Datadog agent), it should work.
Here's some discussion about addressing this issue in kubeadm:
https://github.com/kubernetes/kubeadm/issues/1223
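For reference (not from this thread): on clusters where you control the kubelet, the same serving cert/key can also be set in the KubeletConfiguration file rather than via command-line flags; a minimal sketch with hypothetical paths:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Serve the kubelet API with a cert signed by a CA you can hand to clients such as the Datadog agent.
tlsCertFile: /etc/kubernetes/pki/kubelet-server.crt
tlsPrivateKeyFile: /etc/kubernetes/pki/kubelet-server.key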
Given that the datadog role is effectively read-only, we felt the risks of unverified TLS were acceptable until we have an opportunity to look at ways to sign kubelet API certs with a known CA or have the kubelet write its CA cert out to disk.
This is still an issue, and the root problem is still the same. The method used for TLS verification by the Datadog image is still completely broken, and the most viable workaround at the moment is to just disable TLS verification.
For those using the Datadog Helm chart, you can fix it by setting
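The specific value isn't preserved in this copy of the comment; consistent with the workaround described just above, a values.yaml entry along these lines (the key name assumes the datadog/datadog chart v2) disables kubelet TLS verification:

# Hedged sketch: turn off kubelet TLS verification via Helm values.
datadog:
  kubelet:
    tlsVerify: false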
Disabling TLS as a workaround is OK, but I'm not sure it's a recommended, production-ready approach; otherwise, why does this TLS verification exist?
Some more info for you all on this, direct from MS:
Even with AKS version 1.16.x, the kubelet is accessible over HTTP on port 10255 if the cluster is upgraded from a previous version. If you launch a new NodePool in this version, kubelet is not accessible over port 10255. This is mentioned on the following Github Issues page:
The plan to discontinue this has been rolled out and is planned to be introduced in the upcoming versions: 1.18.x.
I did a repro in my lab environment and found that the new version of AKS does not allow access to the kubelet over plain HTTP, and that port 10255 is discontinued.
I launched a cluster with version 1.17.5:
PS C:\Users\rissing> k get nodes -o wide
NAME                                STATUS   ROLES   AGE   VERSION   INTERNAL-IP
aks-nodepool1-20955998-vmss000000   Ready    agent   17m   v1.17.5   10.240.0.4
Then I tried to access the plain HTTP port 10255 for kubelet:
root@aks-ssh:/# curl http://10.240.0.4:10255/pods
curl: (7) Failed to connect to 10.240.0.4 port 10255: Connection refused
I can confirm that on my node pools upgraded to 1.16.x, the kubelet checks do work if I have DD_KUBELET_TLS_VERIFY=false set. However, on brand new node pools I can't get any access to the kubelet via Datadog.
I have just upgraded our AKS from 1.16.7, which was working fine with Datadog, to 1.16.9, and my Datadog agents have now stopped working too with this same error.
I have also just tried the latest Datadog Helm chart (2.3.18) with the latest image 7.19.0, which hasn't fixed the issue.
Baremetal k8s cluster here also. Installed via kubespray.
As mentioned by someone else previously, I managed to get around this error message by specifying the following environment variable in my daemonset manifest.
Me too; I'm trying Datadog on a bare-metal k8s cluster.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. Note that the issue will not be automatically closed, but this notification will remind us to investigate why there’s been inactivity. Thank you for participating in the Datadog open source community.