datadog-agent: Cluster Agent cannot reconcile webhook
Output of the info page (if this is a bug)
2022-01-04 15:47:36 UTC | CORE | WARN | (pkg/util/log/log.go:630 in func1) | Deactivating Autoconfig will disable most components. It's recommended to use autoconfig_exclude_features and autoconfig_include_features to activate/deactivate features selectively
2022-01-04 15:47:36 UTC | CORE | INFO | (cmd/system-probe/config/config.go:119 in Merge) | no config exists at /etc/datadog-agent/system-probe.yaml, ignoring...
Getting the status from the agent.
===============
Agent (v7.32.3)
===============
Status date: 2022-01-04 15:47:36.933 UTC (1641311256933)
Agent start: 2022-01-04 15:46:59.953 UTC (1641311219953)
Pid: 1
Go Version: go1.16.7
Python Version: 3.8.11
Build arch: amd64
Agent flavor: agent
Check Runners: 4
Log Level: INFO
Paths
=====
Config File: /etc/datadog-agent/datadog.yaml
conf.d: /etc/datadog-agent/conf.d
checks.d: /etc/datadog-agent/checks.d
Clocks
======
NTP offset: -3.602ms
System time: 2022-01-04 15:47:36.933 UTC (1641311256933)
Host Info
=========
bootTime: 2022-01-03 09:07:50 UTC (1641200870000)
kernelArch: x86_64
kernelVersion: 5.4.0-1064-azure
os: linux
platform: ubuntu
platformFamily: debian
platformVersion: 21.04
procs: 219
uptime: 30h39m16s
virtualizationRole: host
virtualizationSystem: kvm
Hostnames
=========
host_aliases: [6f2277ad-0ffe-4bcc-ad0a-497915c1b7ac aks-common-76155617-vmss000000-alg-m3-test-aks]
hostname: aks-common-76155617-vmss000000-alg-m3-test-aks
socket-fqdn: datadog-agent-5m7l5
socket-hostname: datadog-agent-5m7l5
host tags:
cluster_name:alg-m3-test-aks
kube_cluster_name:alg-m3-test-aks
hostname provider: container
unused hostname providers:
aws: not retrieving hostname from AWS: the host is not an ECS instance and other providers already retrieve non-default hostnames
azure: azure_hostname_style is set to 'os'
configuration/environment: hostname is empty
gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname
Metadata
========
cloud_provider: Azure
hostname_source: container
=========
Collector
=========
Running Checks
==============
containerd
----------
Instance ID: containerd [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/containerd.d/conf.yaml.default
Total Runs: 2
Metric Samples: Last Run: 558, Total: 1,116
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 2
Average Execution Time : 318ms
Last Execution Date : 2022-01-04 15:47:21 UTC (1641311241000)
Last Successful Execution Date : 2022-01-04 15:47:21 UTC (1641311241000)
cpu
---
Instance ID: cpu [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
Total Runs: 2
Metric Samples: Last Run: 9, Total: 11
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 0s
Last Execution Date : 2022-01-04 15:47:28 UTC (1641311248000)
Last Successful Execution Date : 2022-01-04 15:47:28 UTC (1641311248000)
cri
---
Instance ID: cri [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/cri.d/conf.yaml.default
Total Runs: 2
Metric Samples: Last Run: 54, Total: 108
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 13ms
Last Execution Date : 2022-01-04 15:47:35 UTC (1641311255000)
Last Successful Execution Date : 2022-01-04 15:47:35 UTC (1641311255000)
disk (4.4.0)
------------
Instance ID: disk:e5dffb8bef24336f [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/disk.d/conf.yaml.default
Total Runs: 2
Metric Samples: Last Run: 712, Total: 1,424
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 56ms
Last Execution Date : 2022-01-04 15:47:27 UTC (1641311247000)
Last Successful Execution Date : 2022-01-04 15:47:27 UTC (1641311247000)
file_handle
-----------
Instance ID: file_handle [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default
Total Runs: 2
Metric Samples: Last Run: 5, Total: 10
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 0s
Last Execution Date : 2022-01-04 15:47:34 UTC (1641311254000)
Last Successful Execution Date : 2022-01-04 15:47:34 UTC (1641311254000)
io
--
Instance ID: io [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/io.d/conf.yaml.default
Total Runs: 2
Metric Samples: Last Run: 78, Total: 102
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 0s
Last Execution Date : 2022-01-04 15:47:26 UTC (1641311246000)
Last Successful Execution Date : 2022-01-04 15:47:26 UTC (1641311246000)
kubelet (7.1.0)
---------------
Instance ID: kubelet:5bbc63f3938c02f4 [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/kubelet.d/conf.yaml.default
Total Runs: 2
Metric Samples: Last Run: 1,083, Total: 2,108
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 4, Total: 8
Average Execution Time : 449ms
Last Execution Date : 2022-01-04 15:47:27 UTC (1641311247000)
Last Successful Execution Date : 2022-01-04 15:47:27 UTC (1641311247000)
load
----
Instance ID: load [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/load.d/conf.yaml.default
Total Runs: 2
Metric Samples: Last Run: 6, Total: 12
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 0s
Last Execution Date : 2022-01-04 15:47:33 UTC (1641311253000)
Last Successful Execution Date : 2022-01-04 15:47:33 UTC (1641311253000)
memory
------
Instance ID: memory [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/memory.d/conf.yaml.default
Total Runs: 2
Metric Samples: Last Run: 18, Total: 36
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 0s
Last Execution Date : 2022-01-04 15:47:25 UTC (1641311245000)
Last Successful Execution Date : 2022-01-04 15:47:25 UTC (1641311245000)
network (2.4.0)
---------------
Instance ID: network:d884b5186b651429 [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/network.d/conf.yaml.default
Total Runs: 2
Metric Samples: Last Run: 79, Total: 158
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 13ms
Last Execution Date : 2022-01-04 15:47:32 UTC (1641311252000)
Last Successful Execution Date : 2022-01-04 15:47:32 UTC (1641311252000)
ntp
---
Instance ID: ntp:d884b5186b651429 [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/ntp.d/conf.yaml.default
Total Runs: 1
Metric Samples: Last Run: 1, Total: 1
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 1, Total: 1
Average Execution Time : 29ms
Last Execution Date : 2022-01-04 15:47:06 UTC (1641311226000)
Last Successful Execution Date : 2022-01-04 15:47:06 UTC (1641311226000)
uptime
------
Instance ID: uptime [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default
Total Runs: 2
Metric Samples: Last Run: 1, Total: 2
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 0s
Last Execution Date : 2022-01-04 15:47:24 UTC (1641311244000)
Last Successful Execution Date : 2022-01-04 15:47:24 UTC (1641311244000)
========
JMXFetch
========
Information
==================
Initialized checks
==================
no checks
Failed checks
=============
no checks
=========
Forwarder
=========
Transactions
============
Cluster: 0
ClusterRole: 0
ClusterRoleBinding: 0
CronJob: 0
DaemonSet: 0
Deployment: 0
Dropped: 0
HighPriorityQueueFull: 0
Job: 0
Node: 0
PersistentVolume: 0
PersistentVolumeClaim: 0
Pod: 0
ReplicaSet: 0
Requeued: 0
Retried: 0
RetryQueueSize: 0
Role: 0
RoleBinding: 0
Service: 0
ServiceAccount: 0
StatefulSet: 0
Transaction Successes
=====================
Total number: 6
Successes By Endpoint:
check_run_v1: 2
intake: 2
series_v1: 2
API Keys status
===============
API key ending with b9643: API Key valid
==========
Endpoints
==========
https://app.datadoghq.com - API Key ending with:
- b9643
==========
Logs Agent
==========
Sending compressed logs in HTTPS to agent-http-intake.logs.datadoghq.com on port 443
BytesSent: 1.416464e+06
EncodedBytesSent: 43496
LogsProcessed: 1333
LogsSent: 1291
datadog/datadog-agent-5m7l5/init-config
---------------------------------------
- Type: file
Identifier: d52221bc34ee20b008768595ae19895f09fe83c816fb54164db65dcb4eb616d1
Path: /var/log/pods/datadog_datadog-agent-5m7l5_22ab947a-4f5a-45b8-a8a7-cc4b41c192cf/init-config/*.log
Status: Pending
BytesRead: 0
Average Latency (ms): 0
24h Average Latency (ms): 0
Peak Latency (ms): 0
24h Peak Latency (ms): 0
container_collect_all
---------------------
- Type: docker
Status: Pending
BytesRead: 0
Average Latency (ms): 0
24h Average Latency (ms): 0
Peak Latency (ms): 0
24h Peak Latency (ms): 0
datadog/datadog-agent-5m7l5/process-agent
-----------------------------------------
- Type: file
Identifier: 0dfeab4c2b292ce6d46e4c933819f7ee3b2022d6f19a74e36f9c053bed23ca15
Path: /var/log/pods/datadog_datadog-agent-5m7l5_22ab947a-4f5a-45b8-a8a7-cc4b41c192cf/process-agent/*.log
Status: Pending
BytesRead: 0
Average Latency (ms): 0
24h Average Latency (ms): 0
Peak Latency (ms): 0
24h Peak Latency (ms): 0
datadog/datadog-agent-5m7l5/agent
---------------------------------
- Type: file
Identifier: 9edc900f82e3ecc7c3f876ee3a8f76a30e3ffc51998d4f653f9a02c9a7956c75
Path: /var/log/pods/datadog_datadog-agent-5m7l5_22ab947a-4f5a-45b8-a8a7-cc4b41c192cf/agent/*.log
Status: Pending
BytesRead: 0
Average Latency (ms): 0
24h Average Latency (ms): 0
Peak Latency (ms): 0
24h Peak Latency (ms): 0
kube-system/local-nvme-provisioner-nmf42/provisioner
----------------------------------------------------
- Type: file
Identifier: 44a728ccbecbcf5d406f217608dafaa7e31401828302891084c7a74bcae5112d
Path: /var/log/pods/kube-system_local-nvme-provisioner-nmf42_6b15fdc3-93f3-459c-ac35-4de83ea6439b/provisioner/*.log
Status: OK
1 files tailed out of 1 files matching
Inputs:
/var/log/pods/kube-system_local-nvme-provisioner-nmf42_6b15fdc3-93f3-459c-ac35-4de83ea6439b/provisioner/0.log
BytesRead: 150625
Average Latency (ms): 140
24h Average Latency (ms): 140
Peak Latency (ms): 527
24h Peak Latency (ms): 527
datadog/datadog-agent-5m7l5/init-volume
---------------------------------------
- Type: file
Identifier: 88ae3e02e78682babcf3e30529b9e02dfdd60d0e9d1614b504fb8a9c8d1b3c3a
Path: /var/log/pods/datadog_datadog-agent-5m7l5_22ab947a-4f5a-45b8-a8a7-cc4b41c192cf/init-volume/*.log
Status: Pending
BytesRead: 0
Average Latency (ms): 0
24h Average Latency (ms): 0
Peak Latency (ms): 0
24h Peak Latency (ms): 0
datadog/datadog-agent-5m7l5/trace-agent
---------------------------------------
- Type: file
Identifier: 24242ad0baa82636c471e84b78148d751d31a7804f6a22e2037d261d569b1e55
Path: /var/log/pods/datadog_datadog-agent-5m7l5_22ab947a-4f5a-45b8-a8a7-cc4b41c192cf/trace-agent/*.log
Status: Pending
BytesRead: 0
Average Latency (ms): 0
24h Average Latency (ms): 0
Peak Latency (ms): 0
24h Peak Latency (ms): 0
=========
APM Agent
=========
Status: Running
Pid: 1
Uptime: 36 seconds
Mem alloc: 17,203,264 bytes
Hostname: aks-common-76155617-vmss000000-***
Receiver: 0.0.0.0:8126
Endpoints:
https://trace.agent.datadoghq.com
Receiver (previous minute)
==========================
From go 1.17.5 (gc-amd64-linux), client v1.34.0
Traces received: 11 (4,852 bytes)
Spans received: 11
Default priority sampling rate: 100.0%
Priority sampling rate for 'service:api,env:test': 100.0%
Priority sampling rate for 'service:db,env:test': 100.0%
Writer (previous minute)
========================
Traces: 0 payloads, 0 traces, 0 events, 0 bytes
Stats: 0 payloads, 0 stats buckets, 0 bytes
=========
Aggregator
=========
Checks Metric Sample: 5,528
Dogstatsd Metric Sample: 438
Event: 1
Events Flushed: 1
Number Of Flushes: 2
Series Flushed: 3,245
Service Check: 35
Service Checks Flushed: 32
=========
DogStatsD
=========
Event Packets: 0
Event Parse Errors: 0
Metric Packets: 437
Metric Parse Errors: 0
Service Check Packets: 0
Service Check Parse Errors: 0
Udp Bytes: 73,515
Udp Packet Reading Errors: 0
Udp Packets: 275
Uds Bytes: 0
Uds Origin Detection Errors: 0
Uds Packet Reading Errors: 0
Uds Packets: 1
Unterminated Metric Errors: 0
=====================
Datadog Cluster Agent
=====================
- Datadog Cluster Agent endpoint detected: https://172.16.60.116:5005
Successfully connected to the Datadog Cluster Agent.
- Running: 1.16.0+commit.9961689
=============
Autodiscovery
=============
Enabled Features
================
containerd
cri
kubernetes
Describe what happened:
After upgrading the datadog helm chart to version 2.28.11 (datadog 7.32.3, DCA 1.16.0), we're getting the following errors from the cluster agent:
CLUSTER | INFO | (pkg/clusteragent/admission/controllers/webhook/controller_base.go:170 in processNextWorkItem) | Couldn't reconcile Webhook datadog-webhook: Operation cannot be fulfilled on mutatingwebhookconfigurations.admissionregistration.k8s.io "datadog-webhook": the object has been modified; please apply your changes to the latest version and try again.
Deleting and reinstalling the datadog helm chart does not fix the issue. Downgrading to version 2.22.10 (datadog 7.31.1, DCA 1.15.1) fixes the issue though.
Describe what you expected: We expect the cluster agent to work nominally and not throw errors about the admission controller webhook.
Steps to reproduce the issue:
Deploy datadog via the helm chart in version 2.28.11 (or any 2.28 patch) with the following values:
datadog:
  kubelet:
    tlsVerify: false
  logs:
    enabled: true
    containerCollectAll: true
  apm:
    portEnabled: true
  env:
    - name: DD_CONTAINER_EXCLUDE_LOGS
      value: "image:mcr.microsoft.com/.*" # Exclude kube-proxy (mcr.microsoft.com/oss/kubernetes/kube-proxy)
  systemProbe:
    collectDNSStats: false
clusterAgent:
  admissionController:
    enabled: true
    mutateUnlabelled: true
Additional environment details (Operating System, Cloud provider, etc): Running on AKS with kubernetes v1.21.2
About this issue
- State: closed
- Created 2 years ago
- Reactions: 9
- Comments: 31 (2 by maintainers)
This was resolved here via the addition of an envar specific to AKS. There are 2 potential workarounds:
- Set clusterAgent.admissionController to false (https://github.com/DataDog/helm-charts/blob/main/charts/datadog/values.yaml#L858), OR
- Set the envar DD_ADMISSION_CONTROLLER_ADD_AKS_SELECTORS to true in the clusterAgent section of your Helm chart.
Either of those two options will remove those errors. We will get this updated in the documentation.
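Expressed as Helm values, the second workaround might look like the following sketch (assuming the datadog chart's standard clusterAgent.env list for passing environment variables to the Cluster Agent):

# Sketch: keep the admission controller enabled, but have the Cluster Agent
# create the webhook with the AKS-specific selectors already in place, so
# Azure's admission enforcer has nothing left to rewrite.
clusterAgent:
  admissionController:
    enabled: true
  env:
    - name: DD_ADMISSION_CONTROLLER_ADD_AKS_SELECTORS
      value: "true"

The alternative is simply clusterAgent.admissionController.enabled: false, at the cost of losing the admission controller feature.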
Same for chart version v2.37.7
I am also facing the issue with Helm chart version 3.1.9 and Cluster Agent 7.39.2.
Part of the debug logs from the cluster-agent:
I am experiencing the same with datadog helm chart - 2.30.17
This seems to be broken again, at least on AKS 1.25.11. Assuming you've set the env variable from the workaround above, the agent creates the webhook with one set of selectors, while the resource, after Azure modifies it, ends up with another. The agent then just spins trying to reconcile the webhook several times a second, forever.
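For context: AKS runs an "admissions enforcer" that rewrites webhook configurations so they cannot match system namespaces, and the DD_ADMISSION_CONTROLLER_ADD_AKS_SELECTORS option works by pre-adding equivalent selectors so there is nothing left for Azure to change. Roughly, the injected selector looks like the sketch below (illustrative only; the webhook name is a placeholder, and the exact label keys vary by AKS version, which is how the agent's copy and Azure's copy can drift apart again):

# Illustrative: the kind of namespaceSelector AKS's admissions enforcer
# injects into a MutatingWebhookConfiguration to exclude system namespaces.
# Placeholder webhook name; exact keys vary by AKS version.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: datadog-webhook
webhooks:
  - name: example.webhook.datadoghq.com   # placeholder
    namespaceSelector:
      matchExpressions:
        - key: control-plane
          operator: DoesNotExist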
@emily-zall’s solution is how I corrected my error in GKE Autopilot. I’m unsure how I ended up with 2 replicasets, but once I removed the one that had 0 desired, the error stopped for me.
I’m putting my exact error message so it can help others find the solution:
Couldn’t reconcile Secret default/webhook-certificate: secrets is forbidden: User “system:serviceaccount:datadog:datadog-cluster-agent” cannot create resource “secrets” in API group “” in the namespace “default”
UPDATE/EDIT: The reason I was seeing the old replica set was that we use ArgoCD to deploy the agent. Based on the default revisionHistoryLimit value of the replica set (the default is 10), it was leaving the old ones in place. I set the value clusterAgent.revisionHistoryLimit to 0, which kept the replica set from saving the history on ArgoCD changes.
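As a values sketch (the key name is taken from the comment above; assuming a chart version that exposes it):

# Sketch: cap the Cluster Agent Deployment's revision history so ArgoCD
# syncs do not leave stale ReplicaSets behind.
clusterAgent:
  revisionHistoryLimit: 0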
Any update to this?
Same problem:
Edit: A word of warning. We have our AKS clusters configured to send Diagnostic Settings logs to Azure Monitor Log Analytics. In particular, the logs for kube-audit or kube-audit-admin will pick up these DD errors, because they show up as an Update event in the cluster against the mutatingwebhookconfiguration resource. This was costing us a lot of money in Log Analytics, because these errors are very frequent. For now, we've had to disable the Cluster Agent's Admission Controller feature. This stopped the excessive logging in the Cluster Agent pod as well as the excessive update events being sent to Log Analytics.
Same problem here:
AKS kubernetes version: 1.23.8, cluster-agent: 7.40.1, helm chart: datadog-3.3.1
This issue popped up when running two instances of the Helm chart in one cluster (one with APM enabled, the other disabled) and inadvertently running two instances of the cluster agent (a wrong indent in the YAML used for disabling the cluster agent). Once the second cluster agent was disabled, the problem was resolved.
I am also facing the issue with Helm chart version 3.1.10 and Cluster Agent 7.39.2. Running on AKS, kubernetes version 1.24.6.
Part of the debug logs from the cluster-agent:
same issue in chart 3.1.8
Hi @clamoriniere, we still have the issue with 1.17.0 (used in version 2.35.0 of the datadog helm chart).