datadog-agent: Datadog Agent Forwarder fails liveness probe when new spot instance joins cluster, causing multiple restarts
Output of the info page (if this is a bug)
==============
Agent (v6.0.0)
==============
Status date: 2018-05-15 17:28:44.007680 UTC
Pid: 326
Python Version: 2.7.14
Logs:
Check Runners: 1
Log Level: WARN
Paths
=====
Config File: /etc/datadog-agent/datadog.yaml
conf.d: /etc/datadog-agent/conf.d
checks.d: /etc/datadog-agent/checks.d
Clocks
======
System UTC time: 2018-05-15 17:28:44.007680 UTC
Host Info
=========
bootTime: 2018-05-15 17:21:48.000000 UTC
kernelVersion: 4.4.115-k8s
os: linux
platform: debian
platformFamily: debian
platformVersion: 9.3
procs: 72
uptime: 414
virtualizationRole: guest
virtualizationSystem: xen
Hostnames
=========
ec2-hostname: ip-172-31-99-8.us-west-2.compute.internal
hostname: i-0820f63f3cb2ffdc0
instance-id: i-0820f63f3cb2ffdc0
socket-fqdn: cluster-datadog-w2pc6
socket-hostname: cluster-datadog-w2pc6
=========
Collector
=========
Running Checks
==============
No checks have run yet
========
JMXFetch
========
Initialized checks
==================
no checks
Failed checks
=============
no checks
=========
Forwarder
=========
IntakeV1: 1
RetryQueueSize: 0
Success: 1
API Keys status
===============
https://6-0-0-app.agent.datadoghq.com,*************************4d886: API Key valid
==========
Logs-agent
==========
logs-agent is not running
=========
DogStatsD
=========
Event: 1
kubectl describe pod output
Normal Pulled 3m kubelet, ip-172-31-99-8.us-west-2.compute.internal Successfully pulled image "datadog/agent:6.0.0"
Warning Unhealthy 2m kubelet, ip-172-31-99-8.us-west-2.compute.internal Liveness probe failed: Agent health: FAIL
=== 1 healthy components ===
healthcheck
=== 13 unhealthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, collector-queue, dockerutil-event-dispatch, dogstatsd-main, forwarder, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker
Error: found 13 unhealthy components
Normal Created 2m (x3 over 3m) kubelet, ip-172-31-99-8.us-west-2.compute.internal Created container
Warning Unhealthy 2m (x4 over 2m) kubelet, ip-172-31-99-8.us-west-2.compute.internal Liveness probe failed: Agent health: FAIL
=== 13 healthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, collector-queue, dockerutil-event-dispatch, dogstatsd-main, healthcheck, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker
=== 1 unhealthy components ===
forwarder
Error: found 1 unhealthy components
Normal Killing 2m (x2 over 2m) kubelet, ip-172-31-99-8.us-west-2.compute.internal Killing container with id docker://datadog:Container failed liveness probe.. Container will be killed and recreated.
Normal Pulled 2m (x2 over 2m) kubelet, ip-172-31-99-8.us-west-2.compute.internal Container image "datadog/agent:6.0.0" already present on machine
Normal Started 2m (x3 over 3m) kubelet, ip-172-31-99-8.us-west-2.compute.internal Started container
Describe what happened: We run our cluster in AWS with master nodes as on-demand instances and worker nodes as spot instances. When we lose a spot node and a replacement joins the cluster, the Datadog Agent pod on the new node typically restarts about 6 times before it runs normally. The restarts happen because the liveness probe fails while the forwarder component is reported unhealthy.
Describe what you expected: The forwarder to be healthy when the agent is deployed on the fresh node.
Steps to reproduce the issue: On a Kubernetes 1.8.7 cluster with worker nodes running as spot instances in an autoscaling group, delete a worker node and wait for a replacement node to join the cluster. Once the new node has joined, watch the Datadog Agent pod scheduled on it.
Additional environment details (Operating System, Cloud provider, etc):
Cloud Provider: AWS
Kubernetes Version: 1.8.7
Datadog Version: 6.0.0
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 15 (5 by maintainers)
@willyyang @gtrembathcafex I simply updated the timeouts of the liveness probe in the DaemonSet deployment YAML and I'm not experiencing any issues anymore. I have some very high log-volume applications, so I thought that might influence it. In my case I increased initialDelaySeconds from 15 to 30 and timeoutSeconds from 1 to 5.
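For reference, a minimal sketch of that kind of change in the DaemonSet pod spec, assuming the exec-based probe.sh liveness probe shipped with the agent 6.0.0 manifest; the command path and any fields not mentioned in the comment above are illustrative and may differ in your deployment:

  livenessProbe:
    exec:
      command:
      - ./probe.sh           # assumed probe script from the stock manifest; adjust to your image
    initialDelaySeconds: 30  # raised from 15 to give the forwarder time to become healthy
    timeoutSeconds: 5        # raised from 1
    periodSeconds: 5         # standard Kubernetes probe field, shown unchanged for context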
We use some Datadog Agent Kubernetes config from ~3 years ago, but the agent Docker image is recent (7.39.1). We hit a similar error and had to increase initialDelaySeconds on the probe.sh livenessProbe to 60s (30s was not enough). I tried switching to the /health probe but it didn't help. @hkaj Does this mean the DaemonSet should not be using probe.sh for the liveness check? If so, what should the probe be?
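For comparison, a sketch of the 60-second variant described above, again assuming an exec probe wrapping probe.sh; if you switch to an HTTP-based probe instead, the path and port depend on your agent version and manifest, so check the manifest that ships with your release:

  livenessProbe:
    exec:
      command:
      - ./probe.sh            # assumed script path from the older manifest still in use here
    initialDelaySeconds: 60   # 30s was not enough on freshly joined spot nodes
    timeoutSeconds: 5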