datadog-agent: Datadog Agent Forwarder fails liveness probe when new spot instance joins cluster, causing multiple restarts

Output of the info page (if this is a bug)


==============
Agent (v6.0.0)
==============

  Status date: 2018-05-15 17:28:44.007680 UTC
  Pid: 326
  Python Version: 2.7.14
  Logs:
  Check Runners: 1
  Log Level: WARN

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    System UTC time: 2018-05-15 17:28:44.007680 UTC

  Host Info
  =========
    bootTime: 2018-05-15 17:21:48.000000 UTC
    kernelVersion: 4.4.115-k8s
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: 9.3
    procs: 72
    uptime: 414
    virtualizationRole: guest
    virtualizationSystem: xen

  Hostnames
  =========
    ec2-hostname: ip-172-31-99-8.us-west-2.compute.internal
    hostname: i-0820f63f3cb2ffdc0
    instance-id: i-0820f63f3cb2ffdc0
    socket-fqdn: cluster-datadog-w2pc6
    socket-hostname: cluster-datadog-w2pc6

=========
Collector
=========

  Running Checks
  ==============
    No checks have run yet

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  IntakeV1: 1
  RetryQueueSize: 0
  Success: 1

  API Keys status
  ===============
    https://6-0-0-app.agent.datadoghq.com,*************************4d886: API Key valid

==========
Logs-agent
==========

  logs-agent is not running

=========
DogStatsD
=========

  Event: 1

kubectl describe pod output

  Normal   Pulled                 3m               kubelet, ip-172-31-99-8.us-west-2.compute.internal  Successfully pulled image "datadog/agent:6.0.0"
  Warning  Unhealthy              2m               kubelet, ip-172-31-99-8.us-west-2.compute.internal  Liveness probe failed: Agent health: FAIL
=== 1 healthy components ===
healthcheck
=== 13 unhealthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, collector-queue, dockerutil-event-dispatch, dogstatsd-main, forwarder, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker
Error: found 13 unhealthy components
  Normal   Created    2m (x3 over 3m)  kubelet, ip-172-31-99-8.us-west-2.compute.internal  Created container
  Warning  Unhealthy  2m (x4 over 2m)  kubelet, ip-172-31-99-8.us-west-2.compute.internal  Liveness probe failed: Agent health: FAIL
=== 13 healthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, collector-queue, dockerutil-event-dispatch, dogstatsd-main, healthcheck, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker
=== 1 unhealthy components ===
forwarder
Error: found 1 unhealthy components
  Normal  Killing  2m (x2 over 2m)  kubelet, ip-172-31-99-8.us-west-2.compute.internal  Killing container with id docker://datadog:Container failed liveness probe.. Container will be killed and recreated.
  Normal  Pulled   2m (x2 over 2m)  kubelet, ip-172-31-99-8.us-west-2.compute.internal  Container image "datadog/agent:6.0.0" already present on machine
  Normal  Started  2m (x3 over 3m)  kubelet, ip-172-31-99-8.us-west-2.compute.internal  Started container

Describe what happened: We are running our cluster in AWS with master nodes as on-demand instances and worker nodes as spot instances. When we lose a spot node and a new one joins, the Datadog pod on that node frequently restarts, around 6 times, before it runs normally again. The restarts happen because the liveness probe fails while the forwarder is reported unhealthy.

Describe what you expected: The forwarder to be healthy when deployed on the fresh node.

Steps to reproduce the issue: On a Kubernetes 1.8.7 cluster with worker nodes running as spot instances in an autoscaling group, delete a worker node and wait for a replacement node to join the cluster. Once the new node has joined, monitor the Datadog pod scheduled on it.

Additional environment details (Operating System, Cloud provider, etc):
  Cloud Provider: AWS
  Kubernetes Version: 1.8.7
  Datadog Version: 6.0.0

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

@willyyang @gtrembathcafex I simply updated the timeouts of the liveness probe in the DaemonSet deployment YAML and I’m no longer experiencing any issues. I have some very high log-volume applications, so I thought that might be a factor. In my case I increased initialDelaySeconds from 15 to 30 and timeoutSeconds from 1 to 5.
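
For reference, a minimal sketch of what that adjustment looks like in the DaemonSet manifest, assuming the exec-based probe.sh check shipped with the Agent 6.0.0 manifests of that era; the container name and the fields other than the two adjusted timings are illustrative:

```yaml
# Hypothetical excerpt of the datadog-agent DaemonSet container spec;
# only the livenessProbe timings are the point here.
containers:
  - name: datadog-agent
    image: datadog/agent:6.0.0
    livenessProbe:
      exec:
        command:
          - ./probe.sh          # exec-based health check bundled in the agent image (assumed path)
      initialDelaySeconds: 30   # raised from the default 15
      timeoutSeconds: 5         # raised from the default 1
```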

We use a Datadog Agent k8s config from ~3 years ago, though the agent Docker image is recent (7.39.1). We had a similar error and had to increase initialDelaySeconds on the probe.sh livenessProbe to 60s (30s was not enough). I tried switching to the /health probe but it didn’t help.

The liveness check moved from an exec-based mechanism to an HTTP-based one 5 months ago.
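
For comparison, an HTTP-based probe along these lines is what newer Datadog manifests use. The sketch below is only indicative: the port (5555) and path (/live) assume the agent's built-in health endpoint in recent versions, so check the values against your chart or manifest version.

```yaml
# Hypothetical HTTP-based liveness probe; port 5555 and /live assume the
# agent's health endpoint (health_port) is enabled, as in recent agent versions.
livenessProbe:
  httpGet:
    path: /live
    port: 5555
  initialDelaySeconds: 15
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 6
```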

@hkaj Does this mean the daemonset should not be using probe.sh for the liveness check? If so what should the probe be?