datadog-agent: Datadog agent probe.sh should not depend on integration checks status for healthchecks/livenessProbe
Describe what happened:
When an integration check fails, the execution of `probe.sh` by docker healthchecks or kubernetes livenessProbes returns a non-zero exit code.
The datadog agent container is then terminated and restarted, potentially resulting in a crash loop.
Describe what you expected:
The datadog agent container should not restart when integration checks fail.
Steps to reproduce the issue:
- Configure an http check against a URL that does not exist or times out (see the example config after this list).
- Witness the datadog agent container being restarted in a loop.
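For illustration, a minimal `http_check` configuration along these lines might look as follows (file path and values are examples, not taken from the original report):

```yaml
# Example only -- not the reporter's actual configuration.
# conf.d/http_check.d/conf.yaml
init_config:

instances:
  - name: broken-endpoint        # hypothetical instance name
    url: http://10.255.255.1     # non-routable address, so the request hangs
    timeout: 120                 # longer than the 15s run period, keeping a check runner busy
```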
Additional environment details (Operating System, Cloud provider, etc):
- Official docker image 6.2.1
- Google Kubernetes Engine 1.8.12.gke0 with Container Optimized OS
Workarounds
- Do not use `probe.sh` in docker healthchecks or kubernetes livenessProbes.
- ~~Use `/opt/datadog-agent/bin/agent/agent status` instead of `probe.sh`~~ This does not resolve the crash loop when a check is failing.
- ~~When the check failure is caused by timeouts, increase `check_runners`/`DD_CHECK_RUNNERS` (#1805)~~ This does not resolve the crash loop even for checks timing out (tested with the 6.3.0-rc3 docker image).
About this issue
- State: closed
- Created 6 years ago
- Reactions: 4
- Comments: 29 (18 by maintainers)
Also seeing the same behavior as reported. We’re using the latest agent docker image on Google Cloud, Kubernetes version 1.10.4-gke.2.
The workaround we applied was to set the liveness probe to:
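(The original snippet is not preserved here; the sketch below assumes a TCP probe against the trace-agent port, similar to the approach described elsewhere in the thread.)

```yaml
# Hypothetical reconstruction -- not the commenter's exact manifest.
livenessProbe:
  tcpSocket:
    port: 8126          # default trace-agent port
  initialDelaySeconds: 15
  periodSeconds: 15
  failureThreshold: 3
```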
And to increase the check runners to 16:
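(Again a sketch, assuming the `DD_CHECK_RUNNERS` environment variable mentioned in the workarounds above.)

```yaml
env:
  - name: DD_CHECK_RUNNERS
    value: "16"
```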
I’ll monitor it over the coming days to see if it remains stable.
Hi @devillexio,
As per https://github.com/DataDog/datadog-agent/issues/1830#issuecomment-428487318, I’ve been using `probe.sh` for more than a year without issues, so I assumed this could be closed.

6.5.2 includes a revamp of the check scheduler, which makes healthcheck issues linked to the `collector-queue` component far less likely. Can you try to upgrade and see if things improve?

(Increasing the number of check runners can still be a workaround if this continues to occur, but the default has already been raised to 4 in this version.)
Once we saw `tagger-docker` get stuck, all the kubernetes metrics stopped being reported, including the kubelet_check, which was the first thing to page us. Kicking the pod would solve the issue until it happened again. The liveness probe is a bandaid, but now we have frequent datadog agent container restarts, which will still result in short windows of missing metrics. Either way we are stuck. (All of this behavior is new in datadog 6 compared to 5.)

Correct, it’s a workaround, and if there are more blocking checks than runners it will still fail. We plan on increasing the number of check runners to more than 10 (we’re still experimenting to find the best compromise here), but we need some improvements around check scheduling before that happens. This part has started and will be available soon.
Another workaround is to reduce your check timeouts to < 15 sec, or whatever you pick for the check run period. This allows checks to terminate on time, succeeding or otherwise. It’s not possible for every check, but for TCP/HTTP probes it should be fine.
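As a sketch, assuming the standard `timeout` option of the http_check/tcp_check integrations and the default 15-second run period:

```yaml
# Example only: cap the check timeout below the run period so a hanging
# endpoint fails fast instead of blocking a check runner.
instances:
  - name: slow-endpoint             # hypothetical instance
    url: https://example.com/health
    timeout: 10                     # below the 15s run period
```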
The long term fix is to change the logic of the collector-queue health check (health check as in internal health check, not the docker health check instruction), but this is waiting for other preliminary work to happen.
Flare received, thanks @pdecat. We’ll get back to you as soon as possible.
Hi @hkaj
I agree not using the probe.sh script is just a workaround.
For the time being, I’ve exposed the agent’s trace port:
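(The original snippets are missing from this archive; below is a plausible container-spec fragment, assuming the default trace-agent port 8126.)

```yaml
ports:
  - containerPort: 8126
    name: traceport
    protocol: TCP
```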
and switched to a tcp probe:
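(Again a sketch rather than the exact manifest.)

```yaml
livenessProbe:
  tcpSocket:
    port: 8126
  initialDelaySeconds: 15
  periodSeconds: 15
```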
It’s not only about http checks; the same goes for tcp checks as long as they time out, for example:
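(The original snippet is not preserved; an illustrative `tcp_check` instance of this kind could be:)

```yaml
# Example only -- a TCP check against an unreachable endpoint that hangs until its timeout.
init_config:

instances:
  - name: unreachable-service    # hypothetical name
    host: 10.255.255.1           # non-routable address
    port: 9999
    timeout: 120                 # longer than the run period, so the check runner blocks
```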
Hi @pdecat, thanks for the detailed report, and sorry you’re facing this problem.
We’re aware of the issue where `collector-queue` becomes unhealthy in some cases and gets the container killed, but we would prefer solving its root cause rather than stop relying on the probe altogether.

Can you tell us more about this http check? It’s not supposed to make the collector queue unhealthy, and I didn’t manage to reproduce it with the instructions you sent and agent 6.2.1. Actually, the easiest option to give us all the information would be to send us a flare: https://docs.datadoghq.com/agent/troubleshooting/#send-a-flare

Thanks again.