datadog-agent: Datadog agent probe.sh should not depend on integration checks status for healthchecks/livenessProbe
Describe what happened:
When an integration check fails, the execution of `probe.sh` by docker healthchecks or kubernetes livenessProbes returns a non-zero exit code.
The datadog agent container is then terminated and restarted, potentially resulting in a crash loop.
Describe what you expected:
The datadog agent container should not restart when integration checks fail.
Steps to reproduce the issue:
- Configure an http check against a URL that does not exist or times out (see the example config after this list).
- Witness the datadog agent container being restarted in a loop.
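For illustration, a minimal `http_check` configuration along these lines might look as follows (file path and values are examples, not taken from the original report):

```yaml
# Example only -- not the reporter's actual configuration.
# conf.d/http_check.d/conf.yaml
init_config:

instances:
  - name: broken-endpoint        # hypothetical instance name
    url: http://10.255.255.1     # non-routable address, so the request hangs
    timeout: 120                 # longer than the 15s run period, keeping a check runner busy
```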
Additional environment details (Operating System, Cloud provider, etc):
- Official docker image 6.2.1
- Google Kubernetes Engine 1.8.12.gke0 with Container Optimized OS
Workarounds
- Do not use `probe.sh` in docker healthchecks or kubernetes livenessProbes.
- ~~Use `/opt/datadog-agent/bin/agent/agent status` instead of `probe.sh`~~ This does not resolve the crash loop when a check is failing.
- ~~When the check failure is caused by timeouts, increase `check_runners`/`DD_CHECK_RUNNERS` (#1805)~~ This does not resolve the crash loop even for checks timing out (tested with the 6.3.0-rc3 docker image).
About this issue
- State: closed
- Created 6 years ago
- Reactions: 4
- Comments: 29 (18 by maintainers)
Also seeing the same behavior as reported. We’re using the latest agent docker image on Google Cloud, Kubernetes version 1.10.4-gke.2.
The workaround we applied was to set the liveness probe to:
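(The original snippet is not preserved here; the sketch below assumes a TCP probe against the trace-agent port, similar to the approach described elsewhere in the thread.)

```yaml
# Hypothetical reconstruction -- not the commenter's exact manifest.
livenessProbe:
  tcpSocket:
    port: 8126          # default trace-agent port
  initialDelaySeconds: 15
  periodSeconds: 15
  failureThreshold: 3
```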
And to increase the check runners to 16:
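(Again a sketch, assuming the `DD_CHECK_RUNNERS` environment variable mentioned in the workarounds above.)

```yaml
env:
  - name: DD_CHECK_RUNNERS
    value: "16"
```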
I’ll monitor it over the coming days to see if it remains stable.
Hi @devillexio,
As per https://github.com/DataDog/datadog-agent/issues/1830#issuecomment-428487318, I’ve been using `probe.sh` for more than a year without issues, so I assumed this could be closed.

6.5.2 includes a revamp of the check scheduler, which makes healthcheck issues linked to the `collector-queue` component far less likely. Can you try to upgrade and see if things improve?

(Increasing the number of check runners can still be a workaround if this continues to occur, but the default has already been raised to 4 in this version.)
Once we saw `tagger-docker` get stuck, all the kubernetes metrics stopped being reported, including the kubelet_check, which was the first thing to page us. Kicking the pod would solve the issue until it happened again. The liveness probe is a bandaid, but now we have frequent datadog agent container restarts, which will still result in short windows of missing metrics. Either way we are stuck. (All of this behavior is new in datadog 6 compared to 5.)

Correct, it’s a workaround, and if there are more blocking checks than runners it will still fail. We plan on increasing the number of check runners to more than 10 (we’re still experimenting to find the best compromise here), but we need some improvements around check scheduling before that happens. This part has started and will be available soon.
Another workaround is to reduce your check timeouts to < 15 sec, or whatever you pick for the check run period. This allows checks to terminate on time, succeeding or otherwise. It’s not possible for every check, but for TCP/HTTP probes it should be fine.
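As a sketch, assuming the standard `timeout` option of the http_check/tcp_check integrations and the default 15-second run period:

```yaml
# Example only: cap the check timeout below the run period so a hanging
# endpoint fails fast instead of blocking a check runner.
instances:
  - name: slow-endpoint             # hypothetical instance
    url: https://example.com/health
    timeout: 10                     # below the 15s run period
```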
The long term fix is to change the logic of the collector-queue health check (health check as in internal health check, not the docker health check instruction), but this is waiting for other preliminary work to happen.
Flare received, thanks @pdecat. We’ll get back to you as soon as possible.
Hi @hkaj
I agree not using the probe.sh script is just a workaround.
For the time being, I’ve exposed the agent’s trace port:
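(The original snippets are missing from this archive; below is a plausible container-spec fragment, assuming the default trace-agent port 8126.)

```yaml
ports:
  - containerPort: 8126
    name: traceport
    protocol: TCP
```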
and switched to a tcp probe:
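(Again a sketch rather than the exact manifest.)

```yaml
livenessProbe:
  tcpSocket:
    port: 8126
  initialDelaySeconds: 15
  periodSeconds: 15
```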
It’s not only about http checks; the same goes for tcp checks as long as they time out, for example:
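(The original snippet is not preserved; an illustrative `tcp_check` instance of this kind could be:)

```yaml
# Example only -- a TCP check against an unreachable endpoint that hangs until its timeout.
init_config:

instances:
  - name: unreachable-service    # hypothetical name
    host: 10.255.255.1           # non-routable address
    port: 9999
    timeout: 120                 # longer than the run period, so the check runner blocks
```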
Hi @pdecat, thanks for the detailed report, and sorry you’re facing this problem.
We’re aware of the issue where `collector-queue` becomes unhealthy in some cases and gets the container killed, but we would prefer solving its root cause rather than stop relying on the probe altogether.

Can you tell us more about this http check? It’s not supposed to make the collector queue unhealthy, and I didn’t manage to reproduce it with the instructions you sent and agent 6.2.1. Actually, the easiest option to give us all the information would be to send us a flare: https://docs.datadoghq.com/agent/troubleshooting/#send-a-flare

Thanks again.