fluentd: Fluentd stops processing logs, but keeps running
We are seeing an issue where fluentd will stop processing logs after some time, but the parent and child processes seem to be running normally.
We are running fluentd in a Docker container on a Kubernetes cluster, mounting the host's Docker log directory /var/log/containers into the container.
In a recent incident, we saw logs cease being forwarded to the sumologic output. Activity continued in the fluentd log for about 12 minutes after that point, but eventually it stopped picking up new logs as well (e.g. no more "following tail of…" messages). containers.log.pos continued being updated for 1 hour 13 minutes after the first sign of problems, and then stopped being updated too.
Killing the fluentd child process gets everything going again.
Config, strace, lsof and sigdump included below.
Details:
- fluentd or td-agent version: fluentd 0.12.37
- Environment information, e.g. OS: host 4.9.9-coreos-r1, container Debian Jessie
- Your configuration: see attachments
Attachments:
- fluentd config
- lsof of child process
- sigdump of child process
- strace of child process
- fluentd log
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 15
- Comments: 30 (6 by maintainers)
Commits related to this issue
- fluentd-cloudwatch: Fix Fluentd hanging https://github.com/fluent/fluentd/issues/1630 — committed to mfornasa/charts by mfornasa 6 years ago
- Fix Fluentd hanging https://github.com/fluent/fluentd/issues/1630 (#5511) * fluentd-cloudwatch: Fix Fluentd hanging https://github.com/fluent/fluentd/issues/1630 * Update Chart.yaml — committed to helm/charts by mfornasa 6 years ago
- Fix Fluentd hanging https://github.com/fluent/fluentd/issues/1630 (#5511) * fluentd-cloudwatch: Fix Fluentd hanging https://github.com/fluent/fluentd/issues/1630 * Update Chart.yaml — committed to or1can/charts by mfornasa 6 years ago
- Fix Fluentd hanging https://github.com/fluent/fluentd/issues/1630 (#5511) * fluentd-cloudwatch: Fix Fluentd hanging https://github.com/fluent/fluentd/issues/1630 * Update Chart.yaml Signed-off-by: ... — committed to dysnix/helm-charts by mfornasa 6 years ago
@repeatedly since applying enable_stat_watcher false we have not seen fluentd hang. Thank you for the insight! I'm not sure what the implications are here: is this something fluentd can work around, or is it a deeper issue with the libraries that ruby runs on?

I'm also seeing this for high-volume logs, even on the newest version of td-agent. enable_stat_watcher does not fix the issue. Any further information on this?

https://github.com/fluent/fluentd-docs/commit/22a128998e5a098f67d6f93980ff0cbd6d2cd23d
I just updated the in_tail article for enable_stat_watcher. I assume the problem is inotify scalability. Maybe inotify is not designed for monitoring lots of files with frequent access, or it is libev's inotify usage.
I see something similar as well on 0.14.20 for logs in /var/log/containers on a Kubernetes cluster. Eventually fluentd will not pick up any new logs in the folder. I have not gotten into looking at the pos file yet, but I can say that restarting fluentd will pick up the logs and catch up with past history, so I expect the pos file is not updating.
We are facing this issue with /var/log/containers/*.log from kubernetes; only a few logs get picked up…
To provide more datapoints here: we saw a similar issue where the systemd input plugin just stopped moving forward.

If you're using a pos file and running in Kubernetes, you could run a liveness command that does a stat on the log you're tailing and a stat on the pos file; if they diverge by more than some threshold x, return exit 1, and add that liveness check to your pod spec.
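A minimal sketch of such a probe, assuming the usual /var/log/containers layout and a pos file at /var/log/fluentd-containers.log.pos; the paths and the 300-second threshold are illustrative, not taken from any pod spec in this thread:

```yaml
# Hypothetical livenessProbe for the fluentd container, following the
# suggestion above: compare the newest container log's mtime with the
# pos file's mtime and fail if the pos file has fallen too far behind.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - |
        POS=/var/log/fluentd-containers.log.pos
        # newest log, dereferencing the /var/log/containers symlinks
        LOG=$(ls -Lt /var/log/containers/*.log 2>/dev/null | head -n 1)
        [ -n "$LOG" ] || exit 0   # nothing to tail yet
        [ -f "$POS" ] || exit 0   # pos file not created yet
        LOG_MTIME=$(stat -Lc %Y "$LOG")
        POS_MTIME=$(stat -c %Y "$POS")
        # healthy only if the pos file is at most 300s older than the log
        [ $((LOG_MTIME - POS_MTIME)) -le 300 ]
  initialDelaySeconds: 600
  periodSeconds: 60
```

A generous initialDelaySeconds matters here: after a restart fluentd may spend a while catching up on backlog before the pos file advances, and the probe should not kill it during that window.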
Sorry, I forgot to reply… From your log, one thread is stopped in read with inotify. I'm not sure which is the cause yet, but how about disabling inotify? If you set enable_stat_watcher false, fluentd doesn't use inotify's watcher for files.
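For reference, a minimal in_tail source sketch showing where that setting goes; the path, pos_file, tag, and format values are illustrative assumptions, not the configuration attached to this issue:

```
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  format json
  read_from_head true
  # Disable the inotify-based stat watcher; in_tail then relies on its
  # timer-based watcher to detect changes in the tailed files.
  enable_stat_watcher false
</source>
```

Per the explanation above, this trades inotify notifications for periodic polling, which several commenters report avoids the hang when watching many busy files.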