fluentd: Fluentd stops processing logs, but keeps running

We are seeing an issue where fluentd will stop processing logs after some time, but the parent and child processes seem to be running normally.

We are running fluentd in a Docker container on a Kubernetes cluster, with the host’s Docker log directory /var/log/containers mounted into the container.

In a recent incident, logs stopped being forwarded to the sumologic output. Activity continued in the fluentd log for another 12 minutes, and at some point after that fluentd stopped picking up new logs (no more “following tail of…” messages). containers.log.pos kept being updated for 1 hour 13 minutes after the first sign of problems, and then stopped being updated as well.

Killing the fluentd child process gets everything going again.

Config, strace, lsof and sigdump included below.

Details:

  • fluentd or td-agent version: fluentd 0.12.37

  • Environment information, e.g. OS: host 4.9.9-coreos-r1; container Debian jessie

  • Your configuration: see attachments

Attachments:

  • fluentd config
  • lsof of child process
  • sigdump of child process
  • strace of child process
  • fluentd log

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 15
  • Comments: 30 (6 by maintainers)

Most upvoted comments

@repeatedly since applying enable_stat_watcher false, we have not seen fluentd hang. Thank you for the insight! I’m not sure what the implications are here: is this something that fluentd can work around, or is it a deeper issue with the libraries that Ruby runs on?

I’m also seeing this for high-volume logs, even on the newest version of td-agent. Setting enable_stat_watcher false does not fix the issue. Any further information on this?

https://github.com/fluent/fluentd-docs/commit/22a128998e5a098f67d6f93980ff0cbd6d2cd23d

I just updated the in_tail article for enable_stat_watcher.

is it a deeper issue with the libraries that Ruby runs on?

I assume the problem is inotify scalability. Maybe inotify is not designed for monitoring lots of files with frequent access, or it could be libev’s inotify usage.
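
For reference, one way to check whether inotify limits or exhaustion are a factor is to compare the kernel’s configured limits against what fluentd is actually consuming. This is a general inotify diagnostic sketch, not something taken from this thread, and the PID 12345 is a placeholder for the fluentd child process.

# Show the kernel's inotify limits (max_queued_events, max_user_instances, max_user_watches)
sysctl fs.inotify

# Count inotify instances currently open across all processes
find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l

# Count watches registered by the fluentd child process (replace 12345 with the real PID)
cat /proc/12345/fdinfo/* 2>/dev/null | grep -c '^inotify wd:'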

I’m seeing something similar as well, on 0.14.20, for logs in /var/log/containers on a Kubernetes cluster. Eventually fluentd will not pick up any new logs in the folder. I have not gotten around to looking at the pos file yet. I can say that restarting fluentd will pick up the logs and catch up with past history, so I expect the pos file is not updating.

We are facing this issue with /var/log/containers/*.log from kubernetes; only a few logs get picked up…

To provide more datapoints here:

  • td-agent 3.7.1 (fluentd v1.10.2)
  • input is in_tail
  • output is out_s3
  • our symptoms are similar
    • when this does happen, the only way we know is by querying the log dataset shipped to S3 and finding gaps in the data
    • td-agent logging completely stops, even at the trace level

We’ve seen a similar issue where the systemd input plugin just stopped moving forward. If you’re using a pos file and running in Kubernetes, you could add a liveness check to your pod spec: run a command that stats the log you’re tailing and the pos file, and if they diverge by more than some threshold x, exit 1. A sketch of this is shown below.
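
Something along these lines could work as an exec liveness probe. It is only a sketch of the idea above; the paths, the 300-second threshold, and the probe timings are illustrative values, not taken from any config in this thread.

livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - |
        POS=/var/log/fluentd-containers.log.pos
        # -L follows the symlinks kubelet creates under /var/log/containers
        NEWEST_LOG=$(ls -Lt /var/log/containers/*.log 2>/dev/null | head -n 1)
        [ -z "$NEWEST_LOG" ] && exit 0   # nothing to tail yet, report healthy
        LOG_MTIME=$(stat -L -c %Y "$NEWEST_LOG")
        POS_MTIME=$(stat -c %Y "$POS" 2>/dev/null || echo 0)
        # Fail the probe if the newest log is more than 300s ahead of the pos
        # file, i.e. fluentd has stopped advancing its read position.
        [ $((LOG_MTIME - POS_MTIME)) -le 300 ]
  initialDelaySeconds: 60
  periodSeconds: 60

When the probe command keeps exiting non-zero, the kubelet restarts the container, which gives roughly the same effect as the manual restart workaround described earlier.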

Sorry, I forgot to reply… From your log, one thread is stopped in a read() call inside libev’s inotify handling:

Thread 4 (Thread 0x7fb809bff700 (LWP 20596)):
#0  0x00007fb81c8c9a9d in read () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007fb8177c8545 in infy_cb (loop=0x7fb814872900, w=<optimized out>, revents=<optimized out>)
    at ../libev/ev.c:4470
#2  0x00007fb8177c3ae3 in ev_invoke_pending (loop=0x7fb814872900) at ../libev/ev.c:3319
#3  0x00007fb8177c7271 in ev_run (loop=0x7fb814872900, flags=flags@entry=2) at ../libev/ev.c:3809
#4  0x00007fb8177ca242 in ev_loop (flags=2, loop=<optimized out>) at ../libev/ev.h:835
#5  Coolio_Loop_run_once (argc=<optimized out>, argv=<optimized out>, self=140428616634920) at loop.c:206
#6  0x00007fb81cc98996 in vm_call_cfunc_with_frame (th=0x7fb8125bac00, reg_cfp=0x7fb809d0d1c0, ci=<optimized out>)
    at vm_insnhelper.c:1510
#7  0x00007fb81cc9cd26 in vm_exec_core (th=th@entry=0x7fb8125bac00, initial=initial@entry=0) at insns.def:1025
#8  0x00007fb81cca1a91 in vm_exec (th=th@entry=0x7fb8125bac00) at vm.c:1427
#9  0x00007fb81cca5610 in invoke_block_from_c (th=0x7fb8125bac00, block=0x7fb815c7d7d0, self=<optimized out>, 
    argc=<optimized out>, argv=<optimized out>, blockptr=<optimized out>, cref=0x0, defined_class=140428701407320)
    at vm.c:817
#10 0x00007fb81cca5b1b in vm_invoke_proc (th=th@entry=0x7fb8125bac00, proc=proc@entry=0x7fb815c7d7d0, 
    self=140428616664240, defined_class=140428701407320, argc=0, argv=0x7fb80fd5b830, blockptr=0x0) at vm.c:881
#11 0x00007fb81cca5c08 in rb_vm_invoke_proc (th=th@entry=0x7fb8125bac00, proc=proc@entry=0x7fb815c7d7d0, 
    argc=<optimized out>, argv=<optimized out>, blockptr=blockptr@entry=0x0) at vm.c:900
#12 0x00007fb81ccbaadf in thread_start_func_2 (th=th@entry=0x7fb8125bac00, stack_start=<optimized out>)
    at thread.c:535
#13 0x00007fb81ccbaeb3 in thread_start_func_1 (th_ptr=0x7fb8125bac00) at thread_pthread.c:863
#14 0x00007fb81c8c3064 in start_thread (arg=0x7fb809bff700) at pthread_create.c:309
#15 0x00007fb81bebc62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

I’m not sure which is the cause yet, but how about disabling inotify? If you set enable_stat_watcher false, fluentd doesn’t use the inotify-based watcher for files.
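
For example, an in_tail source with the inotify watcher disabled might look like the following. This is a minimal sketch in fluentd v1 / td-agent 3 syntax (the original reporter was on 0.12.37, where a format parameter would be used instead of the parse section); the path, tag and pos_file values are illustrative, not taken from the attached config.

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  # Disable the inotify-based watcher; in_tail falls back to periodic
  # stat() polling of the tailed files instead.
  enable_stat_watcher false
  <parse>
    @type json
  </parse>
</source>

With the stat watcher disabled, new data is detected by the timer-driven watcher, so pickup can lag slightly but no longer depends on inotify events being delivered.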