fluentd: in_tail pos file stops updating after log rotate
Sometimes (every few weeks), during a rotation of the log file it is tailing, fluentd stops updating its .pos file. Even a day later, when the log file rotates again, the pos file is still not updated. Although the pos file is not being updated, the file is still being consumed and output is still being produced. The pos file contains an offset only a few lines into the new log file, plus the inode number of the new log file.
The doubled "detected rotation"/"following tail" messages did not occur on previous days; however, I have seen them on another server, where they did not appear to cause a problem.
Restarting the service (`service td-agent restart`) fixes the problem, but it then reads the log from the start, which is less than ideal.
Version: td-agent 2.3.3 x64; OS: Red Hat 6.8
/var/log/td-agent/td-agent.log.1
2016-12-20 03:08:03 +0000 [info]: detected rotation of /logs/application.log; waiting 5 seconds
2016-12-20 03:08:03 +0000 [info]: following tail of /logs/application.log
2016-12-20 03:08:03 +0000 [info]: detected rotation of /logs/application.log; waiting 5 seconds
2016-12-20 03:08:03 +0000 [info]: following tail of /logs/application.log
2016-12-20 03:11:37 +0000 [info]: force flushing buffered events
/var/log/td-agent/td-agent.log.2.gz
2016-12-19 03:27:03 +0000 [info]: detected rotation of /logs/application.log; waiting 5 seconds
2016-12-19 03:27:03 +0000 [info]: following tail of /logs/application.log
2016-12-19 03:32:12 +0000 [info]: force flushing buffered events
/var/log/td-agent/application.log.pos:
/logs/application.log 000000000000f16d 00000000000a419e
The pos file was last updated at log rotation:
ll /var/log/td-agent/application.log.pos
-rw-r--r--. 1 td-agent td-agent 80 Dec 20 03:08 application.log.pos
$ stat /logs/application.log
File: `/logs/application.log'
Size: 2188501633 Blocks: 4274432 IO Block: 4096 regular file
Device: fd00h/64768d Inode: 672158 Links: 1
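For reference, the second and third fields of the pos entry above are the byte offset and the inode, hex-encoded. Decoding them (a quick Ruby sketch) shows the recorded offset is only about 61 KB into a roughly 2.1 GB file, while the inode matches the new log file's inode (672158) from the stat output above:
```ruby
# Quick check of the pos entry fields above (offset and inode are hex-encoded).
offset = "000000000000f16d".to_i(16)   # => 61805 bytes: only a few lines into the file
inode  = "00000000000a419e".to_i(16)   # => 672158: matches the inode of the new log file
puts "offset=#{offset} inode=#{inode}"
```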
strace of the fluentd process
[pid 15119] inotify_add_watch(14, "/logs/application.log", IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE|IN_DELETE_SELF|IN_MOVE_SELF|IN_DONT_FOLLOW|IN_MASK_ADD) = 102215122
[pid 15119] statfs("/logs/application.log", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=5845471, f_bfree=4475644, f_bavail=4177046, f_files=1493280, f_ffree=1288943, f_fsid={-615535995, 789078184}, f_namelen=255, f_frsize=4096}) = 0
[pid 15119] lstat("/logs/application.log", {st_mode=S_IFREG|0644, st_size=2100168079, ...}) = 0
[pid 15119] stat("/logs/application.log", {st_mode=S_IFREG|0644, st_size=2100168079, ...}) = 0
[pid 15119] read(25, "[20/Dec/2016:09:31:59.667731 +00"..., 2048) = 252
[pid 15119] read(25, "", 2048) = 0
[pid 15119] lseek(25, 0, SEEK_CUR) = 2100168079
[pid 15119] clock_gettime(CLOCK_MONOTONIC, {4149597, 999687012}) = 0
[pid 15119] epoll_wait(13, {{EPOLLIN, {u32=14, u64=4294967310}}}, 64, 123) = 1
[pid 15119] clock_gettime(CLOCK_MONOTONIC, {4149597, 999728739}) = 0
[pid 15119] read(14, "\321\255\27\6\0\200\0\0\0\0\0\0\0\0\0\0", 287) = 16
[pid 15119] clock_gettime(CLOCK_MONOTONIC, {4149597, 999767484}) = 0
[pid 15119] epoll_wait(13, {{EPOLLIN, {u32=14, u64=4294967310}}}, 64, 123) = 1
[pid 15119] clock_gettime(CLOCK_MONOTONIC, {4149598, 19991561}) = 0
[pid 15119] read(14, "\322\255\27\6\2\0\0\0\0\0\0\0\0\0\0\0", 287) = 16
[pid 15119] lstat("/logs/application.log", {st_mode=S_IFREG|0644, st_size=2100168990, ...}) = 0
[pid 15119] inotify_rm_watch(14, 102215122) = 0
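For what it's worth, the two 16-byte reads from the inotify fd (14) above can be decoded as inotify_event structs. A sketch, assuming the standard struct layout on little-endian x86_64:
```ruby
# Sketch: decode the two 16-byte inotify_event structs read from fd 14 above
# (int wd; uint32 mask; uint32 cookie; uint32 len), little-endian x86_64 assumed.
IN_MODIFY  = 0x00000002
IN_IGNORED = 0x00008000

["\321\255\27\6\0\200\0\0\0\0\0\0\0\0\0\0",                # first read(14, ...)
 "\322\255\27\6\2\0\0\0\0\0\0\0\0\0\0\0"].each do |raw|    # second read(14, ...)
  wd, mask, cookie, len = raw.b.unpack("lL3")
  printf("wd=%d mask=0x%08x cookie=%d len=%d\n", wd, mask, cookie, len)
end
# wd=102215121 mask=0x00008000 ...  -> IN_IGNORED for an older watch descriptor
# wd=102215122 mask=0x00000002 ...  -> IN_MODIFY for the watch added above,
#                                      which inotify_rm_watch then removes
```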
Config:
<source>
@type tail
format none
path /logs/application.log
pos_file /var/log/td-agent/application.log.pos
tag system.event
</source>
<filter system.event>
@type parser
format /\[SYSEVT\] (?<data>.+)/
key_name message
suppress_parse_error_log
</filter>
<filter system.event>
@type parser
format json
key_name data
</filter>
<match system.event>
@type kafka_buffered
brokers kafka:9092
disable_retry_limit true
flush_interval 1s
compression_codec gzip
exclude_partition_key true
</match>
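To make the filter chain concrete, here is a rough illustration of what the two parser filters do before a record reaches the kafka output. The input line is hypothetical, and this is not the actual parser plugin code:
```ruby
# Illustrative only: roughly what the two <filter system.event> stages do to a
# hypothetical record whose "message" field is '[SYSEVT] {...}'.
require "json"

message = '[SYSEVT] {"user":"alice","action":"login"}'    # hypothetical log line
if (m = /\[SYSEVT\] (?<data>.+)/.match(message))           # first filter: extract "data"
  record = JSON.parse(m[:data])                            # second filter: parse "data" as JSON
  p record   # => {"user"=>"alice", "action"=>"login"} is roughly what reaches Kafka
end
```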
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Reactions: 1
- Comments: 23 (11 by maintainers)
Any update on this? We are having the same issue in our environments…
I’ve tested in my environment and reproduced it with the following script. The script can reproduce this issue sometimes.
I guess the cause of this issue is a race condition around `StatWatcher`. There are 3 events when the rotation sequence runs at once (within a few milliseconds), see https://github.com/fluent/fluentd/issues/1383#issuecomment-306456252:
- the old file is renamed away (IN_MOVE_SELF)
- a new file is created (IN_CREATE)
- the new file is written to (IN_MODIFY)

These events invoke `StatWatcher#on_change`, `TailWatcher#on_rotate` and `TailWatcher#swap_state`. `TailWatcher#swap_state` swaps in a `MemoryPositionEntry` that has the same information (but is a different instance); after that, the position file is not updated.
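A minimal sketch of that failure mode, assuming the swap described above (class names are borrowed from in_tail, but this is not Fluentd's actual code):
```ruby
# Minimal sketch of the suspected failure mode (not Fluentd's actual code): once the
# file-backed position entry is swapped for an in-memory one carrying the same data,
# later offset updates never reach the .pos file even though tailing continues.
class FilePositionEntry
  def initialize(io)
    @io = io
  end

  def update(inode, pos)                 # persisted: rewrites the .pos entry on disk
    @io.seek(0)
    @io.write(format("/logs/application.log\t%016x\t%016x\n", pos, inode))
  end
end

class MemoryPositionEntry
  def update(inode, pos)                 # in memory only: nothing is written to disk
    @inode = inode
    @pos = pos
  end
end

entry = FilePositionEntry.new(File.open("application.log.pos", File::RDWR | File::CREAT))
# on rotation, the race swaps in a memory entry holding the same information:
entry = MemoryPositionEntry.new
entry.update(672158, 2_100_168_079)      # the .pos file stays frozen at the old offset
```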