fluentd: in_tail pos file stops updating after log rotate

Occasionally (every few weeks), during a rotation of the log file it is tailing, fluentd stops updating its .pos file. Even a day later, when the log file rotates again, the pos file is still not updated. Despite this, the file is still being consumed and output is still being produced. The pos file contains the inode number of the new log file and an offset a few lines into it.

The doubled "detected rotation"/"following tail" log lines did not appear on previous days. I have seen the same doubling on another server, where it did not appear to cause a problem.

Restarting the service (service td-agent restart) fixes the problem, but it then reads the log from the start, which is less than ideal.

Version: td-agent 2.3.3 x64
OS: Red Hat 6.8

/var/log/td-agent/td-agent.log.1

2016-12-20 03:08:03 +0000 [info]: detected rotation of /logs/application.log; waiting 5 seconds
2016-12-20 03:08:03 +0000 [info]: following tail of /logs/application.log
2016-12-20 03:08:03 +0000 [info]: detected rotation of /logs/application.log; waiting 5 seconds
2016-12-20 03:08:03 +0000 [info]: following tail of /logs/application.log
2016-12-20 03:11:37 +0000 [info]: force flushing buffered events

/var/log/td-agent/td-agent.log.2.gz

2016-12-19 03:27:03 +0000 [info]: detected rotation of /logs/application.log; waiting 5 seconds
2016-12-19 03:27:03 +0000 [info]: following tail of /logs/application.log
2016-12-19 03:32:12 +0000 [info]: force flushing buffered events

/var/log/td-agent/application.log.pos: /logs/application.log 000000000000f16d 00000000000a419e

The pos file was last updated at the time of the log rotation:

ll /var/log/td-agent/application.log.pos
-rw-r--r--. 1 td-agent td-agent   80 Dec 20 03:08 application.log.pos
$ stat /logs/application.log 
  File: `/logs/application.log'
  Size: 2188501633	Blocks: 4274432    IO Block: 4096   regular file
Device: fd00h/64768d	Inode: 672158      Links: 1
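
For reference, the pos file's three tab-separated fields are the path, the byte offset (in hex), and the inode (in hex). Decoding the values quoted above (a quick Ruby check):

```ruby
# Decode the tab-separated pos file line quoted above:
# path, byte offset (hex), inode (hex).
line = "/logs/application.log\t000000000000f16d\t00000000000a419e"
path, offset_hex, inode_hex = line.split("\t")

offset = offset_hex.to_i(16)
inode  = inode_hex.to_i(16)

puts "#{path}: offset=#{offset} inode=#{inode}"
# inode 672158 matches the Inode reported by stat above, so the pos file
# did pick up the new file's inode before it stopped updating.
```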

strace of the fluentd process

[pid 15119] inotify_add_watch(14, "/logs/application.log", IN_MODIFY|IN_ATTRIB|IN_MOVED_FROM|IN_MOVED_TO|IN_CREATE|IN_DELETE|IN_DELETE_SELF|IN_MOVE_SELF|IN_DONT_FOLLOW|IN_MASK_ADD) = 102215122
[pid 15119] statfs("/logs/application.log", {f_type="EXT2_SUPER_MAGIC", f_bsize=4096, f_blocks=5845471, f_bfree=4475644, f_bavail=4177046, f_files=1493280, f_ffree=1288943, f_fsid={-615535995, 789078184}, f_namelen=255, f_frsize=4096}) = 0
[pid 15119] lstat("/logs/application.log", {st_mode=S_IFREG|0644, st_size=2100168079, ...}) = 0
[pid 15119] stat("/logs/application.log", {st_mode=S_IFREG|0644, st_size=2100168079, ...}) = 0
[pid 15119] read(25, "[20/Dec/2016:09:31:59.667731 +00"..., 2048) = 252
[pid 15119] read(25, "", 2048)          = 0
[pid 15119] lseek(25, 0, SEEK_CUR)      = 2100168079
[pid 15119] clock_gettime(CLOCK_MONOTONIC, {4149597, 999687012}) = 0
[pid 15119] epoll_wait(13, {{EPOLLIN, {u32=14, u64=4294967310}}}, 64, 123) = 1
[pid 15119] clock_gettime(CLOCK_MONOTONIC, {4149597, 999728739}) = 0
[pid 15119] read(14, "\321\255\27\6\0\200\0\0\0\0\0\0\0\0\0\0", 287) = 16
[pid 15119] clock_gettime(CLOCK_MONOTONIC, {4149597, 999767484}) = 0
[pid 15119] epoll_wait(13, {{EPOLLIN, {u32=14, u64=4294967310}}}, 64, 123) = 1
[pid 15119] clock_gettime(CLOCK_MONOTONIC, {4149598, 19991561}) = 0
[pid 15119] read(14, "\322\255\27\6\2\0\0\0\0\0\0\0\0\0\0\0", 287) = 16
[pid 15119] lstat("/logs/application.log", {st_mode=S_IFREG|0644, st_size=2100168990, ...}) = 0
[pid 15119] inotify_rm_watch(14, 102215122) = 0

Config:

<source>
  @type tail
  format none
  path /logs/application.log
  pos_file /var/log/td-agent/application.log.pos
  tag system.event
</source>

<filter system.event>
  @type parser
  format /\[SYSEVT\] (?<data>.+)/
  key_name message
  suppress_parse_error_log true
</filter>

<filter system.event>
  @type parser
  format json
  key_name data
</filter>



<match system.event>
  @type kafka_buffered
  brokers kafka:9092
  disable_retry_limit true
  flush_interval 1s
  compression_codec gzip
  exclude_partition_key true
</match>
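
For what it's worth, the two parser filters above chain as follows: the first extracts the payload after a `[SYSEVT]` prefix into `data`, and the second parses that payload as JSON. A rough Ruby simulation, using a hypothetical sample line:

```ruby
require 'json'

# Hypothetical sample line; the [SYSEVT] prefix and JSON payload follow
# the regexp in the first <filter> above.
message = '[SYSEVT] {"event":"login","user":"alice"}'

# Filter 1: format /\[SYSEVT\] (?<data>.+)/ on key_name message
m = message.match(/\[SYSEVT\] (?<data>.+)/)
data = m[:data]

# Filter 2: format json on key_name data
record = JSON.parse(data)
puts record["event"]  # => "login"
```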

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 1
  • Comments: 23 (11 by maintainers)

Most upvoted comments

Any update on this? We are having the same issue in our environments…

I’ve tested in my environment and reproduced the issue with the following script. It does not trigger every time, but it reproduces the issue after some iterations.

#!/bin/bash
# Repeatedly rotate /tmp/application.log and check that the inode recorded
# in fluentd's pos file catches up with the inode of the new file.
while true; do
    rm -f /tmp/application.log.1
    mv /tmp/application.log /tmp/application.log.1
    touch /tmp/application.log
    t=$(date +%s.%N)
    echo "some stuff ${t}" >> /tmp/application.log
    # Third tab-separated field of the pos file is the inode in hex.
    inode1=$(cut -f 3 /tmp/application.log.pos)
    # Inode of the new file, zero-padded to the pos file's 16-digit hex format.
    inode2=$(printf '%016x' "$(stat -c %i /tmp/application.log)")

    echo "$inode1" "$inode2" "$t"
    if [ "$inode1" != "$inode2" ]; then
        exit
    fi
    #sleep 0.5
done

I guess the cause of this issue is a race condition in StatWatcher. Three events fire when the rotation sequence runs at once (within a few milliseconds). https://github.com/fluent/fluentd/issues/1383#issuecomment-306456252

  1. Move log file (IN_MOVE_SELF)
  2. Create log file (IN_CREATE)
  3. Modify log file (IN_MODIFY)

These events invoke StatWatcher#on_change, TailWatcher#on_rotate and TailWatcher#swap_state.

When the position file has not been updated yet, TailWatcher#swap_state swaps in a MemoryPositionEntry that holds the same information (but is a different instance), so subsequent position updates never reach the pos file on disk.
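
As I read the code, the practical difference between the two entry types is roughly the following (a simplified sketch, not fluentd's actual classes; names are analogues of FilePositionEntry/MemoryPositionEntry):

```ruby
require 'stringio'

# Simplified sketch: the file-backed entry persists every position update
# into the pos file, while the memory entry only keeps state in instance
# variables.
class FileEntrySketch
  def initialize(io)
    @io = io
  end

  def update(inode, pos)
    @io.rewind
    @io.write(format("%016x\t%016x", pos, inode))  # written to the pos file
  end
end

class MemoryEntrySketch
  def update(inode, pos)
    @inode, @pos = inode, pos  # in memory only; the pos file is untouched
  end
end

# If the rotation race leaves TailWatcher holding a memory entry, tailing
# continues normally but the on-disk pos file is never updated again.
pos_io = StringIO.new
FileEntrySketch.new(pos_io).update(672158, 61805)
MemoryEntrySketch.new.update(672158, 61805)
puts pos_io.string  # only the file-backed entry wrote anything
```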