fluentd: Fluentd in_tail "unreadable" file causes "following tail of " to stop and no logs pushed

Describe the bug

After a warning of an “unreadable” (likely due to rotation), no more logs were pushed (in_tail + pos_file). Reloading config or restarting fluentd sorts the issue. All other existing files being tracked continued to work as expected.

To Reproduce

Not able to reproduce at will.

Expected behavior

Logs to be pushed as usual after file rotation as fluentd recovers from the temporary “unreadable” file.

Your Environment

- Fluentd version: 1.14.4
- TD Agent version: 4.2.0
- Operating system: Ubuntu 20.04.3 LTS
- Kernel version: 5.11.0-1022-aws

Your Configuration

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/es-containers.log.pos
  tag kubernetes.*
  <parse>
    @type regexp
    expression /^(?<time>[^ ]+) (?<stream>[^ ]+) (?<logtag>[^ ]+) (?<log>.+)$/
    time_key time
    time_format %Y-%m-%dT%H:%M:%S.%N%z
  </parse>
  read_from_head true
</source>

Your Error Log

Relevant log entries with some context. When "detected rotation of ..." isn't followed by a "following tail of ..." then log file contents aren't being processed/pushed:

2022-02-01 01:26:33 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 01:26:33 +0000 [info]: #0 following tail of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log
2022-02-01 01:32:53 +0000 [warn]: #0 /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log unreadable. It is excluded and would be examined next time.
2022-02-01 01:32:54 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 01:32:54 +0000 [info]: #0 following tail of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log
2022-02-01 01:38:04 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 01:44:44 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 01:53:15 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
[...]
---- after issuing a config reload (kill -SIGUSR2 <pid>) it starts to work fine again, i.e. "following tail of ..." 
2022-02-01 11:36:19 +0000 [info]: Reloading new config
2022-02-01 11:36:19 +0000 [info]: using configuration file: <ROOT>
[...]
2022-02-01 11:36:19 +0000 [info]: shutting down input plugin type=:tail plugin_id="object:c6c0"
[...]
2022-02-01 11:36:19 +0000 [info]: adding source type="tail"
2022-02-01 11:36:19 +0000 [info]: #0 shutting down fluentd worker worker=0
2022-02-01 11:36:19 +0000 [info]: #0 shutting down input plugin type=:tail plugin_id="object:c6c0"
[...]
2022-02-01 11:36:20 +0000 [info]: #0 restart fluentd worker worker=0
---- the entry below seems to be related with the actual underlying issue... the ruby object which stopped pushing logs has now been terminated as a new one was created
2022-02-01 11:36:20 +0000 [warn]: #0 /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log already exists. use latest one: deleted #<struct Fluent::Plugin::TailInput::Entry path="/var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log", pos=10740032, ino=1797715, seek=1530>
2022-02-01 11:36:20 +0000 [info]: #0 following tail of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log
2022-02-01 11:37:30 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 11:37:30 +0000 [info]: #0 following tail of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log
2022-02-01 11:43:20 +0000 [info]: #0 detected rotation of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log; waiting 5 seconds
2022-02-01 11:43:20 +0000 [info]: #0 following tail of /var/log/containers/gateway-mtcw6_atom_pcms-gateway-e54ea9ee115f1127682b0d92e37404e4b1693c14edeea93e485f70b3110eecfc.log

Additional context

This issue seems be related with #3586 but unfortunately I didn’t check the pos file while the issue was happening so can’t tell if it presented unexpected values for the failing file.

About this issue

Original URL
State: closed
Created 2 years ago
Reactions: 4
Comments: 73 (40 by maintainers)

Commits related to this issue

in_tail: Show more information on skipping update_watcher In #3614 some users reported that in_tail rarely stops tailing. It seems it's caused when skipping update_watcher due to unexpected duplicate... — committed to fluent/fluentd by ashie 2 years ago
in_tail: Fix for unexpected file close after logs rotate in update_watcher and stop_watcher The filesystem reuse of inodes and fluent using inode as the "key" for maintaining watches. However some 'w... — committed to kattz-kawa/fluentd by kattz-kawa a year ago
in_tail: Fix for unexpected file close after logs rotate in update_watcher and stop_watcher The filesystem reuse of inodes and fluent using inode as the "key" for maintaining watches. However some 'w... — committed to kattz-kawa/fluentd by kattz-kawa a year ago
in_tail: Fix for unexpected file close after logs rotate in update_watcher and stop_watcher Some 'watcher' method is using 'path' as the key in fluentd v1.16.1. It is strange to manage with path in f... — committed to kattz-kawa/fluentd by kattz-kawa a year ago
test: in_tail: Add tests for rotation with follow_inodes Many problems related to rotation have been reported. These tests are to reproduce the problems with follow_inodes, especially talked in #3614... — committed to daipom/fluentd by daipom a year ago
in_tail: Ensure to detach correct watcher on rotation with follow_inodes If `refresh_watchers` run before `update_watcher`, the old implementation of `update_watcher` detach wrongly the new TailWatch... — committed to daipom/fluentd by daipom a year ago
in_tail: Ensure to detach correct watcher on rotation with follow_inodes If `refresh_watchers` run before `update_watcher`, the old implementation of `update_watcher` detach wrongly the new TailWatch... — committed to daipom/fluentd by daipom a year ago
in_tail: Ensure to detach correct watcher on rotation with follow_inodes If `refresh_watchers` run before `update_watcher`, the old implementation of `update_watcher` detach wrongly the new TailWatch... — committed to daipom/fluentd by daipom a year ago
in_tail: Fix a stall bug on !follow_inode case Fix #3614 Although known stall issues of in_tail on `follow_inode` case are fixed in v1.16.2, it has still a similar problem on `!follow_inode` case. ... — committed to fluent/fluentd by ashie 8 months ago
in_tail: Fix a stall bug on !follow_inode case Fix #3614 Although known stall issues of in_tail on `follow_inode` case are fixed in v1.16.2, it has still a similar problem on `!follow_inode` case. ... — committed to fluent/fluentd by ashie 8 months ago
in_tail: Fix a stall bug on !follow_inode case Fix #3614 Although known stall issues of in_tail on `follow_inode` case are fixed in v1.16.2, it has still a similar problem on `!follow_inode` case. ... — committed to fluent/fluentd by ashie 8 months ago
in_tail: Fix a stall bug on !follow_inode case (#4327) Fix #3614 Although known stall issues of in_tail on `follow_inode` case are fixed in v1.16.2, it has still a similar problem on `!follow_ino... — committed to fluent/fluentd by ashie 8 months ago
in_tail: Fix a stall bug on !follow_inode case (#4327) Fix #3614 Although known stall issues of in_tail on `follow_inode` case are fixed in v1.16.2, it has still a similar problem on `!follow_ino... — committed to daipom/fluentd by ashie 8 months ago

Most upvoted comments

Thanks for your report!

And, fluentd reported Skip update_watcher because watcher has been already updated by other inotify event following detected rotation of /var/log/server.log; waiting 5 seconds. The message seems to be reported by https://github.com/fluent/fluentd/blob/v1.13.3/lib/fluent/plugin/in_tail.rb#L482. It seems to be strongly related position file having two entries.

To tell the truth, I was suspecting it but I couldn’t confirm it because I can’t yet reproduce it. Your report it very helpful for me.

Although I’m not yet sure the mechanism of this issue, you might be able to avoid this issue by disabling stat watcher (enable_stat_watcher false).

ashie on Mar 17, 2022

We will check https://github.com/fluent/fluentd/pull/4185. I would appreciate it if you could check https://github.com/fluent/fluentd/pull/4208 too.

Sorry, “We will check https://github.com/fluent/fluentd/pull/4185” was a typo ^^; I will check #4208!

masaki-hatada on Jun 21, 2023

I created a test case in test/plugin/test_in_tail.rb for just playing around with this exact event:

sub_test_case "multiple log rotations" do
    data(
      "128" => ["128-0.log", 128, "msg"],
      "256" => ["256-0.log", 256, "msg"],
      "512" => ["512-0.log", 512, "msg"],
      "1024" => ["1024-0.log", 1024, "msg"],
      "2048" => ["2048-0.log", 2048, "msg"],
      "4096" => ["4096-0.log", 4096, "msg"],
      "8192" => ["8192-0.log", 8192, "msg"],
    )

    def test_reproduce_err_after_rotations(data)
      file, num_lines, msg = data

      File.open("#{@tmp_dir}/#{file}", 'wb') do |f|
        num_lines.times do
          f.puts "#{msg}\n"
        end
      end

      conf = config_element("", "", {
        "path" => "#{@tmp_dir}/*.log.link",
        "pos_file" => "#{@tmp_dir}/tail.pos",
        "refresh_interval" => "1s",
        "read_from_head" => "true",
        "format" => "none",
        "rotate_wait" => "1s",
        "follow_inodes" => "true",
        "tag" => "t1",
      })

      link_name="#{@tmp_dir}/#{num_lines}.log.link"

      File.symlink(file, link_name)

      dl_opts = {log_level: Fluent::Log::LEVEL_DEBUG}
      logdev = $stdout
      logger = ServerEngine::DaemonLogger.new(logdev, dl_opts)
      log_instance = Fluent::Log.new(logger)

      rotations = 5
      rot_now = 1

      d = create_driver(conf, false)
      d.instance.log = log_instance
      d.run(timeout: 30) do
        sleep 1

        assert_equal(num_lines, d.record_count)

        # rotate logs
        while rot_now <= rotations do
          sleep 2

          puts "unlink #{link_name}"
          File.unlink(link_name)
          puts "symlink #{num_lines}-#{rot_now}.log #{link_name}"
          File.symlink("#{num_lines}-#{rot_now}.log", link_name)

          sleep 1

          File.open("#{@tmp_dir}/#{num_lines}-#{rot_now}.log", 'wb') do |f|
            num_lines.times do
              f.puts "#{msg}\n"
            end
          end

          assert_equal(num_lines*rot_now, d.record_count)

          rot_now = rot_now + 1
        end
      end
    end
  end

In this case, it seems to be working properly. But maybe we can help each-other in reproducing the error?

UPDATE: Changed to multiple files of different sizes, and changed log rotation to how kubelet actually does it. UPDATE 2: Code updated! Now it seems I’ve got something that somehow reproduces the error, in this very state the tests works fine. But if you comment out the "follow_inodes" => "true", line, the error comes up and the log line Skip update_watcher because watcher has been.... mentioned by @ashie above, is logged all the time. So I think that the follow_inodes option is not only important for preventing duplicate log messages, but also for tail on wildcards, and symlinks to work properly!

artheus on Aug 23, 2022

@vparfonov Could you provide your in_tail config? Do you use follow_inode true?

ashie on Aug 1, 2022

We downgraded td-agent 4.3.0 (fluentd v1.14.3) to td-agent 4.2.0 (fluentd v1.13.3), and still have problems. We broadened monitoring targets to 500 servers (approx 1500 files) from 200 servers (approx 400 files) as a result of this issue. Then, we found the 2 servers (2 files) having this issue after a week from td-agent downgraded. Upgrading fluentd 1.13.3 -> 1.14.3 does not seem to be the cause, sorry.

It seemed that some of our log files were affected and some were not. The differences are as follows.

rotation policy; the unaffected files were using a time based rotation policy (e.g. *.log-%Y%m%d-%s), while the affected files were using a fixed window policy (e.g. *.%i.log). The fixed window policy takes longer to rotate than the time based rotation policy (because it requires renaming all the previous files).
rotation period; unaffected files are rotated in 30 minutes, while affected files are rotated in about 7 minutes. However, it may be just a matter of probability since the more rotations we do, the higher the probability that we will get this problem.

We are planning to change rotation policy and retry upgrading td-agent. And, we will comment some updates if we get.

mizunashi-mana on Feb 22, 2022

We have encountered the same problem twice. Last month we upgraded fluentd from 1.13.3 (td-agent 4.2.0) to 1.14.3 (td-agent 4.3.0), and then we have got the problem.

In our case, we are tracking more than 3 log files with the tail plugin, and only the file with the highest log flow has this problem. From the metrics produced by prometheus plugin, the file position always fluctuates by at least 400000/sec and the rotation happens at least once every 10 minutes.

The input source is set as follows:

<source>
  @type tail
  format ...
  path /var/log/server.log
  pos_file /var/log/td-agent/server.log.pos
  tag input.server.log
</source>

I will also share the record of the investigation when the problem occurs. First, fluentd suddenly, without warning, stops tracking new rotated files and only logs the following in the stdout log.

detected rotation of /var/log/server.log; waiting 5 seconds

Normally, it would have been followed by an announcement that a new file was being tracked, as follows

The following tail of /var/log/server.log

However, we have not seen this announcement at all after the problem occurs. Also, when I view the file descriptors that fluentd is tracking, I see the following:

$ sudo ls "/proc/$(pgrep -f 'td-agent.*--under-supervisor')/fd" -l | grep server
lr-x------ 1 td-agent td-agent 64 Feb 5 01:41 49 -> /var/log/server.3.log

This is the rotated file, and we see that fluentd is still tracking the old file, not the new one. Also, looking at the position file, we see the following:

$ sudo cat /var/log/td-agent/server.log.pos 
/var/log/server.log 000000002001f2d0 000000003a000042
/var/log/server.log 0000000000000000 0000000000000000

As far as the situation is concerned, it seems that something triggers fluentd to stop tracking new files altogether during rotate. We have about 300 servers running 24/7 and have only had two problems in the past month, so it seems to be pretty rare for problems to occur. However, since we haven’t had any problems with fluentd 1.13.3 for 6 months, it seems likely that there was some kind of regression in fluentd 1.13.3 -> 1.14.3.

We are planning to downgrade fluentd.

mizunashi-mana on Feb 14, 2022

Not yet confirmed for follow_inode false case. Reopen.

The mechanism of #4190 doesn’t depend on follow_inode, so it definitely affects to follow_inode false case too and #4208 should fixes it. I believe #4190 is the root cause of this issue.

I’ll close this issue after we check it.

ashie on Jul 10, 2023

@masaki-hatada

https://github.com/fluent/fluentd/pull/4208 is the replacement of https://github.com/fluent/fluentd/pull/4185, isn’t it? We will check https://github.com/fluent/fluentd/pull/4185.

I would appreciate it if you could check #4208 too. I want to merge one of #4208 and #4185. (I will review #4191 later) Both would improve this problem. I think #4208 would be a more direct solution, but I want to hear opinions.

(Thank you for adding me and @kattz-kawa as Co-author ^_^)

I can create #4208 thanks to #4185 and #4191! Until these PRs were created, I had no idea what was causing this problem. Thank you all for your contributions!

daipom on Jun 21, 2023

Reply to @kattz-kawa and @k-keiichi-rh

Ideally, when follow_inode is on, all things should be tracked by inode, instead of partial by inode, partial by path. So, I vote for your PR (I actually want to do a similar PR 😃 )

In recent days, I’m heavily testing your PR. It runs good with a high log throughput and high log file rotation rate (running in k8s , log rotation pattern : old log file will be renamed and new log file will be created with same name) My testing env is 20K rows / sec, 2KB / row, ~10 sec log rotation interval. No log dup, no log missing.

garyzjq on Jun 4, 2023

* Using inode as keys might have side effect, we might experience such case in the past.
  * I don't yet remember the detail. Probably one of side effect is log duplication.

On the other hand, certainly I also think that it is strange to manage with path in follow_inode true case. In addition, probably the case we experienced before is mixed other conditions. I’m digging out it.

ashie on May 17, 2023

we have a customer who claims they applied the following patch and the issue goes away. I do not have a reliable way to reproduce the problem. We are using v1.14.6

This is my proposed patch for the customer.

@jcantrill Thank you for validating this patch.

I’m going to walk back my statement to some after applying the proposed patch and doing similar rudimentary testing. The patch demonstrates an improvement with “missed” file rotations. I can not speak, however, to whether:

@ashie

This issue is observed by Jeff’s primitive test code. Additionally this issue also is observed on the popular enviroment like as using ‘logrotate’ program. (such as k8s)

I attach the simple reproducer program using ‘logrotate’.

the reproducible way is below:

Test environment:
  RHEL8.7 on KVM(2 vCPUs, 4GB memory)

Kernel version:
  Linux 4.18.0-425.3.1.el8.x86_64

I tested it on RHEL8.7(kernel-4.18.0-425.3.1.el8.x86_64) KVM, 
However this procedure could run even on another Linux since it runs as a container.
Please replace all "podman" string to "docker" in test_run.sh , this include in reproducer.tar.gz, 
if your linux has no podman command.

Steps to reproduce:

1. Untar 'reproducer.tar.gz' on your machine

# tar zxf reproducer.tar.gz

2. Run 'test_run.sh' script

# nohup bash test_run.sh &

3. Wait until the "The issue has happend." message is printed in the 'lrt.log'
　 'lrt.log' will be generated on the current directory.
　 Run 'tail -f' command to monitor that "The issue has happend" message is printed on 'lrt.log'.

    Ex) The issue has happened. Mon Mar 13 07:23:16 JST 2023

4. Confirm that fluentd has already closed '/var/log/pods/test/0.log' although the file still exists.

     1) Check INODE of '/var/log/pods/test/0.log'

        # podman exec -it reproducer /bin/bash
        # ls -li /var/log/pods/test/0.log

     2) Check Fluentd never track the '/var/log/pods/test/0.log'

        # lsof -p <fluentd pid> | grep <INODE of 0.log file>
          => It's closed although Fluentd has to collect the '/var/log/pods/test/0.log'

     3) Check the position file if the '/var/log/pods/test/0.log' should be tracked

        # awk "{printf(\"%s\t%s\t%s\n\", \$1, strtonum(\"0x\"\$2), strtonum(\"0x\"\$3))}"  /var/lib/fluentd/pos/es-containers.log.pos | grep <INODE of 0.log file>
          => If the inode exists, it means the log file should be collected by Fluentd

        ex)
           /var/log/pods/test/0.log        20165   486593998

     4) Check if the '/var/log/pods/test/0.log' is not tracked any more
  
        # kill -s SIGCONT {logger.sh pid}
        => Resume to send test logs to the '/var/log/pods/test/0.log'

        However Fluent won't reopen/reread the '/var/log/pods/test/0.log'
        As far as I can see this issue happen if 'follow_inodes' set 'true'.

I will try to create a pull request on this issue later. Could you review it? I hope this attached program is helpful to you to validate my patch. reproducer.tar.gz

kattz-kawa on May 16, 2023

@jcantrill Thanks for your information! We’ll check it.

ashie on Feb 24, 2023

@ashie I believe the issue is because the filesystem reuse of inodes and fluent using inode as the “key” for maintaining watches. Our customer provided the following where then position entry is converted to decimal:

-- Pos file
/var/log/pods/openshift-image-registry_node-ca-k6thz_78bc195f-591c-4af5-afb1-a31c601850e9/node-ca/0.log 18446744073709551616    195064732
/var/log/pods/openshift-image-registry_node-ca-k6thz_78bc195f-591c-4af5-afb1-a31c601850e9/node-ca/0.log 18446744073709551616    195064733
/var/log/pods/openshift-image-registry_node-ca-k6thz_78bc195f-591c-4af5-afb1-a31c601850e9/node-ca/0.log 58444023        195064732
/var/log/pods/openshift-image-registry_node-ca-k6thz_78bc195f-591c-4af5-afb1-a31c601850e9/node-ca/0.log 35228450        195064763

Note there are multiple entries with the same inode where one is identified as “unwatched” and the other is still open. The logic then closes the watch on the inode even though it is still in use

jcantrill on Feb 22, 2023

@ashie we have a customer who claims they applied the following patch and the issue goes away. I do not have a reliable way to reproduce the problem. We are using v1.14.6

--- in_tail.rb.org      2023-02-13 17:04:42.353997075 +0900
+++ in_tail.rb  2023-02-13 17:04:39.030016491 +0900
@@ -354,13 +354,15 @@

     def existence_path
       hash = {}
-      @tails.each_key {|target_info|
-        if @follow_inodes
-          hash[target_info.ino] = target_info
-        else
-          hash[target_info.path] = target_info
-        end
-      }
+      if @follow_inodes
+        @tails.each {|ino, tw|
+          hash[tw.ino] = TargetInfo.new(tw.path, tw.ino)
+        }
+      else
+        @tails.each {|path, tw|
+          hash[tw.path] = TargetInfo.new(tw.path, tw.ino)
+        }
+      end
       hash
     end

@@ -441,8 +443,11 @@

       begin
         target_info = TargetInfo.new(target_info.path, Fluent::FileWrapper.stat(target_info.path).ino)
-        @tails.delete(target_info)
-        @tails[target_info] = tw
+        if @follow_inodes
+          @tails[target_info.ino] = tw
+        else
+          @tails[target_info.path] = tw
+        end
         tw.on_notify
       rescue Errno::ENOENT, Errno::EACCES => e
         $log.warn "stat() for #{target_info.path} failed with #{e.class.name}. Drop tail watcher for now."
@@ -462,9 +467,17 @@
     def stop_watchers(targets_info, immediate: false, unwatched: false, remove_watcher: true)
       targets_info.each_value { |target_info|
         if remove_watcher
-          tw = @tails.delete(target_info)
+          if @follow_inodes
+            tw = @tails.delete(target_info.ino)
+          else
+            tw = @tails.delete(target_info.path)
+          end
         else
-          tw = @tails[target_info]
+          if @follow_inodes
+            tw = @tails[target_info.ino]
+          else
+            tw = @tails[target_info.path]
+          end
         end
         if tw
           tw.unwatched = unwatched
@@ -478,10 +491,19 @@
     end

     def close_watcher_handles
-      @tails.keys.each do |target_info|
-        tw = @tails.delete(target_info)
-        if tw
-          tw.close
+      if @follow_inodes
+        @tails.keys.each do |ino|
+          tw = @tails.delete(ino)
+          if tw
+            tw.close
+          end
+        end
+      else
+        @tails.keys.each do |path|
+          tw = @tails.delete(path)
+          if tw
+            tw.close
+          end
         end
       end
     end
@@ -500,26 +522,25 @@
       end

       rotated_target_info = TargetInfo.new(target_info.path, pe.read_inode)
-      rotated_tw = @tails[rotated_target_info]
-      new_target_info = target_info.dup

       if @follow_inodes
+        rotated_tw = @tails[rotated_target_info.ino]
+        new_target_info = target_info.dup
         new_position_entry = @pf[target_info]

         if new_position_entry.read_inode == 0
           # When follow_inodes is true, it's not cleaned up by refresh_watcher.
           # So it should be unwatched here explicitly.
           rotated_tw.unwatched = true
-          # Make sure to delete old key, it has a different ino while the hash key is same.
-          @tails.delete(rotated_target_info)
-          @tails[new_target_info] = setup_watcher(new_target_info, new_position_entry)
-          @tails[new_target_info].on_notify
+          @tails[new_target_info.ino] = setup_watcher(new_target_info, new_position_entry)
+          @tails[new_target_info.ino].on_notify
         end
       else
-        # Make sure to delete old key, it has a different ino while the hash key is same.
-        @tails.delete(rotated_target_info)
-        @tails[new_target_info] = setup_watcher(new_target_info, pe)
-        @tails[new_target_info].on_notify
+        rotated_tw = @tails[rotated_target_info.path]
+        new_target_info = target_info.dup
+
+        @tails[new_target_info.path] = setup_watcher(new_target_info, pe)
+        @tails[new_target_info.path].on_notify
       end
       detach_watcher_after_rotate_wait(rotated_tw, pe.read_inode) if rotated_tw
     end

--- position_file.rb.org        2023-02-13 15:02:14.671608849 +0900
+++ position_file.rb    2023-02-13 15:06:09.098223852 +0900
@@ -250,20 +250,6 @@
       end
     end

-    TargetInfo = Struct.new(:path, :ino) do
-      def ==(other)
-        return false unless other.is_a?(TargetInfo)
-        self.path == other.path
-      end
-
-      def hash
-        self.path.hash
-      end
-
-      def eql?(other)
-        return false unless other.is_a?(TargetInfo)
-        self.path == other.path
-      end
-    end
+    TargetInfo = Struct.new(:path, :ino)
   end
 end

our typical config is:

<source>
  @type tail
  @id container-input
  path "/var/log/pods/*/*/*.log"
  exclude_path ["/var/log/pods/openshift-logging_collector-*/*/*.log", "/var/log/pods/openshift-logging_elasticsearch-*/*/*.log", "/var/log/pods/openshift-logging_kibana-*/*/*.log", "/var/log/pods/openshift-logging_*/loki*/*.log", "/var/log/pods/*/*/*.gz", "/var/log/pods/*/*/*.tmp"]
  pos_file "/var/lib/fluentd/pos/es-containers.log.pos"
  follow_inodes true
  refresh_interval 5
  rotate_wait 5
  tag kubernetes.*
  read_from_head "true"
  skip_refresh_on_startup true
  @label @CONCAT
  <parse>
    @type regexp
    expression /^(?<@timestamp>[^\s]+) (?<stream>stdout|stderr) (?<logtag>[F|P]) (?<message>.*)$/
    time_key '@timestamp'
    keep_time_key true
  </parse>
</source>

One significant difference I do see with the original report of the issue and how we collect logs is the symlink. Collecting from /var/log/pods which is the preferred kubernetes location AFAIK does not utilize symlinks as was done with /var/log/containers.

jcantrill on May 16, 2023

For what’s worth: after adding enable_stat_watcher false back in March (as mentioned by @ashie here) none of my 120+ servers experienced this issue again.

salavessa on Jul 8, 2022