telegraf: Permission denied in the middle of the logrotate process

Relevant telegraf.conf

[[inputs.tail]]
  files = ["/var/log/nginx/access.log"]
  data_format = "grok"
  grok_patterns = ["%{COMBINED_LOG_FORMAT}"]

Logs from Telegraf

2023-04-27T00:00:02+03:00 E! [inputs.tail] Tailing "/var/log/nginx/access.log": Unable to open file /var/log/nginx/access.log: open /var/log/nginx/access.log: permission denied

System info

Telegraf 1.26.1 (git: HEAD@8f421f25), Docker 20.10.21, Ubuntu 22.04.2 LTS

Docker

Relevant part of the docker-compose.yml config. Group 4 is adm, which has read access to the log files.

services:
    telegraf:
        image: telegraf
        volumes:
            - /var/log/nginx:/var/log/nginx:ro
        group_add:
            - "4"

Steps to reproduce

  1. Start Telegraf with the tail input on a log file;
  2. Rotate the log file daily using the logrotate utility;
  3. Eventually tail will stop with a permission denied error. In my case it stopped after 9 days of daily log rotation.

Expected behavior

Retry tailing the file after failure.

Actual behavior

Tail stops after the error.

Additional info

The bug is not related to the file permissions themselves: I am able to start Telegraf and obtain the logs as expected.

  File: /var/log/nginx/access.log
  Size: 285582    	Blocks: 568        IO Block: 4096   regular file
Device: fd00h/64768d	Inode: 524853      Links: 1
Access: (0640/-rw-r-----)  Uid: (   33/www-data)   Gid: (    4/     adm)
Access: 2023-04-27 09:07:29.001913262 +0300
Modify: 2023-04-27 12:32:01.672785512 +0300
Change: 2023-04-27 12:32:01.672785512 +0300
 Birth: 2023-04-27 00:00:02.631913631 +0300

But due to (un-)lucky circumstances, the tail plugin was trying to read the file at exactly the same time logrotate was doing its job.

Apr 27 00:00:01 docker-host systemd[1]: Starting Rotate log files...
Apr 27 00:00:03 docker-host systemd[1]: logrotate.service: Deactivated successfully.
Apr 27 00:00:03 docker-host systemd[1]: Finished Rotate log files.
Apr 27 00:00:03 docker-host systemd[1]: logrotate.service: Consumed 1.022s CPU time.

As far as I understand the internal workings of logrotate, it first creates the new log file and then applies chmod and chown. It seems the tail plugin failed in the middle of this process and stopped sending the logs. Restarting Telegraf in the morning resolved the problem.
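
The logrotate configuration itself is not included in the report; a typical Ubuntu stanza for nginx looks roughly like the one below (an assumption for illustration, not taken from the affected host). The create directive is what recreates the log file and then fixes up its owner, group, and mode, which is the window the tail plugin appears to have hit.

# Hypothetical /etc/logrotate.d/nginx, not from the affected host
/var/log/nginx/*.log {
        daily
        missingok
        rotate 14
        compress
        delaycompress
        notifempty
        # Recreate the log file, then set owner/group/mode; a reader can hit
        # the file between creation and the final ownership/permissions.
        create 0640 www-data adm
        sharedscripts
        postrotate
                # signal nginx to reopen its log files, e.g. nginx -s reopen
        endscript
}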

It would be nice to have a built-in retry mechanism for such cases. Actually, tail does retry, but only when the file does not exist, rather than when the file exists but is (temporarily) not accessible: https://github.com/influxdata/tail/blob/master/tail.go#L194
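
To illustrate the distinction, here is a small standalone Go sketch (not the actual influxdata/tail code): a retry gate that only checks for a missing file never fires for a permission error, which Go reports as a different error kind.

package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

func main() {
	_, err := os.Open("/var/log/nginx/access.log")
	if err == nil {
		fmt.Println("opened fine")
		return
	}
	// A retry condition like os.IsNotExist(err) covers a missing file...
	fmt.Println("file missing?     ", errors.Is(err, fs.ErrNotExist))
	// ...but a freshly rotated file that exists with the wrong owner or mode
	// fails with fs.ErrPermission instead, so such a retry never triggers.
	fmt.Println("permission denied?", errors.Is(err, fs.ErrPermission))
}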

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 15 (10 by maintainers)

Most upvoted comments

One possible option we talked about today: the tail plugin is a service plugin, meaning it gathers data as it needs to rather than using the gather cycle. We could shift to using the gather cycle to retry files that may have failed. We would need to keep a list of failed files, retry them during gather, and remove them once they start working.
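
For context, this is roughly the shape of Telegraf's input interfaces (a simplified sketch from memory, not copied from the codebase): a service input pushes data from Start, but Telegraf still calls its Gather on every interval, which is where such a retry could live.

// Simplified sketch of Telegraf's input plugin interfaces, from memory;
// Accumulator's methods are elided since they do not matter for the retry idea.
package sketch

type Accumulator interface{}

type Input interface {
	SampleConfig() string
	// Gather is invoked once per collection interval.
	Gather(acc Accumulator) error
}

type ServiceInput interface {
	Input
	// Start launches the plugin's own collection loop (e.g. tailing files),
	// so Gather is typically a no-op -- but it still runs every interval and
	// could retry files that previously failed to open.
	Start(acc Accumulator) error
	Stop()
}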

Thanks for the PR.

Deployed it to my host.

2023-11-28T11:02:55+02:00 I! Starting Telegraf 1.29.0-4b867c55 brought to you by InfluxData the makers of InfluxDB

So far no problems. I also tried to run logrotate manually, but I'm not sure it actually did anything. Let's see how it behaves during the real nightly logrotate.

I will let you know if I note anything strange.

@dayvar14,

Happy to see a PR! We don't have anything quite like this already. Essentially, I think what we wanted to do was:

  1. Catch this specific error and add the file to a slice of failed files
  2. In the Gather function, check whether that slice is non-empty; if so, go through those files and retry them
  3. If the read starts to succeed, remove the file from the slice and continue reading, essentially re-initializing on that one file
  4. If it fails, try again during the next call to Gather

This would help with short-term issues like these rotation hiccups: instead of failing outright, we would at least keep retrying. A rough sketch of the idea follows.
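
A minimal sketch of steps 1 through 4, assuming a pared-down plugin struct rather than Telegraf's actual tail plugin code (names such as Tail, failedFiles, and tailFile are made up for illustration):

package sketch

import "sync"

// Tail is a pared-down stand-in for the tail input plugin.
type Tail struct {
	mu          sync.Mutex
	failedFiles map[string]struct{} // step 1: files whose tailing failed
}

// tailFile is a placeholder for whatever opens a file and starts tailing it.
func (t *Tail) tailFile(path string) error {
	// ... open the file, seek, and begin following it ...
	return nil
}

// recordFailure is called when tailing a file errors out (e.g. permission denied).
func (t *Tail) recordFailure(path string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.failedFiles == nil {
		t.failedFiles = make(map[string]struct{})
	}
	t.failedFiles[path] = struct{}{}
}

// Gather runs on every collection interval (step 2) and retries failed files.
func (t *Tail) Gather() error {
	t.mu.Lock()
	defer t.mu.Unlock()
	for path := range t.failedFiles {
		if err := t.tailFile(path); err != nil {
			// Step 4: still failing, keep it for the next interval.
			continue
		}
		// Step 3: the file is readable again, so stop retrying it.
		delete(t.failedFiles, path)
	}
	return nil
}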