telegraf: Permission denied in the middle of the logrotate process
Relevant telegraf.conf
[[inputs.tail]]
  files = ["/var/log/nginx/access.log"]
  data_format = "grok"
  grok_patterns = ["%{COMBINED_LOG_FORMAT}"]
Logs from Telegraf
2023-04-27T00:00:02+03:00 E! [inputs.tail] Tailing "/var/log/nginx/access.log": Unable to open file /var/log/nginx/access.log: open /var/log/nginx/access.log: permission denied
System info
Telegraf 1.26.1 (git: HEAD@8f421f25), Docker 20.10.21, Ubuntu 22.04.2 LTS
Docker
Relevant part of the docker-compose.yml config. Group 4 is adm, which has read access to the log files.
services:
  telegraf:
    image: telegraf
    volumes:
      - /var/log/nginx:/var/log/nginx:ro
    group_add:
      - "4"
Steps to reproduce
- Start Telegraf with `tail` on a log file;
- Rotate the log file daily using the `logrotate` utility;
- Eventually `tail` will stop with a permission denied error. In my case it stopped after 9 days of daily log rotation.
Expected behavior
Retry tailing the file after failure.
Actual behavior
Tail stops after the error.
Additional info
The bug is not related to file permissions: I'm able to start Telegraf and obtain the logs as expected.
File: /var/log/nginx/access.log
Size: 285582 Blocks: 568 IO Block: 4096 regular file
Device: fd00h/64768d Inode: 524853 Links: 1
Access: (0640/-rw-r-----) Uid: ( 33/www-data) Gid: ( 4/ adm)
Access: 2023-04-27 09:07:29.001913262 +0300
Modify: 2023-04-27 12:32:01.672785512 +0300
Change: 2023-04-27 12:32:01.672785512 +0300
Birth: 2023-04-27 00:00:02.631913631 +0300
But due to (un)lucky circumstances, the `tail` plugin was trying to read the file at exactly the same time `logrotate` did its job.
Apr 27 00:00:01 docker-host systemd[1]: Starting Rotate log files...
Apr 27 00:00:03 docker-host systemd[1]: logrotate.service: Deactivated successfully.
Apr 27 00:00:03 docker-host systemd[1]: Finished Rotate log files.
Apr 27 00:00:03 docker-host systemd[1]: logrotate.service: Consumed 1.022s CPU time.
As far as I understand the internal workings of `logrotate`, it first creates the new log file and then applies `chmod` and `chown`. It seems the `tail` plugin failed in the middle of this process and stopped sending logs. Restarting Telegraf in the morning resolved the problem.
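To illustrate the suspected window, here is a minimal Go sketch of that sequence. It assumes logrotate really does create the file before fixing ownership and mode, as described above; this is not logrotate's actual code, and the uid/gid values are taken from the stat output.

```go
package main

import "os"

// Sketch of the suspected race window, not logrotate's actual code:
// the new log file briefly exists with root-only permissions before
// chown/chmod run, so a concurrent open by a group-4 (adm) reader
// fails with EACCES during that window.
func main() {
	path := "/var/log/nginx/access.log"

	// Step 1: logrotate (running as root) creates the new, empty log file.
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY, 0600)
	if err != nil {
		panic(err)
	}
	f.Close()

	// <-- a reader opening the file right here gets "permission denied"

	// Step 2: ownership and mode are fixed up afterwards
	// (33 = www-data, 4 = adm, as in the stat output above).
	if err := os.Chown(path, 33, 4); err != nil {
		panic(err)
	}
	if err := os.Chmod(path, 0640); err != nil {
		panic(err)
	}
}
```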
But it would be nice to have a built-in retry mechanism for such cases.
Actually, `tail` does retry, but only when the file does not exist, not when the file exists but is (temporarily) inaccessible:
https://github.com/influxdata/tail/blob/master/tail.go#L194
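In other words (a minimal sketch of the behavior described above, not the actual influxdata/tail code; `openOrClassify` is a hypothetical helper for illustration), only the missing-file branch gets a retry:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// openOrClassify sketches the open-error handling described above:
// "file does not exist" triggers a wait-and-retry path, while any
// other error, including a transient EACCES, is treated as fatal.
func openOrClassify(path string) (*os.File, error) {
	f, err := os.Open(path)
	if err == nil {
		return f, nil
	}
	if errors.Is(err, fs.ErrNotExist) {
		// tail's retry path: watch for the file to appear, then reopen
		fmt.Println("file missing: waiting for it to appear")
		return nil, err
	}
	// fs.ErrPermission and everything else falls through here,
	// and tailing stops for good.
	return nil, fmt.Errorf("unable to open file: %w", err)
}

func main() {
	if _, err := openOrClassify("/var/log/nginx/access.log"); err != nil {
		fmt.Println(err)
	}
}
```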
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 15 (10 by maintainers)
One possible option we talked about today: the tail plugin is a service plugin, meaning it gathers data as it needs to rather than on the gather cycle. We could shift to using the gather cycle to retry files that may have failed. We would need to keep a list of failed files, re-try them during gather, and remove them if they start working.
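For context, here is a simplified sketch mirroring (not copying) Telegraf's plugin interfaces; `Accumulator` is an illustrative stand-in for `telegraf.Accumulator`:

```go
// A service plugin collects data in the background between gather
// cycles, so its Gather method is normally close to a no-op -- which
// makes Gather a convenient place to hang a retry.
package main

// Accumulator stands in for telegraf.Accumulator in this sketch.
type Accumulator interface {
	AddFields(measurement string, fields map[string]interface{}, tags map[string]string)
}

// Input is the ordinary plugin shape: all work happens inside Gather,
// which Telegraf calls once per gather interval.
type Input interface {
	Gather(acc Accumulator) error
}

// ServiceInput additionally runs in the background; the tail plugin
// starts goroutines in Start and pushes lines as they arrive.
type ServiceInput interface {
	Input
	Start(acc Accumulator) error
	Stop()
}

func main() {}
```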
Thanks for the PR.
Deployed it to my host.
So far no problems. Also tried to manually run `logrotate`, but I'm not sure if it actually did anything. Let's see how it behaves during the real nightly logrotate. I will let you know if I notice anything strange.
@dayvar14,
Happy to see a PR! We don't have something precisely like this already. Essentially I think what we wanted to do was: keep a slice of files that failed to open, then at the start of the `Gather` function check whether that slice is non-empty and, if so, go through those files and re-try them. This would help with short-term issues like these rotation issues, so we do not fail but at least do a retry.
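A hypothetical sketch of that retry loop (not the actual PR; the type, field, and method names here are invented for illustration):

```go
package main

import (
	"fmt"
	"os"
	"sync"
)

// Tail is a stripped-down stand-in for the tail plugin, keeping only
// the state needed for the retry idea described above.
type Tail struct {
	mu          sync.Mutex
	failedFiles map[string]struct{} // files whose open failed (e.g. EACCES mid-rotation)
}

// recordFailure remembers a file that could not be opened so that a
// later Gather cycle can retry it instead of giving up.
func (t *Tail) recordFailure(path string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.failedFiles == nil {
		t.failedFiles = make(map[string]struct{})
	}
	t.failedFiles[path] = struct{}{}
}

// Gather normally does nothing for a service plugin; here it doubles
// as the retry point: re-open each failed file and drop it from the
// set once it becomes readable again.
func (t *Tail) Gather() error {
	t.mu.Lock()
	defer t.mu.Unlock()
	for path := range t.failedFiles {
		f, err := os.Open(path)
		if err != nil {
			continue // still failing; keep it for the next cycle
		}
		f.Close()
		delete(t.failedFiles, path) // accessible again: resume tailing it
	}
	return nil
}

func main() {
	t := &Tail{}
	t.recordFailure("/var/log/nginx/access.log")
	_ = t.Gather() // called once per gather interval by Telegraf
	fmt.Println("files still failing:", len(t.failedFiles))
}
```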