opentelemetry-collector: Hostmetrics Receiver throws 'error reading process name' error for process scraper
Describe the bug
I’m using the OTEL contrib collector to collect host metrics from an Ubuntu machine. However, when I add the process scraper to the hostmetrics receiver config, it logs an "error reading process name … permission denied" error for seemingly every PID on my system.
I haven’t specifically reproduced this with the core collector but, since the issue is with the core hostmetrics receiver, I assume the bug is present in the core collector as well.
Steps to reproduce
- Assume root on the Ubuntu machine:
sudo -s
- Download v0.24.0 of the contrib collector deb file (otel-contrib-collector_0.24.0_amd64.deb)
- Install contrib collector:
dpkg --install otel-contrib-collector_0.24.0_amd64.deb
- Configure it to collect host metrics (specifically, process data) via the hostmetrics receiver and process scraper
What did you expect to see?
No errors
What did you see instead?
Every minute, an error message is generated complaining about "error reading process name … permission denied" for seemingly every PID on the machine:
Apr 23 15:34:37 ip-10-249-29-79 otelcontribcol[…]: 2021-04-23T15:34:37.264-0400 error scraperhelper/scrapercontroller.go:206 Error scraping metrics {"kind": "receiver", "name": "hostmetrics", "error": "[error reading process name for pid 1: readlink /proc/1/exe: permission denied; error reading process name for pid 2: readlink /proc/2/exe: permission denied; error reading process name for pid 3: readlink /proc/3/exe: permission denied; …]"}
What version did you use?
v0.24.0 of the contrib collector (otel-contrib-collector_0.24.0_amd64.deb)
What config did you use?
receivers:
  hostmetrics:
    collection_interval: 1m
    scrapers:
      process:
...
Environment
OS: Ubuntu 18.04
Additional context
N/A
About this issue
- State: closed
- Created 3 years ago
- Reactions: 7
- Comments: 19
expected/non-harmful error? The fact that it is generating ~7.5 MB of spam per day in both my daemon log and syslog sure seems harmful to me (not to mention my syslog is now unreadable). That’s 450 MB of spam per month per Google Cloud instance I am running. Not sure how this made it to GA as is and replaced the Legacy agent on Google Cloud 😦
I looked around and it looks like the errors I’m getting are caused by the receiver trying to read exe file names of kernel threads that don’t have file names. It was not a permission error. I don’t think these cases should be treated as errors.
Whether a certain process is a kernel thread can be checked on Linux using the 9th field (flags) of the /proc/[pid]/stat file: it is a kernel thread if the 0x00200000 bit (PF_KTHREAD) is set. However, this check is not supported by gopsutil. I think the options include:
- getting gopsutil to support this check
- getting gopsutil to ignore readlink errors
What do y’all think?
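As a rough illustration of the stat-based check described above, here is a minimal Go sketch. It assumes the caller has already read the contents of /proc/[pid]/stat into a string; the function name isKernelThread and the shortened sample lines are hypothetical and are not part of gopsutil or the collector.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pfKthread is the PF_KTHREAD bit in the Linux per-task flags word.
const pfKthread = 0x00200000

// isKernelThread (hypothetical helper) takes the contents of a
// /proc/<pid>/stat file and reports whether the flags field (field 9)
// has the PF_KTHREAD bit set.
func isKernelThread(stat string) (bool, error) {
	// Field 2 (comm) is wrapped in parentheses and may itself contain
	// spaces or parentheses, so split on the LAST closing parenthesis.
	end := strings.LastIndexByte(stat, ')')
	if end < 0 {
		return false, fmt.Errorf("malformed stat line: no closing parenthesis")
	}
	fields := strings.Fields(stat[end+1:])
	// fields[0] is field 3 (state), so field 9 (flags) is fields[6].
	if len(fields) < 7 {
		return false, fmt.Errorf("malformed stat line: only %d fields after comm", len(fields))
	}
	flags, err := strconv.ParseUint(fields[6], 10, 32)
	if err != nil {
		return false, fmt.Errorf("parsing flags field: %w", err)
	}
	return flags&pfKthread != 0, nil
}

func main() {
	// Shortened, made-up stat lines for illustration only.
	samples := []string{
		"2 (kthreadd) S 0 0 0 0 -1 2129984 0 0 0 0",
		"1 (systemd) S 0 1 1 0 -1 4194560 0 0 0 0",
	}
	for _, s := range samples {
		kt, err := isKernelThread(s)
		fmt.Println(kt, err)
	}
}
```

A scraper could use a check like this to skip the exe lookup for kernel threads instead of reporting an error for each one.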
Note that in Google Cloud, this is an expected/non-harmful error (though I think it should still be fixed):
Hey @codeconsole, FWIW: I dropped the following configuration in rsyslog.d to make my syslog readable (and avoid storing these messages).
/etc/rsyslog.d/99-exclude-otel.conf
systemctl restart rsyslog
This doesn’t address the larger issue here but it would make syslog more usable in the meantime…
How did you solve this error?
I think silently discarding all process scrape errors is the best option. There might be other cases (SELinux, AppArmor, containers, network filesystems, Windows kernel protections, etc.) that cause parts of a process’s status to be unreadable. While a user might conceivably want to debug that, it happens on ~all systems for one reason or another, so it shouldn’t be part of anything that gets logged by default.