kubernetes: Kubelet spamming 'Unable to fetch pod log stats' log messages after running cronjobs
What happened?
I have a cronjob that runs a Python script every 5 minutes. After the cronjob finishes running, I start seeing a bunch of log messages in my systemd journal like the following:
Dec 10 21:13:26 gb-7983 kubelet[1386178]: E1210 21:13:26.048459 1386178 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/k8s-homelab_transmission-manager-27319165--1-n2xq8_0a0270b6-d650-453d-96db-2fbb7c43ba42: no such file or directory" pod="k8s-homelab/transmission-manager-27319165--1-n2xq8"
Dec 10 21:13:26 gb-7983 kubelet[1386178]: E1210 21:13:26.049381 1386178 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/k8s-homelab_transmission-manager-27319175--1-frnvv_4c1ba4d5-1098-4644-ad4e-15eaa6e76767: no such file or directory" pod="k8s-homelab/transmission-manager-27319175--1-frnvv"
Dec 10 21:13:26 gb-7983 kubelet[1386178]: E1210 21:13:26.050750 1386178 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/k8s-homelab_transmission-manager-27319170--1-k7dwv_7a94a981-25e0-4079-a60f-9ed00bb859bb: no such file or directory" pod="k8s-homelab/transmission-manager-27319170--1-k7dwv"
Those directories in /var/log don’t exist as they belong to old pods that have finished running.
These messages keep increasing in volume as time passes (at one point I had received almost 1,000 of them, after which I had to reboot the node to fix the issue). It looks like the kubelet is not cleaning up the old pods for some reason. I can't see these old pods when I run kubectl get pods -A, so I'm assuming there is some issue with the kubelet's cleanup process.
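A quick way to see the volume of these messages on the affected node (assuming the kubelet runs as the kubelet systemd unit):

```bash
# Count the errors emitted over the last hour
journalctl -u kubelet --since "1 hour ago" | grep -c "Unable to fetch pod log stats"
```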
What did you expect to happen?
kubelet (or cadvisor) shouldn’t try to read logs for pods that don’t exist.
How can we reproduce it (as minimally and precisely as possible)?
Set up a short-lived cronjob running every 5 minutes and observe the kubelet logs (Kubernetes v1.22.4, CRI-O v1.22.1, Ubuntu 20.04).
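A minimal way to set this up (hypothetical job name; any short-lived job will do):

```bash
# Hypothetical reproduction: a CronJob whose pods finish quickly
kubectl create cronjob log-stats-repro \
  --image=busybox:1.35 \
  --schedule="*/5 * * * *" \
  -- /bin/sh -c 'echo hello; sleep 10'

# Then follow the kubelet journal on the node that ran the job
journalctl -u kubelet -f | grep "Unable to fetch pod log stats"
```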
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:48:33Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider
Bare-metal cluster
OS version
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
$ uname -a
Linux gb-7983 5.11.0-41-generic #45~20.04.1-Ubuntu SMP Wed Nov 10 10:20:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 16
- Comments: 42 (12 by maintainers)
Commits related to this issue
- use /run/ symlink to work around string comparison As explained in https://github.com/kubernetes/kubernetes/issues/106957#issuecomment-1167388634 -- the CRI-O team found performance issues in the CRI... — committed to ocf/puppet by nikhiljha 2 years ago
- use /run/ symlink to work around string comparison (#1301) As explained in https://github.com/kubernetes/kubernetes/issues/106957#issuecomment-1167388634 -- the CRI-O team found performance issues in... — committed to ocf/puppet by nikhiljha 2 years ago
- remove hardcoded UsingLegacyCadvisorStats workaround for CRI-O This workaround was implemented due to performance issues, but has since caused a new issue https://github.com/kubernetes/kubernetes/iss... — committed to nikhiljha/kubernetes by nikhiljha 2 years ago
This would work for OKD, but wouldn’t be supported for OCP OOTB. https://github.com/kubernetes/kubernetes/pull/108855 should fix it, but it needs a reviewer/approver…
Fun kubernetes fact! The CRI-O team found performance issues with the CRI stats provider, so there’s a hardcoded check to see if the socket is /var/run/crio/crio.sock here. However, since /var/run is symlinked to /run, and the check is a string comparison, one can use /run/crio/crio.sock and not trigger that toggle, turning on the CRI stats provider. We’re working on fixing the perf troubles with CRI stats in KEP 2371.
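You can see on the node why the string comparison doesn't catch the alternate path; a quick check (standard paths assumed):

```bash
# /var/run is a symlink to /run on a typical modern distro, so both socket
# paths point at the same file even though they compare as different strings.
ls -ld /var/run                       # lrwxrwxrwx ... /var/run -> /run
readlink -f /var/run/crio/crio.sock   # /run/crio/crio.sock
readlink -f /run/crio/crio.sock       # /run/crio/crio.sock
```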
Another option is one could switch to the CRI stats provider by setting kubelet’s container_runtime_endpoint to unix:///run/crio/crio.sock instead of unix:///var/run/crio/crio.sock.
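As a rough sketch, on a kubeadm-provisioned node this usually means editing the kubelet's flags file and restarting the kubelet (the file location and flag layout below are the usual kubeadm defaults, not something confirmed in this issue):

```bash
# Swap the endpoint in /var/lib/kubelet/kubeadm-flags.env from
# unix:///var/run/crio/crio.sock to unix:///run/crio/crio.sock
sudo sed -i 's#unix:///var/run/crio/crio.sock#unix:///run/crio/crio.sock#' \
  /var/lib/kubelet/kubeadm-flags.env
sudo systemctl restart kubelet
```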
I can confirm this too, it seems that it happens with cgroups v2.
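To check which cgroup version a node is actually using (standard unified mount point assumed):

```bash
stat -fc %T /sys/fs/cgroup/
# "cgroup2fs" -> cgroups v2 (unified hierarchy); "tmpfs" -> legacy cgroups v1
```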
I’ve posted a quick gist with all the manifests I’m using to deploy a garbage collector daemon set. Nothing fancy, just a bit of patchwork to address the issues I was having. https://gist.github.com/aneagoe/6e18aaff48333ec059d0c1283b06813f
Yesterday we tested with the newly released version of crio v1.24.0 and the problem was still there. After reading @aneagoe’s answer I dug in a little, because we use systemd to manage cgroups and it seemed the error was on our side. I updated the kernel boot command line in the grub file /etc/default/grub: in order to have cgroups v2 unified behavior in systemd you have to set both options, systemd.unified_cgroup_hierarchy=true and cgroup_no_v1=all, and we only had systemd.unified_cgroup_hierarchy=true. I then regenerated the grub configuration for the changes to take effect and rebooted the node.
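For reference, a minimal sketch of that change on a stock Debian install (the existing GRUB_CMDLINE_LINUX contents below are an assumption; keep whatever options you already have):

```bash
# /etc/default/grub
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=true cgroup_no_v1=all"

# Regenerate the grub config and reboot so the new kernel command line applies
sudo update-grub
sudo reboot
```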
At first it seemed like this did the trick: I popped a job onto the node I had changed and the kubelet didn’t spam the logfile anymore. But as soon as I deleted the pod from the cluster, the kubelet started spamming the logs again with the same "Unable to fetch pod log stats" errors. So this did not resolve the issue. This was tested on a Debian 11 node running systemd v247, crio v1.22.3, and kubernetes v1.22.7.

Using unix:///run/crio/crio.sock doesn’t solve the problem in my environment; we’re still seeing the same error logs. We’re running: Flatcar 3033.2.4, Kubelet 1.22.4, CRI-O 1.22.5, cgroup v2 enabled.
The issue in the title happens on bare-metal nodes too.
Tested with Kubernetes v1.24.0 and crio v1.23.2 on a fresh cluster bootstrapped with kubeadm; all machines run Debian 11. Ran a job 3 times and there’s a log line for every execution, so it spams 3 log lines per kubelet loop.

Observing similar issues on 4.10.0-0.okd-2022-04-23-131357 (k8s and crio v1.22.1) and 4.9.0-0.okd-2022-02-12-140851 (k8s v1.23.5+9ce5071 and crio 1.23.2). The strange thing is that on one cluster only some nodes are affected, while on the other all of them are. Below is an overview of 30-minute occurrences of "Failed to update stats for container":

Cluster running version 4.10.0-0.okd-2022-04-23-131357:

Cluster running version 4.10.0-0.okd-2022-04-23-131357:

No idea where the discrepancy is coming from.
I’ve reproduced the same issue on my clusters.
I’m running Debian 11 on bare metal (homelab) and in a VM (test). My clusters are installed using kubeadm, and I’m using multus and calico as CNI.
The issue is triggered as soon as a pod is removed. I’ve triggered it by deleting a simple Pod, scaling down a Deployment, or running a CronJob. On the node where the pod was removed, the logs are then flooded with "Unable to fetch pod log stats" messages, at the rate of 1 line per removed pod every 10 seconds. I’ve seen it both on regular worker nodes and on untainted master nodes.
I’ve tested under Vagrant with various K8S / CRI-O versions.
With CRI-O 1.21.5, I never triggered the issue with Kubernetes 1.21.10, 1.22.7, or 1.23.4.
However, with CRI-O >= 1.22, all my tests triggered the issue:
- Kubernetes 1.22.1, CRI-O 1.22.1
- Kubernetes 1.22.7, CRI-O 1.22.2
- Kubernetes 1.23.4, CRI-O 1.23.1
Let me know if I can help by running other tests or capturing specific logs.
I have something similar on a fresh cluster I use for educational purposes, running Kubernetes 1.22.5 with CRI-O. Here’s an excerpt:
This is from the master node. Nodes run Debian bullseye in KVM, provisioned with Vagrant. I have not yet tested it with different versions. If I can assist in testing, I’d be happy to do so!