kubernetes: Kubelet spamming 'Unable to fetch pod log stats' log messages after running cronjobs

What happened?

I have a cronjob that runs a Python script every 5 minutes. After the cronjob finishes running, I start seeing a bunch of log messages in my systemd journal like the following:

Dec 10 21:13:26 gb-7983 kubelet[1386178]: E1210 21:13:26.048459 1386178 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/k8s-homelab_transmission-manager-27319165--1-n2xq8_0a0270b6-d650-453d-96db-2fbb7c43ba42: no such file or directory" pod="k8s-homelab/transmission-manager-27319165--1-n2xq8"
Dec 10 21:13:26 gb-7983 kubelet[1386178]: E1210 21:13:26.049381 1386178 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/k8s-homelab_transmission-manager-27319175--1-frnvv_4c1ba4d5-1098-4644-ad4e-15eaa6e76767: no such file or directory" pod="k8s-homelab/transmission-manager-27319175--1-frnvv"
Dec 10 21:13:26 gb-7983 kubelet[1386178]: E1210 21:13:26.050750 1386178 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/k8s-homelab_transmission-manager-27319170--1-k7dwv_7a94a981-25e0-4079-a60f-9ed00bb859bb: no such file or directory" pod="k8s-homelab/transmission-manager-27319170--1-k7dwv"

Those directories under /var/log/pods don’t exist anymore, as they belonged to old pods that have already finished running.

These messages keep increasing in volume as time passes (at one point I had received almost 1,000 of them, after which I had to reboot the node to fix the issue). It looks like the kubelet is not cleaning up the old pods for some reason. I can’t see these old pods when I run kubectl get pods -A, so I’m assuming there is some issue with the kubelet’s cleanup process.
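
As a rough sanity check, something along these lines confirms both the message rate and that the referenced directories really are gone (standard journalctl/ls usage; adjust the pod-name pattern to your own CronJob):

# count how many of these messages landed in the last 10 minutes
journalctl -u kubelet --since "10 minutes ago" | grep -c "Unable to fetch pod log stats"

# confirm the directories the kubelet complains about no longer exist
ls /var/log/pods/ | grep transmission-manager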

What did you expect to happen?

kubelet (or cadvisor) shouldn’t try to read logs for pods that don’t exist.

How can we reproduce it (as minimally and precisely as possible)?

Set up a short-lived CronJob that runs every 5 minutes and observe the kubelet logs (Kubernetes v1.22.4, CRI-O v1.22.1, Ubuntu 20.04).
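
For reference, a minimal CronJob created like this should be enough to trigger it (the name and image are placeholders, not from the original report):

# create a CronJob that runs a short-lived container every 5 minutes
kubectl create cronjob repro --image=busybox:1.35 --schedule="*/5 * * * *" -- sh -c "echo done; sleep 5"

# after a run or two has completed, follow the kubelet journal on the node
journalctl -u kubelet -f | grep "Unable to fetch pod log stats"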

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:48:33Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

Baremetal cluster

OS version

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

$ uname -a
Linux gb-7983 5.11.0-41-generic #45~20.04.1-Ubuntu SMP Wed Nov 10 10:20:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Install tools

kubeadm (baremetal cluster)

Container runtime (CRI) and version (if applicable)

CRI-O v1.22.1

Related plugins (CNI, CSI, …) and versions (if applicable)

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 16
  • Comments: 42 (12 by maintainers)

Most upvoted comments

Is this also an option for OKD/OCP? Is there any progress on a fix or should we look at the above workaround instead?

This would work for OKD, but wouldn’t be supported for OCP OOTB. https://github.com/kubernetes/kubernetes/pull/108855 should fix this, but it needs a reviewer/approver…

Out of curiosity, why does this work?

Fun Kubernetes fact! The CRI-O team found performance issues with the CRI stats provider, so there’s a hardcoded check to see if the socket is /var/run/crio/crio.sock. However, since /var/run is symlinked to /run and the check is a string comparison, one can use /run/crio/crio.sock and not trigger that toggle, turning on the CRI stats provider instead. We’re working on fixing the performance troubles with CRI stats in KEP 2371.
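
You can see why the string comparison misses on a typical node, where /var/run is just a symlink (a quick check, not from the original comment):

ls -ld /var/run                       # on most distros: /var/run -> /run
readlink -f /var/run/crio/crio.sock   # resolves to /run/crio/crio.sock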


Another option: one could switch to the CRI stats provider by setting the kubelet’s --container-runtime-endpoint to unix:///run/crio/crio.sock instead of unix:///var/run/crio/crio.sock.
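
On a kubeadm-provisioned node that would typically look something like the following (the file location assumes the default kubeadm layout; treat it as a sketch for your own setup):

# /var/lib/kubelet/kubeadm-flags.env -- point the kubelet at the /run path
# (keep whatever other flags kubeadm already put in KUBELET_KUBEADM_ARGS)
KUBELET_KUBEADM_ARGS="--container-runtime-endpoint=unix:///run/crio/crio.sock"

# pick up the change
systemctl restart kubelet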

I can confirm this too; it seems to happen with cgroups v2.

I’ve posted a quick gist with all the manifests I’m using to deploy a garbage collector daemon set. Nothing fancy, just a bit of patchwork to address the issues I was having. https://gist.github.com/aneagoe/6e18aaff48333ec059d0c1283b06813f

Yesterday we tested with the newly released CRI-O v1.24.0 and the problem was still there.

After reading @aneagoe’s answer I dug a little deeper, because we use systemd to manage cgroups and it seemed the error was on our side. I updated the kernel boot command line in /etc/default/grub: to get unified cgroup v2 behavior in systemd you have to set both systemd.unified_cgroup_hierarchy=true and cgroup_no_v1=all, and we only had systemd.unified_cgroup_hierarchy=true:

GRUB_CMDLINE_LINUX_DEFAULT="... systemd.unified_cgroup_hierarchy=true cgroup_no_v1=all ..."

I then ran the following command for the changes to take effect:

update-grub

And then rebooted the node.
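
As a sanity check after the reboot (my addition, not part of the original comment), you can confirm the node really came back up with a pure cgroup v2 hierarchy:

stat -fc %T /sys/fs/cgroup   # prints "cgroup2fs" on a unified (v2-only) host
cat /proc/cmdline            # should now contain cgroup_no_v1=all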

At first it seemed like this did the trick: I ran a job on the node I had changed and the kubelet didn’t spam the log file anymore. But as soon as I deleted the pod from the cluster, the kubelet started spamming the logs again with:

E0512 08:19:31.126396     837 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/default_pi--1-6v5sp_b77b46c1-db08-4a0b-a32a-c848a8140d9a: no such file or directory" pod="default/pi--1-6v5sp"
E0512 08:19:31.126499     837 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/default_pi--1-6v5sp_b77b46c1-db08-4a0b-a32a-c848a8140d9a: no such file or directory" pod="default/pi--1-6v5sp"
W0512 08:19:33.972387     837 container.go:586] Failed to update stats for container "/pids/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podb77b46c1_db08_4a0b_a32a_c848a8140d9a.slice/crio-95c2bc9032e318fc8b378d0357c730291670f82d3de74e1d37e34127c30de291.scope": unable to determine device info for dir: /var/lib/containers/storage/overlay/ba4f383e39d68740731c1b7b0d9214c1d4451d8c2d210f0f77c5840eef1adbae/diff: stat failed on /var/lib/containers/storage/overlay/ba4f383e39d68740731c1b7b0d9214c1d4451d8c2d210f0f77c5840eef1adbae/diff with error: no such file or directory, continuing to push stats
E0512 08:19:36.122734     837 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/default_pi--1-6v5sp_b77b46c1-db08-4a0b-a32a-c848a8140d9a: no such file or directory" pod="default/pi--1-6v5sp"
E0512 08:19:36.123600     837 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/default_pi--1-6v5sp_b77b46c1-db08-4a0b-a32a-c848a8140d9a: no such file or directory" pod="default/pi--1-6v5sp"

So this did not resolve the issue. This was tested on a Debian 11 node running systemd v247, crio v1.22.3, and kubernetes v1.22.7.

Using unix:///run/crio/crio.sock doesn’t solve the problem in my environment; we’re still seeing the same error logs. We’re running Flatcar 3033.2.4, kubelet 1.22.4, CRI-O 1.22.5, with cgroup v2 enabled.

W0810 11:05:50.707417 3143935 container.go:586] Failed to update stats for container "/pids/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6c2c7965_7e29_4ed4_8e7f_b66cfaca4c22.slice/crio-14c3ee4d1d08de5a88a28dc24c9189977cf870c03551e1502bccf9d07fc46f45.scope": unable to determine device info for dir: /var/lib/containers/storage/overlay/f5aa147151a03909ac056604b64cfa7776874443a855d9c4eeb6e4c70cd10e39/diff: stat failed on /var/lib/containers/storage/overlay/f5aa147151a03909ac056604b64cfa7776874443a855d9c4eeb6e4c70cd10e39/diff with error: no such file or directory, continuing to push stats

Here’s another really interesting bit of information. This issue seems to affect only VMs and not bare-metal nodes. It’s completely puzzling, but I have 12 bare-metal nodes and 1 VM node in one cluster (the VM is just for testing) and it manifests only on the VM node. Likewise, my test cluster is made up entirely of VMs (i.e. no bare metal) and there it happens across the board. The bare-metal hosts where the VMs are running are the same hardware as the bare-metal nodes of my main cluster (just with more NVMe drives).

The issue in the title happens on bare-metal nodes too.

Tested with Kubernetes v1.24.0 and CRI-O v1.23.2 on a fresh cluster bootstrapped with kubeadm; all machines run Debian 11. I ran a job 3 times and there’s a log line for every execution, so it spams 3 log lines per kubelet loop.

Observing similar issues on 4.10.0-0.okd-2022-04-23-131357 (k8s and CRI-O v1.22.1) and 4.9.0-0.okd-2022-02-12-140851 (k8s v1.23.5+9ce5071 and CRI-O 1.23.2). The strange thing is that on one cluster only some nodes are affected, while on the other all of them are. Below is an overview of Failed to update stats for container occurrences over 30 minutes. Cluster running version 4.10.0-0.okd-2022-04-23-131357:

worker-01: 0
worker-02: 39
worker-03: 0
worker-04: 3
worker-05: 0
worker-06: 0
worker-07: 0
worker-08: 0
worker-09: 3
worker-10: 2
worker-11: 0
worker-12: 13519

Cluster running version 4.10.0-0.okd-2022-04-23-131357:

worker-01: 13419
worker-02: 24689

No idea where the discrepancy is coming from.

I’ve reproduced the same issue on my clusters.

I’m running Debian 11 on bare metal (homelab) and in VMs (test). My clusters are installed using kubeadm, and I’m using Multus and Calico as CNI plugins.

The issue is triggered as soon as a pod is removed. I’ve triggered it by deleting a simple Pod, scaling down a Deployment, or running a CronJob (see the sketch below). On the node where the pod was removed, the logs are then flooded with Unable to fetch pod log stats messages, at a rate of one line per removed pod every 10 seconds.
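
For anyone wanting to reproduce the trigger without a CronJob, a minimal sequence like this should do (my own sketch; the pod name is a placeholder):

# create a short-lived pod, let it finish, then remove it
kubectl run repro-pod --image=busybox:1.35 --restart=Never -- sleep 5
sleep 30                      # give the pod time to complete
kubectl delete pod repro-pod

# then watch the node's kubelet journal fill up
journalctl -u kubelet -f | grep "Unable to fetch pod log stats"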

I’ve seen it both on regular worker nodes and on untainted master nodes.

I’ve tested under Vagrant with various K8S / CRI-O versions.

With CRI-O 1.21.5, I never triggered the issue with Kubernetes 1.21.10, 1.22.7, or 1.23.4.

However, with CRI-O >= 1.22, all my tests triggered the issue:

  • Kubernetes 1.22.1, CRI-O 1.22.1
  • Kubernetes 1.22.7, CRI-O 1.22.2
  • Kubernetes 1.23.4, CRI-O 1.23.1

Let me know if I can help by running other tests or capturing specific logs.

I have something similar on a fresh cluster I use for educational purposes, running Kubernetes 1.22.5 with CRI-O. Here’s an excerpt:

Dec 28 19:14:31 bullseye kubelet[26386]: E1228 20:14:31.603845   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-fbnxf_001de348-1461-4890-8b87-5cac786d0d91: no such file or directory" pod="kube-system/coredns-78fcd69978-fbnxf"
Dec 28 19:14:31 bullseye kubelet[26386]: E1228 20:14:31.603884   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-zsrx5_93903455-a9a8-4aa8-8efc-93fb3bffc5d3: no such file or directory" pod="kube-system/coredns-78fcd69978-zsrx5"
Dec 28 19:14:41 bullseye kubelet[26386]: E1228 20:14:41.611317   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-zsrx5_93903455-a9a8-4aa8-8efc-93fb3bffc5d3: no such file or directory" pod="kube-system/coredns-78fcd69978-zsrx5"
Dec 28 19:14:41 bullseye kubelet[26386]: E1228 20:14:41.611538   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-fbnxf_001de348-1461-4890-8b87-5cac786d0d91: no such file or directory" pod="kube-system/coredns-78fcd69978-fbnxf"
Dec 28 19:14:51 bullseye kubelet[26386]: E1228 20:14:51.626880   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-zsrx5_93903455-a9a8-4aa8-8efc-93fb3bffc5d3: no such file or directory" pod="kube-system/coredns-78fcd69978-zsrx5"
Dec 28 19:14:51 bullseye kubelet[26386]: E1228 20:14:51.627030   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-fbnxf_001de348-1461-4890-8b87-5cac786d0d91: no such file or directory" pod="kube-system/coredns-78fcd69978-fbnxf"
Dec 28 19:15:01 bullseye kubelet[26386]: E1228 20:15:01.642047   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-fbnxf_001de348-1461-4890-8b87-5cac786d0d91: no such file or directory" pod="kube-system/coredns-78fcd69978-fbnxf"
Dec 28 19:15:01 bullseye kubelet[26386]: E1228 20:15:01.643417   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-zsrx5_93903455-a9a8-4aa8-8efc-93fb3bffc5d3: no such file or directory" pod="kube-system/coredns-78fcd69978-zsrx5"
Dec 28 19:15:11 bullseye kubelet[26386]: E1228 20:15:11.661392   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-fbnxf_001de348-1461-4890-8b87-5cac786d0d91: no such file or directory" pod="kube-system/coredns-78fcd69978-fbnxf"
Dec 28 19:15:11 bullseye kubelet[26386]: E1228 20:15:11.661681   26386 cadvisor_stats_provider.go:147] "Unable to fetch pod log stats" err="open /var/log/pods/kube-system_coredns-78fcd69978-zsrx5_93903455-a9a8-4aa8-8efc-93fb3bffc5d3: no such file or directory" pod="kube-system/coredns-78fcd69978-zsrx5"

This is from the master node. Nodes run Debian bullseye in KVM, provisioned with Vagrant. I have not yet tested it with different versions. If I can assist in testing, I’d be happy to do so!