cri-o: No metrics reported from CRI-O: Unable to account working set stats: total_inactive_file > memory usage

What happened?

On a 3-node cluster deployed with kubeadm and using CRI-O, a node stops reporting any metrics once some amount of workload has been scheduled onto it (CPU/memory usage is 0% or <unknown> in kubectl top nodes). In the system journal for CRI-O on the affected nodes, I can see these warnings being emitted every few seconds:

Mar 18 16:09:13 ram.fuwafuwatime.moe crio[72618]: time="2023-03-18 16:09:13.067279194-04:00" level=warning msg="Unable to account working set stats: total_inactive_file (3827683328) > memory usage (3233787904)"
Mar 18 16:09:18 ram.fuwafuwatime.moe crio[72618]: time="2023-03-18 16:09:18.919610520-04:00" level=warning msg="Unable to account working set stats: total_inactive_file (3827683328) > memory usage (3233787904)"
Mar 18 16:09:24 ram.fuwafuwatime.moe crio[72618]: time="2023-03-18 16:09:24.932763766-04:00" level=warning msg="Unable to account working set stats: total_inactive_file (3827683328) > memory usage (3233746944)"
Mar 18 16:09:31 ram.fuwafuwatime.moe crio[72618]: time="2023-03-18 16:09:31.687808594-04:00" level=warning msg="Unable to account working set stats: total_inactive_file (3827683328) > memory usage (3233759232)"
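
For context, the warning reads like a guard around the working set calculation, which is conventionally memory usage minus total_inactive_file. The sketch below only illustrates that arithmetic with the values from the first log line. workingSet is a hypothetical helper written for this report, not CRI-O's actual code, and what CRI-O does with the failing case internally is not shown here.

```go
package main

import "fmt"

// workingSet is a hypothetical helper (not taken from CRI-O) that shows the
// conventional calculation: working set = memory usage - total_inactive_file.
// Both inputs are unsigned, so subtracting a larger inactive_file value would
// wrap around to a huge number; the guard rejects that case instead.
func workingSet(usage, inactiveFile uint64) (uint64, error) {
	if inactiveFile > usage {
		return 0, fmt.Errorf("total_inactive_file (%d) > memory usage (%d)", inactiveFile, usage)
	}
	return usage - inactiveFile, nil
}

func main() {
	// Values copied from the first warning in the journal output above.
	ws, err := workingSet(3233787904, 3827683328)
	if err != nil {
		fmt.Println("unable to account working set stats:", err)
		return
	}
	fmt.Println("working set bytes:", ws)
}
```

With the logged values, total_inactive_file exceeds usage by roughly 566 MiB, so the guard trips on every sample.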

If I execute kubectl top nodes on the cluster, I get this output:

NAME                         CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%     
frederica.fuwafuwatime.moe   1867m        23%         17545Mi         55%         
ram.fuwafuwatime.moe         <unknown>    <unknown>   <unknown>       <unknown>   
rem.fuwafuwatime.moe         <unknown>    <unknown>   <unknown>       <unknown>

And in k9s, the pods running on these nodes report no CPU/memory usage (0%).

What did you expect to happen?

Metrics should be visible from these nodes, regardless of workload.

How can we reproduce it (as minimally and precisely as possible)?

This is what I have done:

1. Deploy a 3-node Kubernetes cluster using kubeadm and CRI-O (one “master” node, 2 worker nodes, all 3 running control plane pods).
2. Deploy Calico.
3. Deploy metrics-server.
4. Run some workloads on the cluster.

Most of the time, this issue seems to occur on nodes with roughly 50% memory utilization or more, but I haven’t found a reliable reproducer yet.

Anything else we need to know?

SELinux is enabled on the hosts in this cluster using Gentoo’s fork of refpolicy, but I have been able to reproduce this issue even when SELinux is in permissive mode.

CRI-O and Kubernetes version

$ crio --version
crio version 1.26.0
Version:        1.26.0
GitCommit:      unknown
GitCommitDate:  unknown
GitTreeState:   clean
BuildDate:      2022-12-25T22:53:52Z
GoVersion:      go1.19.4
Compiler:       gc
Platform:       linux/amd64
Linkmode:       dynamic
BuildTags:      
  containers_image_ostree_stub
  exclude_graphdriver_btrfs
  btrfs_noversion
  containers_image_openpgp
  seccomp
  selinux
LDFlags:          -s -w -X github.com/cri-o/cri-o/internal/pkg/criocli.DefaultsPath="" -X github.com/cri-o/cri-o/internal/version.buildDate=2022-12-25T22:53:52Z 
SeccompEnabled:   true
AppArmorEnabled:  false
Dependencies:

$ kubectl version -o yaml
clientVersion:
  buildDate: "2023-03-10T00:50:26Z"
  compiler: gc
  gitCommit: fc04e732bb3e7198d2fa44efa5457c7c6f8c0f5b
  gitTreeState: archive
  gitVersion: v1.26.2
  goVersion: go1.20.1
  major: "1"
  minor: "26"
  platform: linux/amd64
kustomizeVersion: v4.5.7
serverVersion:
  buildDate: "2023-02-22T13:32:22Z"
  compiler: gc
  gitCommit: fc04e732bb3e7198d2fa44efa5457c7c6f8c0f5b
  gitTreeState: clean
  gitVersion: v1.26.2
  goVersion: go1.19.6
  major: "1"
  minor: "26"
  platform: linux/amd64

OS version

# On Linux:
$ cat /etc/os-release
NAME=Gentoo
ID=gentoo
PRETTY_NAME="Gentoo Linux"
ANSI_COLOR="1;32"
HOME_URL="https://www.gentoo.org/"
SUPPORT_URL="https://www.gentoo.org/support/"
BUG_REPORT_URL="https://bugs.gentoo.org/"
VERSION_ID="2.13"
$ uname -a
Linux ram.fuwafuwatime.moe 6.1.15-gentoo-hardened1 #1 SMP Tue Mar  7 19:50:39 EST 2023 x86_64 AMD EPYC 7313 16-Core Processor AuthenticAMD GNU/Linux

Additional environment details (AWS, VirtualBox, physical, etc.)

Bare metal nodes

About this issue

State: open
Created a year ago
Comments: 24 (10 by maintainers)

Most upvoted comments

Yeah, this is a frequent source of confusion (and a piece of code I have issues with). The kubelet does a precise string match, not a path match, so when the path is set to /run/crio/crio.sock, the kubelet chooses the CRI stats provider, which hits this problem.

haircommander on Jun 22, 2023
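
The difference between a string match and a path match described in the comment above can be sketched as follows. This is a minimal illustration, not the kubelet's code: usesCadvisorStats and legacySocket are hypothetical names, and the literal /var/run/crio/crio.sock is an assumption about which spelling the kubelet compares against. The comment only confirms that /run/crio/crio.sock does not match.

```go
package main

import "fmt"

// legacySocket is an ASSUMED value for the literal the kubelet compares
// against; the source comment does not state it explicitly.
const legacySocket = "/var/run/crio/crio.sock"

// usesCadvisorStats is a hypothetical stand-in for the kubelet's check.
// It compares raw strings, so it never resolves symlinks: /run/... and
// /var/run/... are treated as different endpoints even when /var/run is
// a symlink to /run on the host.
func usesCadvisorStats(endpoint string) bool {
	return endpoint == legacySocket || endpoint == "unix://"+legacySocket
}

func main() {
	endpoints := []string{
		"unix:///var/run/crio/crio.sock",
		"/var/run/crio/crio.sock",
		"unix:///run/crio/crio.sock",
		"/run/crio/crio.sock",
	}
	for _, ep := range endpoints {
		provider := "CRI stats provider"
		if usesCadvisorStats(ep) {
			provider = "cadvisor stats provider"
		}
		fmt.Printf("%-32s -> %s\n", ep, provider)
	}
}
```

Under these assumptions, only the two /var/run spellings select the cadvisor path. The /run spellings fall through to the CRI stats provider, which is the code path that hits the warning in this issue.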