cri-o: No metrics reported from CRI-O: Unable to account working set stats: total_inactive_file > memory usage

What happened?

On a 3-node cluster deployed with kubeadm and using CRI-O, once some amount of workload is scheduled onto a node, that node stops reporting any metrics (CPU/memory usage shows as 0% or <unknown> in kubectl top nodes). On the affected nodes, the system journal for CRI-O shows these warnings being emitted every few seconds:

Mar 18 16:09:13 ram.fuwafuwatime.moe crio[72618]: time="2023-03-18 16:09:13.067279194-04:00" level=warning msg="Unable to account working set stats: total_inactive_file (3827683328) > memory usage (3233787904)"
Mar 18 16:09:18 ram.fuwafuwatime.moe crio[72618]: time="2023-03-18 16:09:18.919610520-04:00" level=warning msg="Unable to account working set stats: total_inactive_file (3827683328) > memory usage (3233787904)"
Mar 18 16:09:24 ram.fuwafuwatime.moe crio[72618]: time="2023-03-18 16:09:24.932763766-04:00" level=warning msg="Unable to account working set stats: total_inactive_file (3827683328) > memory usage (3233746944)"
Mar 18 16:09:31 ram.fuwafuwatime.moe crio[72618]: time="2023-03-18 16:09:31.687808594-04:00" level=warning msg="Unable to account working set stats: total_inactive_file (3827683328) > memory usage (3233759232)"
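
To make the numbers in these warnings easier to read, here is a minimal sketch that pulls both values out of a log line and applies the working set rule the message implies. It assumes working set = memory usage - total_inactive_file, which is the cAdvisor-style definition and not something confirmed from the CRI-O source here; `parse_warning` and `working_set` are hypothetical helper names.

```python
import re

# Matches the warning text exactly as it appears in the journal lines above.
WARNING_RE = re.compile(
    r"total_inactive_file \((\d+)\) > memory usage \((\d+)\)"
)


def parse_warning(line):
    """Return (total_inactive_file, memory_usage) in bytes, or None if no match."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def working_set(usage, inactive_file):
    """Assumed rule: working set = usage - inactive file cache.

    Returns None when the subtraction would go negative, which is the
    condition the warning describes.
    """
    if inactive_file > usage:
        return None
    return usage - inactive_file


line = ('level=warning msg="Unable to account working set stats: '
        'total_inactive_file (3827683328) > memory usage (3233787904)"')
inactive, usage = parse_warning(line)

print(working_set(usage, inactive))  # None: no valid working set
print(inactive - usage)              # 593895424 bytes of overshoot
```

With the values from the first log line, the inactive file cache exceeds reported usage by 593895424 bytes, so under this assumed rule there is no non-negative working set to report.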

If I execute kubectl top nodes on the cluster, I get this output:

NAME                         CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%     
frederica.fuwafuwatime.moe   1867m        23%         17545Mi         55%         
ram.fuwafuwatime.moe         <unknown>    <unknown>   <unknown>       <unknown>   
rem.fuwafuwatime.moe         <unknown>    <unknown>   <unknown>       <unknown>   

And in k9s, the pods running on these nodes report no CPU/memory usage (0%).

What did you expect to happen?

Metrics should be visible from these nodes, regardless of workload.

How can we reproduce it (as minimally and precisely as possible)?

This is what I have done:

  1. Deploy a 3-node Kubernetes cluster using kubeadm and CRI-O (one “master” node, 2 worker nodes, all 3 running control plane pods).
  2. Deploy Calico.
  3. Deploy metrics-server.
  4. Run some workloads on the cluster.

This issue seems to occur mostly on nodes with roughly 50% memory utilization or more, but I haven’t found a reliable reproducer yet.

Anything else we need to know?

SELinux is enabled on the hosts in this cluster using Gentoo’s fork of refpolicy, but I have been able to reproduce this issue even when SELinux is in permissive mode.

CRI-O and Kubernetes version

$ crio --version
crio version 1.26.0
Version:        1.26.0
GitCommit:      unknown
GitCommitDate:  unknown
GitTreeState:   clean
BuildDate:      2022-12-25T22:53:52Z
GoVersion:      go1.19.4
Compiler:       gc
Platform:       linux/amd64
Linkmode:       dynamic
BuildTags:      
  containers_image_ostree_stub
  exclude_graphdriver_btrfs
  btrfs_noversion
  containers_image_openpgp
  seccomp
  selinux
LDFlags:          -s -w -X github.com/cri-o/cri-o/internal/pkg/criocli.DefaultsPath="" -X github.com/cri-o/cri-o/internal/version.buildDate=2022-12-25T22:53:52Z 
SeccompEnabled:   true
AppArmorEnabled:  false
Dependencies:     
  
$ kubectl version -o yaml
clientVersion:
  buildDate: "2023-03-10T00:50:26Z"
  compiler: gc
  gitCommit: fc04e732bb3e7198d2fa44efa5457c7c6f8c0f5b
  gitTreeState: archive
  gitVersion: v1.26.2
  goVersion: go1.20.1
  major: "1"
  minor: "26"
  platform: linux/amd64
kustomizeVersion: v4.5.7
serverVersion:
  buildDate: "2023-02-22T13:32:22Z"
  compiler: gc
  gitCommit: fc04e732bb3e7198d2fa44efa5457c7c6f8c0f5b
  gitTreeState: clean
  gitVersion: v1.26.2
  goVersion: go1.19.6
  major: "1"
  minor: "26"
  platform: linux/amd64

OS version

# On Linux:
$ cat /etc/os-release
NAME=Gentoo
ID=gentoo
PRETTY_NAME="Gentoo Linux"
ANSI_COLOR="1;32"
HOME_URL="https://www.gentoo.org/"
SUPPORT_URL="https://www.gentoo.org/support/"
BUG_REPORT_URL="https://bugs.gentoo.org/"
VERSION_ID="2.13"
$ uname -a
Linux ram.fuwafuwatime.moe 6.1.15-gentoo-hardened1 #1 SMP Tue Mar  7 19:50:39 EST 2023 x86_64 AMD EPYC 7313 16-Core Processor AuthenticAMD GNU/Linux

Additional environment details (AWS, VirtualBox, physical, etc.)

Bare metal nodes

About this issue

  • State: open
  • Created a year ago
  • Comments: 24 (10 by maintainers)

Most upvoted comments

Yeah, this is a frequent source of confusion (and a piece of code I have issues with). The kubelet does an exact string match, not a path match, so when the path is set to /run/crio/crio.sock, the kubelet chooses the CRI stats provider, which hits this problem.
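
To illustrate the difference between a string match and a path match, here is a rough sketch. It is not the kubelet's code. The expected endpoint string below is an assumption (the comment only says that /run/crio/crio.sock does not match), and `normalize` is a hypothetical helper that assumes /var/run is a symlink to /run on the host, as is common on current distributions.

```python
import posixpath

# Assumption: the string the kubelet compares against. Not stated in the comment.
ASSUMED_EXPECTED_ENDPOINT = "unix:///var/run/crio/crio.sock"


def exact_string_match(endpoint):
    """Character-for-character comparison, as the comment describes."""
    return endpoint == ASSUMED_EXPECTED_ENDPOINT


def normalize(endpoint):
    """Hypothetical path-aware normalization.

    Drops an optional unix:// scheme and treats /var/run/... as /run/...,
    on the assumption that /var/run is a symlink to /run.
    """
    scheme = "unix://"
    path = endpoint[len(scheme):] if endpoint.startswith(scheme) else endpoint
    path = posixpath.normpath(path)
    if path.startswith("/var/run/"):
        path = path[len("/var"):]
    return path


def path_match(endpoint):
    """Compare endpoints after normalization instead of as raw strings."""
    return normalize(endpoint) == normalize(ASSUMED_EXPECTED_ENDPOINT)


for candidate in ("unix:///var/run/crio/crio.sock", "unix:///run/crio/crio.sock"):
    print(candidate, exact_string_match(candidate), path_match(candidate))
```

Both endpoints refer to the same socket under the symlink assumption, but only the first passes the exact string comparison. That is the kind of mismatch that would send the kubelet down the CRI stats provider path.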