rancher: High system load alert fails to trigger on 1.19.2

What kind of request is this (question/bug/enhancement/feature request): Bug

Steps to reproduce (fewest steps possible):

  1. Create a two-node k3s v1.19.2 cluster. I used two VMs with 2 vCPU and 8 GB RAM each.
  2. Import the cluster into Rancher.
  3. Enable monitoring
  4. Create a Notifier
  5. Edit the alert group for the High CPU Load alert rule and add the notifier.
  6. Create a workload that causes high CPU load on the cluster. I used the Folding at Home app from the Rancher catalog and set the CPU limit to 1500m.
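
If the catalog app isn't handy, any CPU-hungry workload will do. A minimal sketch using busybox busy loops (pod names and count are illustrative; run enough pods to push load above the node's core count):

kubectl run cpu-burn-1 --image=busybox --restart=Never -- /bin/sh -c 'while :; do :; done'
kubectl run cpu-burn-2 --image=busybox --restart=Never -- /bin/sh -c 'while :; do :; done'

Each busy loop pins roughly one core.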

Result: The alert is never triggered, even though monitoring graphs showed CPU load well above 1.

Other details that may be helpful: I dug a little more into this and saw that k3s 1.18 clusters export the machine_cpu_cores metric, but k3s 1.19 clusters do not.

(Screenshots attached for k3s 1.18.8 and k3s 1.19.2.)
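
For context on why the missing metric matters: the High CPU Load rule presumably divides node load by machine_cpu_cores (the exact expression may differ), so once that series disappears the expression returns no samples and the alert can never cross its threshold. A hedged way to check against the cluster Prometheus (service and namespace names are the monitoring v1 defaults and may differ in your install):

kubectl -n cattle-prometheus port-forward svc/prometheus-operated 9090 &
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=sum(node_load1) / sum(machine_cpu_cores)'

An empty "result" array in the response is consistent with an alert expression that can never fire.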

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): v2.4.8
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Imported

  • Machine type (cloud/VM/metal) and specifications (CPU/memory): AWS EC2 m5a.large (2vCPU, 8GB RAM)

  • Kubernetes version (use kubectl version):

> kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.6", GitCommit:"d32e40e20d167e103faf894261614c5b45c44198", GitTreeState:"clean", BuildDate:"2020-05-20T13:16:24Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2+k3s1", GitCommit:"d38505b124c92bffd45f6e0654adb9371cae9610", GitTreeState:"clean", BuildDate:"2020-09-21T17:00:07Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version (use docker version): NA, containerd 1.4.0

gzrancher/rancher#12448

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 23 (19 by maintainers)

Most upvoted comments

I commented on the cadvisor PR mentioned above, and it looks like cadvisor itself is still publishing that data point. So maybe upstream Kubernetes is the next area to investigate?

@davidnuzik can you ask QA to attempt a repro on some other 1.19 distribution (I believe AKS has it available)? This is lower priority for QA than any release-related work, but Rancher QA might have automation that can easily spin up such a cluster.

I poked around at this, because there seemed to be some contention around who should “own” this issue.

As @erikwilson stated, this is likely an upstream issue caused by the bump from cadvisor 0.35 to 0.37 in Kubernetes 1.19.

If you hit a node's cadvisor endpoint through kubectl proxy, you can see clearly that this data point is no longer published:

v1.18 k3s (k3d) cluster:

curl -s http://localhost:8001/api/v1/nodes/k3d-k3s-default-server-0/proxy/metrics/cadvisor | grep machine_cpu_cores
# HELP machine_cpu_cores Number of CPU cores on the machine.
# TYPE machine_cpu_cores gauge
machine_cpu_cores 6

v1.19 k3s (k3d) cluster:

curl -s http://localhost:8003/api/v1/nodes/k3d-new-server-0/proxy/metrics/cadvisor | grep machine_cpu_cores
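(no output: machine_cpu_cores is not present on the 1.19 cadvisor endpoint)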

A not-insignificant refactor of this code happened in cadvisor in this PR: https://github.com/google/cadvisor/pull/2444/files. I wonder (without any real evidence beyond the fact that this PR touched that area) whether it could be the cause.

@dnoland1 - I am fairly convinced that this is an upstream issue, but I think we’ll need help from upstream to even prove that.

@erikwilson can you please open an issue in upstream Kubernetes or upstream cadvisor, or both, for this? I think you might get a quicker response if you opened it in cadvisor with steps to repro that use k3d and show how to observe the problem via kubectl proxy, but I'll leave it to your judgment.
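
For reference, a minimal k3d-based repro along those lines might look like the following (cluster names and image tags are illustrative; node names follow k3d's k3d-<cluster>-server-0 convention):

k3d cluster create cad118 --image rancher/k3s:v1.18.8-k3s1
kubectl --context k3d-cad118 proxy --port=8001 &
curl -s http://localhost:8001/api/v1/nodes/k3d-cad118-server-0/proxy/metrics/cadvisor | grep machine_cpu_cores

k3d cluster create cad119 --image rancher/k3s:v1.19.2-k3s1
kubectl --context k3d-cad119 proxy --port=8003 &
curl -s http://localhost:8003/api/v1/nodes/k3d-cad119-server-0/proxy/metrics/cadvisor | grep machine_cpu_cores

The first grep should print the metric and the second should print nothing.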

I am going to move this issue to the k3s repo. It probably doesn’t belong in the charts repo.