rancher: High system load alert fails to trigger on 1.19.2
What kind of request is this (question/bug/enhancement/feature request): Bug
Steps to reproduce (least amount of steps as possible):
- Create a two-node k3s v1.19.2 cluster. I used 2 VMs with 2 vCPU and 8 GB RAM each.
- Import the cluster into Rancher.
- Enable monitoring
- Create a Notifier
- Edit the alert group for the High CPU Load alert rule and add the notifier
- Create a workload that causes high cpu load on the cluster. I used the Folding at Home App in the Rancher catalog and set the CPU limit to 1500m
Result: The alert never triggers, even though the monitoring graphs show CPU load well above 1.
Other details that may be helpful:
Dug a little more into this and saw that k3s 1.18 clusters are exporting the machine_cpu_cores metric but k3s 1.19 clusters are not.
[metric output screenshots for k3s 1.18.8 and k3s 1.19.2 omitted]
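For anyone wanting to confirm this from the cluster Prometheus rather than the graphs, here is a minimal sketch; the namespace and service names are assumptions based on Rancher monitoring v1 defaults and may differ on your install:

```
# Sketch only: cattle-prometheus / access-prometheus are assumed defaults.
kubectl -n cattle-prometheus port-forward svc/access-prometheus 9090:80 &

# On the affected 1.19 cluster this returns an empty result vector. An alert
# expression that depends on machine_cpu_cores therefore produces no samples
# at all (rather than a high value), so the alert can never fire.
curl -s 'http://127.0.0.1:9090/api/v1/query?query=machine_cpu_cores'
```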
Environment information
- Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): v2.4.8
- Installation option (single install/HA): HA
Cluster information
- Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Imported
- Machine type (cloud/VM/metal) and specifications (CPU/memory): AWS EC2 m5a.large (2 vCPU, 8 GB RAM)
- Kubernetes version (use `kubectl version`):
> kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.6", GitCommit:"d32e40e20d167e103faf894261614c5b45c44198", GitTreeState:"clean", BuildDate:"2020-05-20T13:16:24Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2+k3s1", GitCommit:"d38505b124c92bffd45f6e0654adb9371cae9610", GitTreeState:"clean", BuildDate:"2020-09-21T17:00:07Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
- Docker version (use `docker version`): N/A, containerd 1.4.0
gzrancher/rancher#12448
Commits related to this issue
- Fix for machine_cpu_cores being removed from cadvisor metrics. See also https://github.com/rancher/rancher/issues/29292 — committed to rancher/rancher by dnoland1 4 years ago
I commented on the cAdvisor PR mentioned, and it looks like cAdvisor itself is still publishing that data point. So maybe upstream k8s is the next area to investigate?
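As a sanity check, standalone cAdvisor can be run and its metrics grepped directly. A sketch, assuming the gcr.io/cadvisor/cadvisor:v0.37.0 image and the volume mounts suggested by cAdvisor's README:

```
# Run standalone cAdvisor v0.37.0 (the version vendored into k8s 1.19).
docker run -d --name=cadvisor -p 8080:8080 \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:ro \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor:v0.37.0

# If this prints the metric, the drop is in the kubelet's embedded cAdvisor
# wiring rather than in cAdvisor itself.
curl -s http://localhost:8080/metrics | grep machine_cpu_cores
```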
@davidnuzik, can you ask QA to attempt a repro on some other 1.19 distribution (I believe AKS has it available)? This is lower priority for QA than any release-related work, but Rancher QA might have automation that can easily spin up such a cluster.
I poked around at this because there seemed to be some contention over who should “own” this issue.
As @erikwilson stated, this is likely an upstream issue caused by the bump from cAdvisor 0.35 to 0.37 in Kubernetes 1.19.
If you proxy to your host via `kubectl proxy`, you can clearly see that the cAdvisor stats no longer publish this data point:
v1.18 k3s (k3d) cluster: [metrics output omitted]
v1.19 k3s (k3d) cluster: [metrics output omitted]
A not-insignificant refactor of this code happened in cAdvisor in this PR: https://github.com/google/cadvisor/pull/2444/files. I wonder (without any real evidence beyond the fact that this PR touched that area) whether it could be the cause.
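One cheap way to test that hunch would be to compare the metric's registration between the two vendored versions. A sketch; both tags exist in the cAdvisor repo, but treat the grep as a starting point rather than proof:

```
# Compare where machine_cpu_cores is defined/registered at the cAdvisor
# versions vendored into k8s 1.18 (v0.35.0) and k8s 1.19 (v0.37.0).
git clone https://github.com/google/cadvisor && cd cadvisor
git grep -n 'machine_cpu_cores' v0.35.0 v0.37.0
```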
@dnoland1 - I am fairly convinced that this is an upstream issue, but I think we’ll need help from upstream to even prove that.
@erikwilson, can you please open an issue in upstream k8s or upstream cAdvisor (or both) for this? I think you might get a quicker response if you opened it in cAdvisor, with repro steps that use k3d and show how to observe the problem via `kubectl proxy`, but I'll leave it to your judgement.

I am going to move this issue to the k3s repo. It probably doesn't belong in the charts repo.
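For reference, a sketch of what those k3d repro steps could look like. It assumes the k3d v3 CLI and that rancher/k3s images exist for both tags, and uses `kubectl get --raw`, which reaches the same kubelet endpoint as going through `kubectl proxy`:

```
# One throwaway cluster per k3s version.
k3d cluster create c118 --image rancher/k3s:v1.18.8-k3s1
k3d cluster create c119 --image rancher/k3s:v1.19.2-k3s1

# Fetch the kubelet cadvisor metrics through the API server and count
# machine_cpu_cores samples: expected >0 on 1.18, 0 on 1.19.
for ctx in k3d-c118 k3d-c119; do
  node=$(kubectl --context "$ctx" get nodes -o name | head -n1 | cut -d/ -f2)
  kubectl --context "$ctx" get --raw "/api/v1/nodes/${node}/proxy/metrics/cadvisor" \
    | grep -c machine_cpu_cores
done
```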