rancher: [BUG, RKE1, Monitoring V2] RKE1 1.24 seems to be omitting relevant cadvisor container labels and metric series that break Monitoring V2 dashboards
Rancher Server Setup
- Rancher version: v2.6.8
Information about the Cluster
- Kubernetes version: v1.24.2
- RKE v1.3.14
- rancher-monitoring:100.1.3+up19.0.3
Describe the bug
Since the last Rancher update to 2.6.7, the Rancher Monitoring pod metrics graphs show “No data”. Updating to Rancher 2.6.8 doesn’t fix that.
The Grafana graph definitions use queries like this:
container_memory_working_set_bytes{container!="POD",namespace=~"$namespace",pod=~"$pod", container!=""}
But that “container” label is no longer exposed by Prometheus, so the container!="" filter prevents Grafana from fetching data from Prometheus.
If I remove that filter, like this:
container_memory_working_set_bytes{container!="POD",namespace=~"$namespace",pod=~"$pod"}
Grafana shows metric graphs again, at least until the Grafana pod is restarted.
I’ve tried reinstalling Rancher Monitoring, but that doesn’t help either.
Why did this container label disappear from Prometheus, and how can I fix it?
SURE-5582
About this issue
- State: closed
- Created 2 years ago
- Reactions: 7
- Comments: 28 (11 by maintainers)
K8s 1.24 removed the Docker plugin from cAdvisor. So while you can use cri-dockerd (Docker by Mirantis) to keep Docker as the container runtime, the kubelet can no longer retrieve Docker container information such as image, pod, and container labels through cAdvisor.
I created a workaround that brings back the labels by running cAdvisor standalone and scraping it with its own ServiceMonitor. My setup:
- cAdvisor standalone & ServiceMonitor YAML
- Disable kubelet.serviceMonitor.cAdvisor in the rancher-monitoring chart
A sketch of both pieces is shown below.
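For reference, here is a minimal sketch of what such a setup can look like. It is not the exact manifest from this thread; the namespace (cattle-monitoring-system), the cadvisor image tag, the release: rancher-monitoring ServiceMonitor label, and the metric relabelings that map the Docker io.kubernetes.* container labels back to pod/namespace/container are all assumptions and need to be adapted to your cluster.

First, the rancher-monitoring Helm values to stop scraping cAdvisor through the kubelet, as named in the list above:

```yaml
# rancher-monitoring values: disable the kubelet-based cAdvisor scrape,
# since the standalone cAdvisor below provides those series instead
kubelet:
  serviceMonitor:
    cAdvisor: false
```

Then a standalone cAdvisor DaemonSet plus a Service/ServiceMonitor so the chart's Prometheus scrapes it:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor
  namespace: cattle-monitoring-system   # assumed namespace
  labels:
    app: cadvisor
spec:
  selector:
    matchLabels:
      app: cadvisor
  template:
    metadata:
      labels:
        app: cadvisor
    spec:
      containers:
        - name: cadvisor
          image: gcr.io/cadvisor/cadvisor:v0.45.0   # assumed image/tag
          ports:
            - name: http
              containerPort: 8080
          volumeMounts:   # read-only host mounts cAdvisor needs to see Docker containers
            - { name: rootfs, mountPath: /rootfs, readOnly: true }
            - { name: var-run, mountPath: /var/run, readOnly: true }
            - { name: sys, mountPath: /sys, readOnly: true }
            - { name: docker, mountPath: /var/lib/docker, readOnly: true }
      volumes:
        - { name: rootfs, hostPath: { path: / } }
        - { name: var-run, hostPath: { path: /var/run } }
        - { name: sys, hostPath: { path: /sys } }
        - { name: docker, hostPath: { path: /var/lib/docker } }
---
apiVersion: v1
kind: Service
metadata:
  name: cadvisor
  namespace: cattle-monitoring-system
  labels:
    app: cadvisor
spec:
  selector:
    app: cadvisor
  ports:
    - name: http
      port: 8080
      targetPort: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cadvisor
  namespace: cattle-monitoring-system
  labels:
    release: rancher-monitoring   # assumed; only needed if the chart's serviceMonitorSelector filters on it
spec:
  selector:
    matchLabels:
      app: cadvisor
  namespaceSelector:
    matchNames:
      - cattle-monitoring-system
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      metricRelabelings:
        # Map the io.kubernetes.* container labels that standalone cAdvisor exposes for
        # Docker containers back to the pod/namespace/container labels the dashboards
        # expect (label names are assumptions based on the Docker runtime's labels)
        - sourceLabels: [container_label_io_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [container_label_io_kubernetes_pod_namespace]
          targetLabel: namespace
        - sourceLabels: [container_label_io_kubernetes_container_name]
          targetLabel: container
```

As noted later in the thread, this mostly helps on Docker/cri-dockerd setups; on containerd, cAdvisor itself only reports a small set of filesystem metrics.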
@xadcoh There’s an open issue in rke2 for this: https://github.com/rancher/rke2/issues/1167. Based on https://github.com/rancher/rke2/issues/1167#issuecomment-1190065071 and https://github.com/rancher/rke2/issues/1167#issuecomment-1169034146, it looks like containerd itself doesn’t report all the metrics; only disk metrics are supported:
cAdvisor only reports fs_inodes_free, fs_inodes_total, fs_usage_bytes and fs_limit_bytes for containerd https://github.com/google/cadvisor/pull/2936.
Pass: Verified in 2.7.0-rc9
Tried the steps listed in https://github.com/rancher/rancher/issues/38934#issuecomment-1294585708; the dashboard now has values. Moving to release notes now, as this has been confirmed as a valid workaround. Also adding more dashboards to our regression testing.
@sowmyav27 @ronhorton, I’ve closed the forwardport that was created, as I don’t think we should close this issue based on a workaround. Please validate the workaround and send it back to “[zube]: Release Note” status. Once it’s release noted, we can bump it to one of the next milestones to properly address the issue after upstream addresses it.
Seems like fixing symptoms rather than the root problem. We have no issues with a similar kube-prometheus-stack installation based on kubeadm-managed Kubernetes 1.24.
The problem is that Rancher’s cAdvisor doesn’t publish container labels anymore; we upgraded from Kubernetes 1.20 to 1.24 and stopped getting metrics.