rancher: [BUG, RKE1, Monitoring V2] RKE1 1.24 seems to be omitting relevant cadvisor container labels and metric series that break Monitoring V2 dashboards
Rancher Server Setup
- Rancher version: v2.6.8
Information about the Cluster
- Kubernetes version: v1.24.2
- RKE v1.3.14
- rancher-monitoring:100.1.3+up19.0.3
Describe the bug
Since the last Rancher update to 2.6.7, the rancher-monitoring pod metrics graphs have shown “No data”. Updating to Rancher 2.6.8 doesn’t fix it.
The Grafana graph definitions use queries like this:
container_memory_working_set_bytes{container!="POD",namespace=~"$namespace",pod=~"$pod", container!=""}
But that “container” label is no longer present in Prometheus, so the filter container!="" prevents Grafana from fetching any data from Prometheus.
If I remove that filter like this
container_memory_working_set_bytes{container!="POD",namespace=~"$namespace",pod=~"$pod"}
Grafana shows the metric graphs again, at least until the Grafana pod is restarted.
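(Not part of the original report, just a quick way to confirm the diagnosis.) Grouping the metric by the suspect label shows whether it is attached at all; if the only group that comes back has an empty container value, the label is missing from the scraped series rather than merely filtered out:

count by (container) (container_memory_working_set_bytes{namespace=~"$namespace", pod=~"$pod"})

Dropping the container!="" filter does bring data back, but note that cAdvisor also exposes pod-level aggregate series with an empty container label, which is the reason the stock dashboards filter on it in the first place.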
I’ve tried reinstalling rancher-monitoring, but that doesn’t help either.
Why did this container label disappear from Prometheus, and how can I fix it?
SURE-5582
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 7
- Comments: 28 (11 by maintainers)
K8s 1.24 removed the Docker plugin from cAdvisor. So while you can use cri-dockerd (Docker by Mirantis) to keep Docker as the container runtime, the kubelet can no longer retrieve Docker container information such as image, pod, and container labels through cAdvisor.
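A rough way to see this from the Prometheus side (my own sketch; it assumes the kubelet cAdvisor target carries the job="kubelet" and metrics_path="/metrics/cadvisor" labels that rancher-monitoring normally applies):

count by (container, image) (container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor"})

On an affected RKE1 1.24 node the series are still scraped, but they collapse into groups with empty container and image values, matching the missing Docker-derived labels described above.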
I created a workaround that brings the labels back by deploying a standalone cAdvisor DaemonSet with a ServiceMonitor and disabling the kubelet cAdvisor scrape in the rancher-monitoring chart.
My setup
cAdvisor standalone & ServiceMonitor yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: cadvisor
  name: cadvisor
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app: cadvisor
  name: cadvisor
rules:
  - apiGroups:
      - policy
    resourceNames:
      - cadvisor
    resources:
      - podsecuritypolicies
    verbs:
      - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app: cadvisor
  name: cadvisor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cadvisor
subjects:
  - kind: ServiceAccount
    name: cadvisor
    namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: docker/default
  labels:
    app: cadvisor
  name: cadvisor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cadvisor
      name: cadvisor
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        app: cadvisor
        name: cadvisor
    spec:
      automountServiceAccountToken: false
      containers:
        - args:
            - --housekeeping_interval=10s
            - --max_housekeeping_interval=15s
            - --event_storage_event_limit=default=0
            - --event_storage_age_limit=default=0
            - --enable_metrics=app,cpu,disk,diskIO,memory,network,process
            - --docker_only
            - --store_container_labels=false
            - --whitelisted_container_labels=io.kubernetes.container.name,io.kubernetes.pod.name,io.kubernetes.pod.namespace
          image: gcr.io/cadvisor/cadvisor:v0.45.0
          name: cadvisor
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          resources:
            limits:
              cpu: 800m
              memory: 2000Mi
            requests:
              cpu: 400m
              memory: 400Mi
          volumeMounts:
            - mountPath: /rootfs
              name: rootfs
              readOnly: true
            - mountPath: /var/run
              name: var-run
              readOnly: true
            - mountPath: /sys
              name: sys
              readOnly: true
            - mountPath: /var/lib/docker
              name: docker
              readOnly: true
            - mountPath: /dev/disk
              name: disk
              readOnly: true
      priorityClassName: system-node-critical
      serviceAccountName: cadvisor
      terminationGracePeriodSeconds: 30
      tolerations:
        - key: node-role.kubernetes.io/controlplane
          value: "true"
          effect: NoSchedule
        - key: node-role.kubernetes.io/etcd
          value: "true"
          effect: NoExecute
      volumes:
        - hostPath:
            path: /
          name: rootfs
        - hostPath:
            path: /var/run
          name: var-run
        - hostPath:
            path: /sys
          name: sys
        - hostPath:
            path: /var/lib/docker
          name: docker
        - hostPath:
            path: /dev/disk
          name: disk
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  labels:
    app: cadvisor
  name: cadvisor
  namespace: kube-system
spec:
  allowedHostPaths:
    - pathPrefix: /
    - pathPrefix: /var/run
    - pathPrefix: /sys
    - pathPrefix: /var/lib/docker
    - pathPrefix: /dev/disk
  fsGroup:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
    - '*'
---
apiVersion: v1
kind: Service
metadata:
  name: cadvisor
  labels:
    app: cadvisor
  namespace: kube-system
spec:
  selector:
    app: cadvisor
  ports:
    - name: cadvisor
      port: 8080
      protocol: TCP
      targetPort: 8080
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: cadvisor
  name: cadvisor
  namespace: kube-system
spec:
  endpoints:
    - metricRelabelings:
        - sourceLabels:
            - container_label_io_kubernetes_pod_name
          targetLabel: pod
        - sourceLabels:
            - container_label_io_kubernetes_container_name
          targetLabel: container
        - sourceLabels:
            - container_label_io_kubernetes_pod_namespace
          targetLabel: namespace
        - action: labeldrop
          regex: container_label_io_kubernetes_pod_name
        - action: labeldrop
          regex: container_label_io_kubernetes_container_name
        - action: labeldrop
          regex: container_label_io_kubernetes_pod_namespace
      port: cadvisor
      relabelings:
        - sourceLabels:
            - __meta_kubernetes_pod_node_name
          targetLabel: node
        - sourceLabels:
            - __metrics_path__
          targetLabel: metrics_path
          replacement: /metrics/cadvisor
        - sourceLabels:
            - job
          targetLabel: job
          replacement: kubelet
  namespaceSelector:
    matchNames:
      - kube-system
  selector:
    matchLabels:
      app: cadvisor
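The piece that makes the dashboards work again is the metricRelabelings block above: with --whitelisted_container_labels, the standalone cAdvisor exports the Docker-set labels as container_label_io_kubernetes_* labels, and the ServiceMonitor copies them back into the container/pod/namespace labels (and rewrites job to kubelet) that the existing queries expect. A hedged sketch of the effect, not part of the original comment:

# container_label_io_kubernetes_container_name -> container
# container_label_io_kubernetes_pod_name       -> pod
# container_label_io_kubernetes_pod_namespace  -> namespace
container_memory_working_set_bytes{container!="POD", container!="", namespace=~"$namespace", pod=~"$pod"}

With the relabeled metrics in place, this original dashboard selector matches series again without touching the Grafana panels.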
Disable kubelet.serviceMonitor.cAdvisor in the rancher-monitoring chart:

kubelet:
  serviceMonitor:
    cAdvisor: false

@xadcoh There’s an open issue in rke2 for this: https://github.com/rancher/rke2/issues/1167. Based on https://github.com/rancher/rke2/issues/1167#issuecomment-1190065071 and https://github.com/rancher/rke2/issues/1167#issuecomment-1169034146, it looks like containerd itself doesn’t report all the metrics and only disk metrics are supported:
The cadvisor only reports fs_inodes_free, fs_inodes_total, fs_usage_bytes and fs_limit_bytes for containerd: https://github.com/google/cadvisor/pull/2936

Pass. Verified in 2.7.0-rc9: tried the steps listed in https://github.com/rancher/rancher/issues/38934#issuecomment-1294585708 and the dashboard now has values. Moving to release notes now, as this has been confirmed as a valid workaround; also adding more dashboards to our regression testing.

@sowmyav27 @ronhorton, I’ve closed the forwardport that was created, as I don’t think we should close this issue based on a workaround. Please validate the workaround and send it back to “[zube]: Release Note” status. Once it’s release-noted, we can bump it to one of the next milestones to properly address the issue after upstream addresses it.
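For anyone re-validating the workaround, one quick check (my suggestion, not something quoted from the thread): once the standalone cAdvisor is scraped and relabeled, series with a non-empty container label should show up again under the kubelet job, so the untouched dashboards get data:

count by (job) (container_memory_working_set_bytes{container!="", container!="POD"})

A non-zero count under job="kubelet" means the relabeled cAdvisor metrics are flowing; an empty result means the dashboards will still show “No data”.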
This seems like fixing symptoms rather than the root problem. We have no issues with a similar kube-prometheus-stack installation based on kubeadm-managed Kubernetes 1.24.

The problem is that Rancher’s cAdvisor doesn’t publish container labels anymore; we upgraded from Kubernetes 1.20 to 1.24 and stopped getting metrics.