rke: Metrics-server can't scrape nodes after enable_cri_dockerd is set to true

RKE version: v1.3.1

Docker version: (docker version, docker info preferred)

Client: Docker Engine - Community
 Version:           20.10.8
 API version:       1.41
 Go version:        go1.16.6
 Git commit:        3967b7d
 Built:             Fri Jul 30 19:53:39 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.8
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.6
  Git commit:       75249d8
  Built:            Fri Jul 30 19:52:00 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.9
  GitCommit:        e25210fe30a0a703442421b0f60afac609f950a3
 runc:
  Version:          1.0.1
  GitCommit:        v1.0.1-0-g4144b63
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

NAME="CentOS Stream"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Stream 8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_SUPPORT_PRODUCT_VERSION="CentOS Stream"

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)

Proxmox

cluster.yml file:

Possibly relevant lines:

system_images:
  metrics_server: rancher/mirrored-metrics-server:v0.5.0

enable_cri_dockerd: true
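
For context, a quick way to check that the kubelet containers were actually recreated with a remote CRI after setting enable_cri_dockerd (a diagnostic sketch run on a node, assuming RKE's usual "kubelet" container name; the exact flags RKE passes may differ):

# On a cluster node: look for the remote container runtime flags on the kubelet container.
docker inspect kubelet | grep -i 'container-runtime'
# Recent kubelet log lines mentioning the CRI, in case the flag is present but startup failed.
docker logs kubelet 2>&1 | grep -i 'cri' | tail -n 5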

Steps to Reproduce:

Upgrade RKE v1.2.11 -> v1.3.1 (Kubernetes v1.20.9 -> v1.21.5, metrics-server v0.4.1 -> v0.5.0)
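
For reference, the upgrade itself is presumably just a re-run of rke up against the same cluster.yml with the new binary (a sketch; the binary name below is an assumption, adjust to your setup):

# Replace the RKE binary, then re-run provisioning against the existing cluster.yml.
./rke_linux-amd64-v1.3.1 up --config cluster.yml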

Results:

After updating RKE, a second metrics-server instance (v0.5.0) is created that never becomes ready and cannot scrape the nodes, while the previous instance is still present and keeps working fine on the old version:

E0927 21:13:19.888512       1 scraper.go:139] "Failed to scrape node" err="Get \"https://192.168.1.19:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="node1"                                         
E0927 21:13:19.888556       1 scraper.go:139] "Failed to scrape node" err="Get \"https://192.168.1.20:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="node2"                                         
E0927 21:13:19.888591       1 scraper.go:139] "Failed to scrape node" err="Get \"https://192.168.1.21:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="node3"                                         
I0927 21:13:24.433910       1 server.go:188] "Failed probe" probe="metric-storage-ready" err="not metrics to serve"
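
To rule out a kubelet stats endpoint that is broken in general, the summary API can be queried through the API server proxy from any machine with kubectl access (a diagnostic sketch; the node name is an example):

# Fetch the kubelet's /stats/summary for node1 via the API server proxy.
kubectl get --raw "/api/v1/nodes/node1/proxy/stats/summary?only_cpu_and_memory=true" | head -c 300

If this returns JSON, the kubelet is serving stats and the problem is more likely in how the new metrics-server pod reaches port 10250 on the nodes.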

Diffing the old (good) and the new (bad) ReplicaSets, the only relevant change seems to be the secure port: 4443 became 443. Maybe a leading 4 digit went missing? (A sketch for testing that hypothesis follows the diff.)

diff -u replicaset.yaml replicaset-bad.yaml 
--- replicaset.yaml     2021-09-27 22:38:41.236441973 +0200
+++ replicaset-bad.yaml 2021-09-27 23:04:44.510924790 +0200
@@ -4,13 +4,14 @@
   annotations:
     deployment.kubernetes.io/desired-replicas: "1"
     deployment.kubernetes.io/max-replicas: "2"
-    deployment.kubernetes.io/revision: "4"
-  creationTimestamp: "2021-05-09T14:00:19Z"
-  generation: 1
+    deployment.kubernetes.io/revision: "8"
+    deployment.kubernetes.io/revision-history: "5"
+  creationTimestamp: "2021-09-27T20:11:58Z"
+  generation: 3
   labels:
     k8s-app: metrics-server
-    pod-template-hash: 55fdd84cd4
-  name: metrics-server-55fdd84cd4
+    pod-template-hash: 7bf4b68b78
+  name: metrics-server-7bf4b68b78
   namespace: kube-system
   ownerReferences:
   - apiVersion: apps/v1
@@ -19,20 +20,20 @@
     kind: Deployment
     name: metrics-server
     uid: 10b674cb-611f-4292-acc4-a4c095298cf2
-  resourceVersion: "276893173"
-  uid: 5e5cb232-85c7-45b0-b28c-a1e922eeac42
+  resourceVersion: "276904165"
+  uid: 2262ffba-08b7-498d-a32c-279b9c1c4a8e
 spec:
   replicas: 1
   selector:
     matchLabels:
       k8s-app: metrics-server
-      pod-template-hash: 55fdd84cd4
+      pod-template-hash: 7bf4b68b78
   template:
     metadata:
       creationTimestamp: null
       labels:
         k8s-app: metrics-server
-        pod-template-hash: 55fdd84cd4
+        pod-template-hash: 7bf4b68b78
       name: metrics-server
     spec:
       affinity:
@@ -49,11 +50,12 @@
       containers:
       - args:
         - --cert-dir=/tmp
-        - --secure-port=4443
+        - --secure-port=443
         - --kubelet-insecure-tls
         - --kubelet-preferred-address-types=InternalIP
+        - --metric-resolution=15s
         - --logtostderr
-        image: rancher/mirrored-metrics-server:v0.4.1
+        image: rancher/mirrored-metrics-server:v0.5.0
         imagePullPolicy: IfNotPresent
         livenessProbe:
           failureThreshold: 3
@@ -66,7 +68,7 @@
           timeoutSeconds: 1
         name: metrics-server
         ports:
-        - containerPort: 4443
+        - containerPort: 443
           name: https
           protocol: TCP
         readinessProbe:
@@ -75,10 +77,14 @@
             path: /readyz
             port: https
             scheme: HTTPS
+          initialDelaySeconds: 20
           periodSeconds: 10
           successThreshold: 1
           timeoutSeconds: 1
-        resources: {}
+        resources:
+          requests:
+            cpu: 100m
+            memory: 200Mi
         securityContext:
           readOnlyRootFilesystem: true
           runAsNonRoot: true
@@ -105,8 +111,6 @@
       - emptyDir: {}
         name: tmp-dir
 status:
-  availableReplicas: 1
   fullyLabeledReplicas: 1
-  observedGeneration: 1
-  readyReplicas: 1
+  observedGeneration: 3
   replicas: 1
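
A hedged way to test the port hypothesis is to patch the Deployment's args and container port back to 4443 and watch whether scraping recovers. This is a manual experiment rather than a fix, and RKE will likely revert it on the next rke up; the indexes below match the pod template shown in the diff:

# Change --secure-port and the container port from 443 back to 4443.
kubectl -n kube-system patch deploy metrics-server --type=json -p='[
  {"op":"replace","path":"/spec/template/spec/containers/0/args/1","value":"--secure-port=4443"},
  {"op":"replace","path":"/spec/template/spec/containers/0/ports/0/containerPort","value":4443}
]'
# The probes reference the named port "https", so they follow the change automatically.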

I also updated Rancher from the latest v2.5 patch release (I can't remember exactly which version) to v2.6.0, together with a monitoring chart update from 14.5.100 to 100.0.0+up16.6.0. RKE is not provisioned from Rancher, though; I only use Rancher to deploy the monitoring stack easily. That Rancher update happened before the RKE update and did not touch metrics-server. metrics-server only started failing on v0.5.0 after the RKE update; v0.4.1 still works fine, even after manually deleting its ReplicaSet and restoring it from a YAML backup (see the sketch below).
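
For completeness, the backup/restore described above can be done roughly like this (a sketch; the file name is made up, and server-generated fields such as resourceVersion, uid and status should be stripped from the export before re-applying):

# Export the working v0.4.1 ReplicaSet before touching anything.
kubectl -n kube-system get rs metrics-server-55fdd84cd4 -o yaml > metrics-server-rs-v0.4.1.yaml
# After it gets deleted, recreate it from the cleaned-up backup.
kubectl -n kube-system apply -f metrics-server-rs-v0.4.1.yaml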

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 22 (5 by maintainers)

Most upvoted comments

@moray95 - This will be fixed in Rancher v2.6.7 according to https://github.com/rancher/rke/issues/2938