rke: Metrics-server can't scrape nodes after enable_cri_dockerd is set to true
RKE version: v1.3.1
Docker version: (docker version,docker info preferred)
Client: Docker Engine - Community
 Version:           20.10.8
 API version:       1.41
 Go version:        go1.16.6
 Git commit:        3967b7d
 Built:             Fri Jul 30 19:53:39 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.8
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.6
  Git commit:       75249d8
  Built:            Fri Jul 30 19:52:00 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.9
  GitCommit:        e25210fe30a0a703442421b0f60afac609f950a3
 runc:
  Version:          1.0.1
  GitCommit:        v1.0.1-0-g4144b63
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
NAME="CentOS Stream"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Stream 8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_SUPPORT_PRODUCT_VERSION="CentOS Stream"
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
Proxmox
cluster.yml file:
Possibly relevant lines:
system_images:
  metrics_server: rancher/mirrored-metrics-server:v0.5.0
enable_cri_dockerd: true
Steps to Reproduce:
Upgrade RKE v1.2.11 -> v1.3.1 (Kubernetes v1.20.9 -> v1.21.5, metrics-server v0.4.1 -> v0.5.0)
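For completeness, here is a minimal sketch of that upgrade path, assuming the usual RKE workflow (binary swap plus rke up); the download URL and version listing command follow RKE's standard release/CLI conventions and are not taken verbatim from this report:

# Replace the RKE binary with v1.3.1 (standard GitHub release asset name).
curl -LO https://github.com/rancher/rke/releases/download/v1.3.1/rke_linux-amd64
chmod +x rke_linux-amd64 && sudo mv rke_linux-amd64 /usr/local/bin/rke

# List the Kubernetes versions this RKE build supports, set the matching v1.21.x
# tag as kubernetes_version in cluster.yml, then re-provision the cluster.
rke config --list-version --all
rke up --config cluster.yml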
Results:
After updating RKE, a second metrics-server ReplicaSet is created whose pod is unable to start and cannot scrape the nodes, while the previous ReplicaSet is still present and keeps working fine with the previous version:
E0927 21:13:19.888512 1 scraper.go:139] "Failed to scrape node" err="Get \"https://192.168.1.19:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="node1"
E0927 21:13:19.888556 1 scraper.go:139] "Failed to scrape node" err="Get \"https://192.168.1.20:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="node2"
E0927 21:13:19.888591 1 scraper.go:139] "Failed to scrape node" err="Get \"https://192.168.1.21:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="node3"
I0927 21:13:24.433910 1 server.go:188] "Failed probe" probe="metric-storage-ready" err="not metrics to serve"
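For anyone reproducing this, a quick way to separate a connectivity problem from an auth problem is to hit the kubelet summary endpoint directly, the same way metrics-server does. This is a rough sketch (not from the original report); it assumes the service account is named metrics-server in kube-system and that the cluster still issues legacy token secrets, which holds on Kubernetes v1.21:

# Both ReplicaSets (old v0.4.1 and new v0.5.0) should be visible side by side.
kubectl -n kube-system get rs,pods -l k8s-app=metrics-server -o wide

# Call the kubelet summary API with the metrics-server service account token.
# A timeout here points at connectivity to port 10250; an HTTP 401/403 points at auth.
SECRET=$(kubectl -n kube-system get sa metrics-server -o jsonpath='{.secrets[0].name}')
TOKEN=$(kubectl -n kube-system get secret "$SECRET" -o jsonpath='{.data.token}' | base64 -d)
curl -sk --max-time 5 -H "Authorization: Bearer $TOKEN" \
  "https://192.168.1.19:10250/stats/summary?only_cpu_and_memory=true" | head -c 300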
Diffing the old (good) and the new (bad) ReplicaSets, the only relevant change seems to be the secure port: 4443 became 443. Maybe the leading 4 was dropped? A patch to test this hypothesis is sketched after the diff.
diff -u replicaset.yaml replicaset-bad.yaml
--- replicaset.yaml 2021-09-27 22:38:41.236441973 +0200
+++ replicaset-bad.yaml 2021-09-27 23:04:44.510924790 +0200
@@ -4,13 +4,14 @@
annotations:
deployment.kubernetes.io/desired-replicas: "1"
deployment.kubernetes.io/max-replicas: "2"
- deployment.kubernetes.io/revision: "4"
- creationTimestamp: "2021-05-09T14:00:19Z"
- generation: 1
+ deployment.kubernetes.io/revision: "8"
+ deployment.kubernetes.io/revision-history: "5"
+ creationTimestamp: "2021-09-27T20:11:58Z"
+ generation: 3
labels:
k8s-app: metrics-server
- pod-template-hash: 55fdd84cd4
- name: metrics-server-55fdd84cd4
+ pod-template-hash: 7bf4b68b78
+ name: metrics-server-7bf4b68b78
namespace: kube-system
ownerReferences:
- apiVersion: apps/v1
@@ -19,20 +20,20 @@
kind: Deployment
name: metrics-server
uid: 10b674cb-611f-4292-acc4-a4c095298cf2
- resourceVersion: "276893173"
- uid: 5e5cb232-85c7-45b0-b28c-a1e922eeac42
+ resourceVersion: "276904165"
+ uid: 2262ffba-08b7-498d-a32c-279b9c1c4a8e
spec:
replicas: 1
selector:
matchLabels:
k8s-app: metrics-server
- pod-template-hash: 55fdd84cd4
+ pod-template-hash: 7bf4b68b78
template:
metadata:
creationTimestamp: null
labels:
k8s-app: metrics-server
- pod-template-hash: 55fdd84cd4
+ pod-template-hash: 7bf4b68b78
name: metrics-server
spec:
affinity:
@@ -49,11 +50,12 @@
containers:
- args:
- --cert-dir=/tmp
- - --secure-port=4443
+ - --secure-port=443
- --kubelet-insecure-tls
- --kubelet-preferred-address-types=InternalIP
+ - --metric-resolution=15s
- --logtostderr
- image: rancher/mirrored-metrics-server:v0.4.1
+ image: rancher/mirrored-metrics-server:v0.5.0
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
@@ -66,7 +68,7 @@
timeoutSeconds: 1
name: metrics-server
ports:
- - containerPort: 4443
+ - containerPort: 443
name: https
protocol: TCP
readinessProbe:
@@ -75,10 +77,14 @@
path: /readyz
port: https
scheme: HTTPS
+ initialDelaySeconds: 20
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
- resources: {}
+ resources:
+ requests:
+ cpu: 100m
+ memory: 200Mi
securityContext:
readOnlyRootFilesystem: true
runAsNonRoot: true
@@ -105,8 +111,6 @@
- emptyDir: {}
name: tmp-dir
status:
- availableReplicas: 1
fullyLabeledReplicas: 1
- observedGeneration: 1
- readyReplicas: 1
+ observedGeneration: 3
replicas: 1
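To test the port hypothesis directly, the secure port can be flipped back to 4443 on the live Deployment. This is only a debugging sketch, not an official fix: RKE manages this addon, so the next rke up will most likely revert the change. The args/ports indexes match the diff above (args[1] is --secure-port, ports[0] is the https containerPort):

# Point metrics-server back at --secure-port=4443 and the matching containerPort.
kubectl -n kube-system patch deployment metrics-server --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/args/1", "value": "--secure-port=4443"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/ports/0/containerPort", "value": 4443}
]'

The liveness and readiness probes target the named port https, so they follow the containerPort change automatically.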
I also updated Rancher from the latest v2.5 patch release (I can't remember the exact version) to v2.6.0, with a monitoring chart update from 14.5.100 to 100.0.0+up16.6.0. However, this cluster is not provisioned from Rancher; I only use Rancher to easily deploy the monitoring stack. The Rancher update happened before the RKE update and did not touch metrics-server. The failure only appeared with v0.5.0 after the RKE update; v0.4.1 keeps working fine, even after manually deleting its ReplicaSet and restoring it from a YAML backup.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 2
- Comments: 22 (5 by maintainers)
@moray95 - This will be fixed in Rancher v2.6.7 according to https://github.com/rancher/rke/issues/2938