kubernetes: Kubernetes API server stuck on metrics-server APIService discovery failure when the machine is powered off or the network is disconnected

What happened:

When the worker node running the metrics-server pod loses power or its network is disconnected, the metrics-server pod is rescheduled onto another node, but kube-apiserver still cannot connect to metrics-server.

kube-apiserver log:

E1203 10:24:04.246085       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:24:09.246520       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:24:34.246193       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:24:39.246591       1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
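
The same failure can be reproduced by hand by querying the aggregated API through kube-apiserver (a diagnostic sketch, not from the original report):

kubectl get --raw /apis/metrics.k8s.io/v1beta1

While the stale connection persists, this call hangs and eventually fails with the same client timeout as in the log above.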

kubectl get pods -o wide -n kube-system

NAME                                      READY   STATUS     RESTARTS   AGE     IP               NODE      NOMINATED NODE   READINESS GATES
calico-kube-controllers-bbdc58449-cgsbt   1/1     Running    5          22d     10.244.137.79    master1   <none>           <none>
calico-node-2qk2b                         1/1     NodeLost   5          21d     192.168.210.74   worker1   <none>           <none>
calico-node-gjdrw                         1/1     Running    5          22d     192.168.210.73   master3   <none>           <none>
calico-node-jwbnz                         1/1     Running    4          22d     192.168.210.72   master2   <none>           <none>
calico-node-svmv7                         1/1     Running    5          22d     192.168.210.71   master1   <none>           <none>
coredns-85d448b787-8nks7                  1/1     Running    5          22d     10.244.137.87    master1   <none>           <none>
coredns-85d448b787-lpxtt                  1/1     Running    5          22d     10.244.137.80    master1   <none>           <none>
etcd-master1                              1/1     Running    5          22d     192.168.210.71   master1   <none>           <none>
kube-apiserver-master1                    1/1     Running    2          7d8h    192.168.210.71   master1   <none>           <none>
kube-controller-manager-master1           1/1     Running    9          22d     192.168.210.71   master1   <none>           <none>
kube-proxy-7wkbr                          1/1     Running    5          22d     192.168.210.71   master1   <none>           <none>
kube-proxy-8d7dj                          1/1     Running    4          22d     192.168.210.72   master2   <none>           <none>
kube-proxy-nhdsn                          1/1     NodeLost   4          21d     192.168.210.74   worker1   <none>           <none>
kube-proxy-sfbjm                          1/1     Running    4          22d     192.168.210.73   master3   <none>           <none>
kube-scheduler-master1                    1/1     Running    9          22d     192.168.210.71   master1   <none>           <none>
metrics-server-8b7689b66-xm6mf            1/1     Running    0          36s     10.244.180.55    master2   <none>           <none>
metrics-server-8b7689b66-z9hk9            1/1     Unknown    0          3m58s   10.244.235.186   worker1   <none>           <none>
tiller-deploy-5fd994b8f-twpn2             1/1     Running    0          3h13m   10.244.180.23    master2   <none>           <none>

kubectl get apiservice | grep metrics

v1beta1.metrics.k8s.io   kube-system/metrics-server   False (FailedDiscoveryCheck)   22d

Pod IP 10.244.235.186 (on the lost node) is unreachable, while Pod IP 10.244.180.55 (the rescheduled pod) is healthy:

metrics-server-8b7689b66-xm6mf            1/1     Running    0          36s     10.244.180.55    master2   <none>           <none>
metrics-server-8b7689b66-z9hk9            1/1     Unknown    0          3m58s   10.244.235.186   worker1   <none>           <none>

kubectl get svc -n kube-system

NAME                               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE
metrics-server                     ClusterIP   10.101.186.48  <none>        443/TCP                  21d

Conntrack still points to the wrong Pod IP, 10.244.235.186:

conntrack -L | grep 10.101.186.48

tcp      6 278 ESTABLISHED src=10.101.186.48 dst=10.101.186.48 sport=45842 dport=443 src=10.244.235.186 dst=192.168.210.71 sport=443 dport=19158 [ASSURED] mark=0 use=1
tcp      6 298 ESTABLISHED src=10.101.186.48 dst=10.101.186.48 sport=45820 dport=443 src=10.244.235.186 dst=192.168.210.71 sport=443 dport=15276 [ASSURED] mark=0 use=2
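
A possible manual mitigation (my suggestion; the reporter did not confirm it) is to delete the stale conntrack entries answered by the dead pod so that retransmitted packets stop matching the old NAT state:

# Delete entries whose reply source is the dead Pod IP
conntrack -D -r 10.244.235.186

In IPVS mode this may not be sufficient on its own, because connection state also lives in the IPVS connection table shown below.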

ipvsadm -Ln

TCP  10.101.186.48:443 rr
  -> 10.244.180.55:443            Masq    1      0          0         
  -> 10.244.235.186:443           Masq    0      2          0 

ipvsadm -lnc | grep 10.101.186.48

TCP 14:58  ESTABLISHED 10.101.186.48:56312 10.101.186.48:443  10.244.235.186:443
TCP 00:18  CLOSE_WAIT  10.101.186.48:56328 10.101.186.48:443  10.244.235.186:443
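
The dead backend's weight is already 0, yet IPVS by default keeps established connections pinned to a real server even after it becomes unavailable. Two kernel knobs change that behavior; enabling them is a hedged suggestion of mine, not something the reporter tested:

# Expire connections whose real server has been removed
sysctl -w net.ipv4.vs.expire_nodest_conn=1
# Also expire persistence templates for quiescent (weight 0) servers
sysctl -w net.ipv4.vs.expire_quiescent_template=1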

After about 15 minutes the stale connection times out and kube-apiserver reaches the correct pod; restarting kube-apiserver also recovers immediately. The ~15-minute delay roughly matches the total TCP retransmission timeout implied by the default net.ipv4.tcp_retries2=15 (about 924 seconds).

The following workaround proved effective: lowering the kernel parameter net.ipv4.tcp_retries2 from 15 to 1. With that setting, after a power failure the stale connection is released within about one minute and traffic reaches 10.244.180.55.
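
A minimal sketch of applying and persisting that workaround on the kube-apiserver nodes (note that tcp_retries2=1 is aggressive and can kill healthy long-lived connections on lossy links; a moderate value such as 3 may be a safer compromise):

# Apply immediately
sysctl -w net.ipv4.tcp_retries2=1

# Persist across reboots
echo 'net.ipv4.tcp_retries2 = 1' > /etc/sysctl.d/99-tcp-retries2.conf
sysctl --system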

Shutting the node down cleanly with shutdown does not trigger this bug.

Environment:

Kubernetes version (use kubectl version): 1.16.10
Cloud provider or hardware configuration: VMware
OS (e.g: cat /etc/os-release): CentOS 7.5
Kernel (e.g. uname -a): 3.10
Install tools: Cloud provider managed service install tools
Network plugin and version (if this is a network-related bug): Calico v3.9.3
Others: None.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 17 (7 by maintainers)

Most upvoted comments

Please upgrade the Metrics Server image to v0.4.1 and verify whether it is affected by the same problem.
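
A hedged example of performing that upgrade in place (the registry path and tag below are assumptions; adjust to your deployment):

kubectl -n kube-system set image deployment/metrics-server \
  metrics-server=k8s.gcr.io/metrics-server/metrics-server:v0.4.1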