kubernetes: Kubernetes API server stuck on metrics-server API service discovery failure when the machine is powered off or the network is disconnected
What happened:
When the worker node running the metrics-server pod loses power or its network is disconnected, a new metrics-server pod is started on another node, but kube-apiserver still cannot connect to the metrics-server.
kube-apiserver log:
E1203 10:24:04.246085 1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:24:09.246520 1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:24:34.246193 1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1203 10:24:39.246591 1 available_controller.go:416] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: Get https://10.101.186.48:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
kubectl get pods -o wide -n kube-system
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-kube-controllers-bbdc58449-cgsbt 1/1 Running 5 22d 10.244.137.79 master1 <none> <none>
calico-node-2qk2b 1/1 NodeLost 5 21d 192.168.210.74 worker1 <none> <none>
calico-node-gjdrw 1/1 Running 5 22d 192.168.210.73 master3 <none> <none>
calico-node-jwbnz 1/1 Running 4 22d 192.168.210.72 master2 <none> <none>
calico-node-svmv7 1/1 Running 5 22d 192.168.210.71 master1 <none> <none>
coredns-85d448b787-8nks7 1/1 Running 5 22d 10.244.137.87 master1 <none> <none>
coredns-85d448b787-lpxtt 1/1 Running 5 22d 10.244.137.80 master1 <none> <none>
etcd-master1 1/1 Running 5 22d 192.168.210.71 master1 <none> <none>
kube-apiserver-master1 1/1 Running 2 7d8h 192.168.210.71 master1 <none> <none>
kube-controller-manager-master1 1/1 Running 9 22d 192.168.210.71 master1 <none> <none>
kube-proxy-7wkbr 1/1 Running 5 22d 192.168.210.71 master1 <none> <none>
kube-proxy-8d7dj 1/1 Running 4 22d 192.168.210.72 master2 <none> <none>
kube-proxy-nhdsn 1/1 NodeLost 4 21d 192.168.210.74 worker1 <none> <none>
kube-proxy-sfbjm 1/1 Running 4 22d 192.168.210.73 master3 <none> <none>
kube-scheduler-master1 1/1 Running 9 22d 192.168.210.71 master1 <none> <none>
metrics-server-8b7689b66-xm6mf 1/1 Running 0 36s 10.244.180.55 master2 <none> <none>
metrics-server-8b7689b66-z9hk9 1/1 Unknown 0 3m58s 10.244.235.186 worker1 <none> <none>
tiller-deploy-5fd994b8f-twpn2 1/1 Running 0 3h13m 10.244.180.23 master2 <none> <none>
kubectl get apiservice | grep metrics
v1beta1.metrics.k8s.io kube-system/metrics-server False (FailedDiscoveryCheck) 22d
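If needed, the condition behind the FailedDiscoveryCheck can be inspected directly on the APIService object (the name v1beta1.metrics.k8s.io is taken from the output above):
kubectl describe apiservice v1beta1.metrics.k8s.io
kubectl get apiservice v1beta1.metrics.k8s.io -o jsonpath='{.status.conditions}{"\n"}'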
Pod IP 10.244.235.186 is unreachable; pod IP 10.244.180.55 is reachable:
metrics-server-8b7689b66-xm6mf 1/1 Running 0 36s 10.244.180.55 master2 <none> <none>
metrics-server-8b7689b66-z9hk9 1/1 Unknown 0 3m58s 10.244.235.186 worker1 <none> <none>
kubectl get svc -n kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
metrics-server ClusterIP 10.101.186.48 <none> 443/TCP 21d
The connection tracking table still points to the wrong pod IP 10.244.235.186:
conntrack -L | grep 10.101.186.48
tcp 6 278 ESTABLISHED src=10.101.186.48 dst=10.101.186.48 sport=45842 dport=443 src=10.244.235.186 dst=192.168.210.71 sport=443 dport=19158 [ASSURED] mark=0 use=1
tcp 6 298 ESTABLISHED src=10.101.186.48 dst=10.101.186.48 sport=45820 dport=443 src=10.244.235.186 dst=192.168.210.71 sport=443 dport=15276 [ASSURED] mark=0 use=2
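Assuming the conntrack-tools CLI is available on master1, the stale entries whose reply source is the dead pod IP 10.244.235.186 can be deleted manually; note that this only clears the netfilter conntrack side, while IPVS keeps its own connection table (shown below with ipvsadm -lnc):
conntrack -D -r 10.244.235.186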
ipvsadm -Ln
TCP 10.101.186.48:443 rr
-> 10.244.180.55:443 Masq 1 0 0
-> 10.244.235.186:443 Masq 0 2 0
ipvsadm -lnc | grep 10.101.186.48
TCP 14:58 ESTABLISHED 10.101.186.48:56312 10.101.186.48:443 10.244.235.186:443
TCP 00:18 CLOSE_WAIT 10.101.186.48:56328 10.101.186.48:443 10.244.235.186:443
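The dead real server 10.244.235.186 already has weight 0, but IPVS keeps the established connections until they time out. One commonly suggested mitigation for this class of problem is to let IPVS expire connections whose real server has been removed or quiesced; whether these sysctls help on this 3.10 kernel and kube-proxy configuration is an assumption to verify:
sysctl -w net.ipv4.vs.expire_nodest_conn=1
sysctl -w net.ipv4.vs.expire_quiescent_template=1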
Only after about 15 minutes does kube-apiserver connect to the correct pod; alternatively, restarting kube-apiserver recovers it immediately.
The following workaround proved effective: changing the kernel parameter net.ipv4.tcp_retries2 from 15 to 1. In the power-failure case, the stale connections are then released after about 1 minute and traffic is directed to 10.244.180.55.
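A sketch of making that workaround persistent across reboots (the file name under /etc/sysctl.d/ is arbitrary); note that tcp_retries2 is node-wide and affects every TCP connection, so a less aggressive value such as 3-5 may be a safer compromise:
cat <<'EOF' > /etc/sysctl.d/90-tcp-retries2.conf
net.ipv4.tcp_retries2 = 1
EOF
sysctl --system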
Shutting the node down cleanly with the shutdown command does not trigger this bug.
Environment:
Kubernetes version (use kubectl version): 1.16.10
Cloud provider or hardware configuration: Vmware
OS (e.g: cat /etc/os-release): CentOS 7.5
Kernel (e.g. uname -a): 3.10
Install tools: Cloud provider managed service install tools
Network plugin and version (if this is a network-related bug): Calico v3.9.3
Others: None.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 17 (7 by maintainers)
Please upgrade the Metrics Server image to v0.4.1 and verify whether it is affected by the same problem.
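A hedged example of applying that suggestion; the Deployment and container are both assumed to be named metrics-server, and the image path is the upstream registry location for v0.4.1:
kubectl -n kube-system set image deployment/metrics-server metrics-server=k8s.gcr.io/metrics-server/metrics-server:v0.4.1
kubectl -n kube-system rollout status deployment/metrics-server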