kubernetes: [Flaking Test] metrics-server not starting in BeforeSuite (ci-kubernetes-e2e-ubuntu-gce)

Which jobs are flaking:

ci-kubernetes-e2e-ubuntu-gce (gce-ubuntu-master-default)

Which test(s) are flaking:

  • Kubernetes e2e suite: BeforeSuite
  • (There are other flakes in this job, but they fail less often and affect a different test each time, so let’s scope this issue to BeforeSuite only.)

Testgrid link:

https://testgrid.k8s.io/sig-release-master-informing#gce-ubuntu-master-default&width=20

Reason for failure:

metrics-server is not starting:

_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/e2e.go:76
Jul 15 04:34:25.942: Error waiting for all pods to be running and ready: 1 / 31 pods in namespace "kube-system" are NOT in RUNNING and READY state in 10m0s
POD                                    NODE                            PHASE   GRACE CONDITIONS
metrics-server-v0.4.4-6c6b749986-v4wv9 bootstrap-e2e-minion-group-vdwd Running       [{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-15 04:22:49 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-15 04:29:57 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [metrics-server]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-15 04:29:57 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [metrics-server]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-07-15 04:22:49 +0000 UTC Reason: Message:}]

_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/e2e.go:79
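
For anyone digging further while the test cluster is still up, a quick way to see why the pod is not Ready is to check its events and the previous container’s logs (pod name copied from the output above; it changes every run):

kubectl -n kube-system describe pod metrics-server-v0.4.4-6c6b749986-v4wv9
kubectl -n kube-system logs metrics-server-v0.4.4-6c6b749986-v4wv9 --previous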

Anything else we need to know:

Spyglass: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce/1415525375479386112
Triage: https://storage.googleapis.com/k8s-gubernator/triage/index.html?job=ci-kubernetes-e2e-ubuntu-gce

Mentioned in https://github.com/kubernetes/kubernetes/issues/102101#issuecomment-879622365

/cc @aojea

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

As far as I can tell, metrics-server is never starting properly, even when the jobs are successful.

Take, for example, this job, which succeeded: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce/1411948153548050432

From the serial logs, we can see that the metrics-server pod is crash-looping: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce/1411948153548050432/artifacts/bootstrap-e2e-minion-group-cwh0/serial-1.log

But from the build logs, we can see that at some point the pod was deemed ready: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce/1411948153548050432/build-log.txt

I0705 07:31:57.450] Jul  5 07:31:54.952: INFO: The status of Pod metrics-server-v0.4.4-6fccfbd69f-pxwp8 is Running (Ready = false), waiting for it to be either Running (with Ready = true) or Failed
I0705 07:31:57.450] Jul  5 07:31:54.952: INFO: 30 / 31 pods in namespace 'kube-system' are running and ready (172 seconds elapsed)
I0705 07:31:57.451] Jul  5 07:31:54.952: INFO: expected 5 pod replicas in namespace 'kube-system', 4 are Running and Ready.
I0705 07:31:57.451] Jul  5 07:31:54.952: INFO: POD                                     NODE                             PHASE    GRACE  CONDITIONS
I0705 07:31:57.452] Jul  5 07:31:54.952: INFO: metrics-server-v0.4.4-6fccfbd69f-pxwp8  bootstrap-e2e-minion-group-cwh0  Running         [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-07-05 07:27:52 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2021-07-05 07:30:24 +0000 UTC ContainersNotReady containers with unready status: [metrics-server]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2021-07-05 07:30:24 +0000 UTC ContainersNotReady containers with unready status: [metrics-server]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2021-07-05 07:27:52 +0000 UTC  }]
I0705 07:31:57.452] Jul  5 07:31:54.952: INFO: 
I0705 07:31:57.453] Jul  5 07:31:56.991: INFO: 31 / 31 pods in namespace 'kube-system' are running and ready (174 seconds elapsed)
I0705 07:31:57.453] Jul  5 07:31:56.991: INFO: expected 5 pod replicas in namespace 'kube-system', 5 are Running and Ready.

To me it seems there are two issues: one is that metrics-server doesn’t have the capabilities it needs to run, and the other, which still needs to be investigated, is that metrics-server is sometimes marked Ready even though it is crash-looping.
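
A rough way to observe the second issue on a live cluster (assuming the pods carry the usual k8s-app=metrics-server label; that label is an assumption, not taken from the logs) is to watch readiness and restart counts together; if RESTARTS keeps climbing while READY flips to true, the readiness signal is suspect:

kubectl -n kube-system get pods -l k8s-app=metrics-server -w \
  -o custom-columns=NAME:.metadata.name,READY:.status.containerStatuses[0].ready,RESTARTS:.status.containerStatuses[0].restartCount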

The metrics-server fails to bind ~because there is another process listening~ on port 443: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce/1415525375479386112/build-log.txt

I0715 04:34:26.680] Latency metrics for node bootstrap-e2e-minion-group-vdwd
I0715 04:34:26.680] Jul 15 04:34:25.555: INFO: Running kubectl logs on non-ready containers in kube-system
I0715 04:34:26.680] Jul 15 04:34:25.745: INFO: Logs of kube-system/metrics-server-v0.4.4-6c6b749986-v4wv9:metrics-server on node bootstrap-e2e-minion-group-vdwd
I0715 04:34:26.680] Jul 15 04:34:25.745: INFO:  : STARTLOG
I0715 04:34:26.681] Flag --deprecated-kubelet-completely-insecure has been deprecated, This is rarely the right option, since it leaves kubelet communication completely insecure.  If you encounter auth errors, make sure you've enabled token webhook auth on the Kubelet, and if you're in a test cluster with self-signed Kubelet certificates, consider using kubelet-insecure-tls instead.
I0715 04:34:26.681] Error: failed to create listener: failed to listen on 0.0.0.0:443: listen tcp 0.0.0.0:443: bind: permission deni

https://github.com/kubernetes/kubernetes/issues/102101#issuecomment-879622365
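
A quick way to confirm whether the container is actually missing the capability (the deployment name here is inferred from the pod name in the logs, so treat it as an assumption) would be to dump its securityContext:

kubectl -n kube-system get deployment metrics-server-v0.4.4 \
  -o jsonpath='{.spec.template.spec.containers[0].securityContext}'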

@cheftako are some of the konnectivity pods listening on port 443?

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-ubuntu-gce/1415525375479386112/artifacts/bootstrap-e2e-master/konnectivity-server.log

E0715 04:23:07.629945       1 server.go:697] "DIAL_RSP contains failure" err="dial tcp 10.64.1.5:443: connect: connection refused"
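
One way to check (just a sketch; it assumes SSH access to the node and that the agent pods carry a k8s-app=konnectivity-agent label, which I haven’t verified) whether anything else is holding :443:

# on the node, list listeners on :443
sudo ss -ltnp 'sport = :443'
# from the cluster, see where the konnectivity agents run
kubectl -n kube-system get pods -l k8s-app=konnectivity-agent -o wide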

Yeah, maybe you can refer to this: the release notes for metrics-server v0.4.4. I think they cover this problem.

https://github.com/kubernetes-sigs/metrics-server/releases

Installation
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.4.4/components.yaml
WARNING - To allow binding privileged ports image now requires NET_BIND_SERVICE capability. If you are using a security context that has all capabilities dropped, such as from the original stable Helm chart, you will need to use a less restrictive policy.
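
If the GCE addon manifest drops all capabilities, a minimal sketch of the fix shape would be to add NET_BIND_SERVICE back (deployment and container names are assumed from the pod names above; on a real cluster the addon manager may simply revert a manual patch, so the proper fix belongs in the addon manifest itself):

kubectl -n kube-system patch deployment metrics-server-v0.4.4 \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"metrics-server","securityContext":{"capabilities":{"add":["NET_BIND_SERVICE"]}}}]}}}}'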

Currently we are using metrics-server 0.4.4, and the latest release, 0.5.0, is out. Are there any known bugs that may be related to this test failure?

I don’t think so. /cc @dgrisonnet @yangjunmyfm192085