kubernetes: AggregationController causing API lag

/kind bug

What happened:

After launching metrics-server, my Kube API became unresponsive, with a lot of the following in the apiserver logs:

Nov 27 16:37:06 ns3033879.ip-51-255-71.eu docker[27628]: I1127 15:37:06.669483       1 controller.go:105] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Nov 27 16:37:36 kubemaster docker[27628]: E1127 15:37:36.837123       1 controller.go:111] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: Error: 'dial tcp 10.12.185.191:443: i/o timeout'
Nov 27 16:37:36 kubemaster docker[27628]: Trying to reach: 'https://10.12.185.191:443/swagger.json', Header: map[]
Nov 27 16:37:36 kubemaster docker[27628]: I1127 15:37:36.837147       1 controller.go:119] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.

10.12.185.191 is the ClusterIP of the metrics-server service. I assume I made some error bringing metrics-server into the cluster (upgraded from previous versions), but why would it break the apiserver?

$ time kubectl --namespace kube-system get pod
...
real    8m1.773s
user    0m0.212s
sys     0m0.020s

I managed to solve the issue by running:

kubectl --namespace kube-system delete apiservice v1beta1.metrics.k8s.io
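
For reference, a rough way to confirm that it is this APIService the apiserver cannot reach before deleting it (the object and group names are taken from the log lines above; plain kubectl, no extra tooling assumed):

$ kubectl get apiservice v1beta1.metrics.k8s.io -o yaml   # status.conditions should show Available=False with a reachability message
$ kubectl get --raw /apis/metrics.k8s.io/v1beta1          # hangs or returns an error while the backing service is unreachable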

Environment:

  • Kubernetes version (use kubectl version): 1.8.4
  • Cloud provider or hardware configuration: baremetal

Most upvoted comments

Adding a note for people who find this issue via Google. Keep in mind that error messages with v1beta1.metrics.k8s.io can be misleading and can be due to multiple underlying causes. That is one reason why this ticket was allowed to become stale and closed. If you run into this error, do a quick sanity check and look for more routine causes of your cluster failure.

For example, just last week my entire cluster failed to come up after a power outage. The first error I found was the familiar error with v1beta1.metrics.k8s.io:

E1016 23:19:50.840814 1 available_controller.go:353] v1beta1.metrics.k8s.io failed with: Get https://10.43.123.100:443: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

The underlying cause was actually that the etcd members couldn’t reach each other due to a firewall issue:

root@docker01:~# docker logs --tail 100 -f etcd
2019-10-17 22:27:11.005468 W | rafthttp: health check for peer asdasdasdasd123 could not connect: dial tcp 192.168.100.100:2380: i/o timeout (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2019-10-17 22:27:11.005497 W | rafthttp: health check for peer 123123123123asd could not connect: dial tcp 192.168.100.101:2380: i/o timeout (prober "ROUND_TRIPPER_RAFT_MESSAGE")
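
In a case like that, a rough connectivity check from one etcd host towards its peers quickly confirms the firewall problem (the IPs and peer port 2380 are taken from the log above; this assumes nc is installed on the host):

root@docker01:~# nc -zvw3 192.168.100.100 2380
root@docker01:~# nc -zvw3 192.168.100.101 2380
# "succeeded" means the peer port is reachable; a timeout here matches the rafthttp errors above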

So, as you read the responses in this thread, just be aware that errors with v1beta1.metrics.k8s.io may actually be caused by something else.

I’ve had the same issue with the metrics server slowing down API accesses until I deleted the API service. For what it’s worth, the ultimate cause was that I forgot to include --enable-aggregator-routing=true in the kube-apiserver manifest.
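
For anyone checking their own setup, a minimal sketch of verifying the flag on a static-pod control plane (the manifest path is a kubeadm-style assumption; other installers such as RKE keep the apiserver config elsewhere):

# hypothetical path; adjust to wherever your installer keeps the kube-apiserver manifest
grep -e '--enable-aggregator-routing' /etc/kubernetes/manifests/kube-apiserver.yaml \
  || echo "missing: add --enable-aggregator-routing=true to the kube-apiserver command"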

/reopen This is still an issue in v1.13.3. To reproduce, remove the metrics-server deployment but not the apiservice. Here are the full steps:

  1. Apply metrics-server (k8s.gcr.io/metrics-server-amd64:v0.3.1)
  2. Create a namespace
  3. Delete the metrics server deployment only.
  4. Try to delete the namespace. It stays in 'Terminating' state forever (see the sketch after this list).
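
A likely explanation for why step 4 hangs, with a rough way to check it: once the deployment is gone, the v1beta1.metrics.k8s.io APIService has no backend, so the namespace controller cannot finish discovering every resource type and the namespace never leaves 'Terminating'. The stale entry shows up as unavailable:

$ kubectl get apiservices | grep False                # the orphaned v1beta1.metrics.k8s.io entry reports itself as unavailable
$ kubectl delete apiservice v1beta1.metrics.k8s.io    # removing the stale entry lets the namespace finish terminating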

How many replicas of the metrics apiserver are you running? Maybe it just can’t keep up with the load? If we can reproduce this with the sample apiserver, that would be useful; we could then fix it and verify via a test that a struggling aggregated apiserver doesn’t interfere with the rest of the control plane.

After enabling the --enable-aggregator-routing=true flag I still continue to see the same behavior. I also notice that this seems to be more prevalent when we scale: our prod infrastructure, which is much larger, suffers, but I don’t notice any issues in our dev and testing environments.

Also, I tracked my kubectl commands in the apiserver log, and the apiserver reports the calls finishing in a few hundred milliseconds, while the client reports up to 8 seconds. Hope this helps.
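
A rough way to see where that client-side time goes, using only kubectl's standard verbosity flag: at -v=6 and above kubectl logs each HTTP call with its latency, so the discovery/openapi requests that go through the unreachable aggregated API should stand out against the fast call for the pods themselves.

$ kubectl --namespace kube-system get pod -v=6 2>&1 | grep -i 'milliseconds'
# the GET for the pods returns in a few hundred ms, matching the apiserver log;
# the slow lines are the discovery calls blocked on the broken apiservice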

/remove-lifecycle stale

Still an issue with k8s 1.11.1; this ate two of my days, with kubectl requests taking minutes. It seems to have started after changing /etc/resolv.conf (which left the node unable to resolve DNS for a while).

The hotfix mentioned above worked: kubectl --namespace kube-system delete apiservice v1beta1.metrics.k8s.io

Environment:

  • Kubernetes version: v1.11.1 (server)
  • Cloud provider or hardware configuration: baremetal (Ubuntu bionic)
  • Installer: RKE

I suspected that, but what should be hitting it? I had no HPAs enabled. I did scale up to 2 replicas just to see, but it had no effect.