kubernetes: AggregationController causing API lag
/kind bug
What happened:
After launching metrics-server, my Kube API became unresponsive, with a lot of the following in the apiserver logs:
Nov 27 16:37:06 ns3033879.ip-51-255-71.eu docker[27628]: I1127 15:37:06.669483 1 controller.go:105] OpenAPI AggregationController: Processing item v1beta1.metrics.k8s.io
Nov 27 16:37:36 kubemaster docker[27628]: E1127 15:37:36.837123 1 controller.go:111] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: Error: 'dial tcp 10.12.185.191:443: i/o timeout'
Nov 27 16:37:36 kubemaster docker[27628]: Trying to reach: 'https://10.12.185.191:443/swagger.json', Header: map[]
Nov 27 16:37:36 kubemaster docker[27628]: I1127 15:37:36.837147 1 controller.go:119] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
10.12.185.191 is the ClusterIP of the metrics-server service. I assume I made some error bringing metrics-server into the cluster (upgraded from previous versions), but why would it break the apiserver?
$ time kubectl --namespace kube-system get pod
...
real 8m1.773s
user 0m0.212s
sys 0m0.020s
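For anyone debugging a similar hang, a few diagnostics that narrow down whether the aggregated API registration is to blame; this is a sketch, assuming metrics-server runs in kube-system under its default Service name:

# Is the aggregated API registered, and what condition does the apiserver report for it?
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml

# Does the Service behind it have endpoints at all? (names assumed: metrics-server in kube-system)
kubectl --namespace kube-system get service metrics-server
kubectl --namespace kube-system get endpoints metrics-server

# Can the apiserver proxy a request through to it?
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes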
I managed to solve the issue by running:
kubectl --namespace kube-system delete apiservice v1beta1.metrics.k8s.io
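If you apply the same workaround, a quick way to confirm the control plane has recovered; redeploying metrics-server later (its stock manifests typically include the APIService object) should re-register the API:

# Responsiveness should return as soon as the broken registration is gone
time kubectl --namespace kube-system get pod

# List all aggregated APIs; the v1beta1.metrics.k8s.io entry should only come back
# once metrics-server is redeployed and reachable
kubectl get apiservices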
Environment:
- Kubernetes version (use kubectl version): 1.8.4
- Cloud provider or hardware configuration: baremetal
About this issue
- State: closed
- Created 7 years ago
- Reactions: 4
- Comments: 37 (18 by maintainers)
Adding a note for people who find this issue via Google. Keep in mind that error messages with v1beta1.metrics.k8s.io can be misleading and can be due to multiple underlying causes. That is one reason why this ticket was allowed to become stale and closed. If you run into this error, do a quick sanity check and look for more routine causes of your cluster failure.
For example, just last week my entire cluster failed to come up after a power outage. The first error I found was the familiar error with v1beta1.metrics.k8s.io:
E1016 23:19:50.840814 1 available_controller.go:353] v1beta1.metrics.k8s.io failed with: Get https://10.43.123.100:443: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
The underlying cause was actually that the etcd members couldn't reach each other due to a firewall issue. So, as you read the responses in this thread, just be aware that errors with v1beta1.metrics.k8s.io may actually be caused by something else.
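As an aside, for anyone chasing the etcd angle: a quick health check, sketched with the kubeadm default certificate paths, which are an assumption (other installers keep them elsewhere):

# Run on a control-plane node with etcdctl v3 available
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
# 'member list' with the same flags shows the peer URLs each member expects to reach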
I’ve had the same issue with the metrics server slowing down API accesses until I deleted the API service. For what it’s worth, the ultimate cause was that I forgot to include --enable-aggregator-routing=true in the kube-apiserver manifest.
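For reference, a sketch of checking for that flag on a static-pod control plane; the manifest path below is the common kubeadm location and is an assumption:

# Is the flag already present in the apiserver manifest?
grep -n 'enable-aggregator-routing' /etc/kubernetes/manifests/kube-apiserver.yaml

# If not, add it to the kube-apiserver command section next to the other flags:
#   - --enable-aggregator-routing=true
# The kubelet picks up the manifest change and restarts the static pod on its own.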
/reopen This is still an issue on v1.13.3. To reproduce it, remove the metrics-server deployment but leave the apiservice in place. Here are the full steps:
How many replicas of the metrics apiserver are you running? Maybe it just can’t keep up with the load? If we can reproduce this with the sample apiserver, that might be good; we could then fix and verify via test that an aggregated apiserver that’s struggling shouldn’t interfere with the rest of the control plane.
On Tue, Jun 5, 2018 at 1:32 PM John Delivuk notifications@github.com wrote:
After enabling the --enable-aggregator-routing=true flag I still see the same behavior. I also notice that this seems to be prevalent when we scale: our prod infrastructure, which is much larger, suffers, but I don’t notice any issues in our dev and testing environments. Also, I tracked my kubectl commands in the apiserver log, and the apiserver reports the calls finishing in a few hundred milliseconds, while the client reports up to 8 seconds. Hope this helps.
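One way to capture that client-side view, as a sketch: kubectl’s verbose logging prints the round-trip time it observed for each request, which can be compared against the durations in the apiserver log.

# Verbosity 6 logs every HTTP call with the latency seen by the client
kubectl --namespace kube-system get pod -v=6 2>&1 | grep -i milliseconds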
/remove-lifecycle stale
Still an issue with k8s 1.11.1 - this ate two of my days, with kubectl requests taking minutes. It seems to have started after changing /etc/resolv.conf (which left the node unable to resolve DNS for a while).
The hotfix mentioned above worked:
kubectl --namespace kube-system delete apiservice v1beta1.metrics.k8s.io
Environment:
- Kubernetes version: v1.11.1 (server)
- Cloud provider or hardware configuration: baremetal (Ubuntu bionic)
- Installer: RKE
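Since a temporarily broken resolver was the trigger here, a quick sanity check on the node; the lookup target is just an arbitrary well-known hostname:

# Confirm the node's resolver configuration and that lookups work at all
cat /etc/resolv.conf
getent hosts k8s.gcr.io || echo 'DNS resolution from this node is failing'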
I suspected that, but what should be hitting it? I had no HPAs enabled. I did scale up to 2 replicas just to see, but no effect.
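For anyone trying to rule out load on the metrics API, a sketch of the usual suspects; the deployment name is assumed to be metrics-server:

# HorizontalPodAutoscalers are the main consumers of metrics.k8s.io
kubectl get hpa --all-namespaces

# kubectl top also goes through the aggregated metrics API
kubectl top nodes

# Scaling the backend, as tried above
kubectl --namespace kube-system scale deployment metrics-server --replicas=2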