istio: K8S Dashboard loads slowly, tiller unresponsive once istio 1.0.0 installed
Describe the bug
On a new Azure Container Service (AKS) cluster running 1.10.6, as soon as I install istio 1.0.0 (I’ve tried the official release and the daily istio-release-1.0-20180803-09-15), requests in the K8S dashboard take 5-10 seconds or time out completely. Additionally, commands to Tiller time out retrieving ConfigMaps.
All kubectl commands I can think to run succeed and run quickly. Installing istio 0.8 does not have this issue.
Expected behavior
No negative impact to other services when installing istio.
Steps to reproduce the bug
- Create new AKS cluster.
- Install istio… I used the following helm command (and the corresponding kubectl apply):

  ```shell
  helm template install/kubernetes/helm/istio --name istio \
    --set servicegraph.enabled=true \
    --set grafana.enabled=true \
    --set tracing.enabled=true \
    --set galley.enabled=false \
    --set telemetry-gateway.grafanaEnabled=true \
    --set telemetry-gateway.prometheusEnabled=true \
    --namespace istio-system
  ```
- Wait a few minutes for the various pods to start up.
- Run kubectl proxy (or az aks browse) and try to navigate in the dashboard. Or run `helm ls`.
Version
Istio: release-1.0-20180803-09-15
K8S: 1.10.6
Is Istio Auth enabled or not? No
Environment: Azure AKS
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 2
- Comments: 57 (23 by maintainers)
Thanks @BernhardRode - that helps possibly eliminate OOM problems. I’m on PTO until the 10th; I just thought I’d offer some quick help here, but I don’t have time at this immediate moment to spin up AKS. Once PTO finishes up, I’ll have time.
Sounds like a common problem people are suffering with.
Related - https://github.com/Azure/AKS/issues/620
@rsnj It appears, at the moment, that on AKS you have to choose between policy or telemetry. If you aren’t enforcing any policies in the Mixer layer (rate limits, whitelists, etc.), then I would recommend prioritizing telemetry (but that’s the part of the system I spend the most time on, so I may be slightly biased). Istio RBAC currently does not require Mixer, so you’ll still have some functionality policy-wise.
To be successful without `istio-policy` running, you’ll need to turn off check calls (otherwise you’ll get connectivity issues as requests are denied because the proxy cannot reach the policy service). To do that, you need to install Istio with `global.disablePolicyChecks` set to `true`. I haven’t spent much time trying this out, but I know that others have done this, so if this is of interest, I’m sure we can get this working. Istio is working on documentation for piecemeal installs. This would be a good test case.
In the slightly longer term, Mixer should reduce the number of CRDs down to 3, which should help reduce the burden on the API Server. Sometime after that, Mixer will receive config directly from Galley, reducing the burden even further.
Does that help?
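For concreteness, `global.disablePolicyChecks` is a Helm value, so a policy-check-free render of the 1.0 chart would look roughly like this (a sketch; other `--set` flags from the original install command omitted):

```shell
# Render the chart with Mixer check calls disabled so sidecars don't
# block requests waiting on an unreachable istio-policy service.
helm template install/kubernetes/helm/istio --name istio \
  --namespace istio-system \
  --set global.disablePolicyChecks=true | kubectl apply -f -
```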
@fhoy @douglas-reid I just updated my existing PR for the helm chart to include the switch for `useAdapterCRDs`: https://github.com/istio/istio/pull/9435/files

@douglas-reid @rsnj I appreciate the heads up. I’m investigating and will report back.
@rsnj this seems like an issue with the resources given to the API Server. I’d suggest trying the experiment in reverse (delete both, then add back `istio-telemetry` and then, after testing, add back `istio-policy`).
Mixer (which backs both `istio-policy` and `istio-telemetry`) opens a fair number of watches (~40) on CRDs and otherwise. I suspect that the API Server in these clusters is just not set up to handle this.
If Azure Support has any information on how to increase resources for the API Server, that’d be the best way to resolve the issue. Maybe @lachie83 has some ideas (or contacts that do) here?
@douglas-reid I’m using a brand new AKS cluster running Kubernetes 1.11.2 that has no load on it. I installed Istio via helm using the default settings. I then deployed a simple service and connected it to a gateway. After the services deployed, the entire system went into deadlock: istio-policy and istio-telemetry started using more and more CPU until they autoscaled, and the second replica just went into CrashLoopBackOff. My service was never accessible.
Looking at the logs, I can see my services deployed, and then it’s just a steady stream of the same error coming from istio-mixer and istio-pilot.
There are thousands of errors just like these:
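That error stream can be captured directly from the Mixer and Pilot pods; a sketch, assuming the label selectors and container names used by the stock 1.0 chart:

```shell
# Recent logs from Mixer (backs istio-policy and istio-telemetry)
kubectl -n istio-system logs -l istio=mixer -c mixer --tail=100
# Recent logs from Pilot's discovery container
kubectl -n istio-system logs -l istio=pilot -c discovery --tail=100
```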
To report back… I deployed a new cluster with `useAdapterCRDs=false` and it’s been running for ~10 days or so now without a recurrence of the watch issue slowing helm/dashboard.

Good work! It’d be great if we could get the helm option @douglas-reid mentioned so my scripts can stop hacking the subchart in 1.1 releases. Can’t find the PR mentioned, though.
Just an update: I left the cluster running for about 8 hours without `istio-telemetry` running. The `istio-policy` pod autoscaled to 5 instances that were all in a CrashLoopBackOff state, and my entire cluster went down again. The cluster has zero load on it and only has a simple web service running without any external dependencies.

@douglas-reid I will try your suggestion next: enable telemetry and disable policy.
@rsnj Wow: `Failed to list *v1.Pod: the server was unable to return a response in the time allotted, but may still be processing the request (get pods)`. That is not good. I wonder why everything with the API Server is so slow.
Ran the same daily (`release-1.0-20180822-09-15`) overnight on AKS (Istio installed via Helm with no options) and I also put in a couple of test services. There is no load on the cluster; no one is using it. As @rsnj reported, telemetry and policy are having a bad time.

I was using `istio-release-1.0-20180820-09-15` and Galley was crashing, so the problem seemed to move around (see #7586).

I also installed the latest daily `istio-release-1.0-20180822-09-15` build on my AKS cluster. Everything was running smoothly for a bit, so I deployed a simple application with a gateway configuration, and then I noticed the `istio-telemetry` and `istio-policy` pods using a lot of CPU. When they autoscaled to 2 replicas, the new replicas went into a CrashLoopBackOff state with the error:
`Liveness probe failed: Get http://10.200.0.90:9093/version: dial tcp 10.200.0.90:9093: connect: connection refused`
Looking at my logs there are a lot of these errors:
`Failed to list *v1beta2.ReplicaSet: the server was unable to return a response in the time allotted, but may still be processing the request (get replicasets.apps)`
and these:
`gc 233 @3240.155s 0%: 0.044+5.9+7.6 ms clock, 0.089+0.24/2.8/8.2+15 ms cpu, 15->15->7 MB, 16 MB goal, 2 P`
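When the Mixer pods thrash like this, a few standard commands narrow down whether it is HPA flapping, CPU saturation, or liveness failures; a sketch, assuming the default `istio-system` install and stock pod labels:

```shell
# Pod status and restart counts for the Mixer deployments
kubectl -n istio-system get pods -l istio=mixer
# HPA state: are istio-policy/istio-telemetry being scaled up?
kubectl -n istio-system get hpa
# CPU/memory per pod (requires metrics-server or heapster on 1.10/1.11)
kubectl -n istio-system top pods
# Last termination state of the crashing policy replicas
kubectl -n istio-system describe pods -l istio-mixer-type=policy | grep -A3 "Last State"
```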
My Kubernetes Dashboard is still unresponsive, but Helm is working. Microsoft has been responsive to me through the Azure support channel, but they are out of ways to troubleshoot the issue.
@CapTaek radio silence from AKS-Help so far.
Installed the latest daily on a new cluster (with galley enabled this time). Nothing crashing/restarting, but still the same problems with K8S Dashboard performance with Istio installed; no issues with it removed.
Just got this from Azure Support:
I just tried to reconnect to the cluster and the issue is still there 😦
istio-galley is crashing all the time.
Pods
I ran your commands on bare metal. Note I don’t immediately have access to AKS. I suspect you are in an OOM situation where the kernel continually kills processes and Kubernetes continually restarts them (hence the `helm version`/`helm ls` lag, and dashboard lag). This is hard to detect, but it can be seen with `kubectl describe` on a restarted pod (grep for OOM).
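The OOM check described above can be done with `kubectl describe`, or by querying each container’s last termination reason directly; a sketch:

```shell
# Grep restarted pods for OOMKilled terminations
kubectl -n istio-system describe pods | grep -i -B3 "oomkilled"
# Or list each container's last termination reason via jsonpath
kubectl -n istio-system get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
```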
Also, a namespace was not created for istio-system above. Are you executing an upgrade, or a fresh install? I suspect an upgrade will require more memory. Please reference the documentation for installation instructions here:
https://istio.io/docs/setup/kubernetes/helm-install/#option-1-install-with-helm-via-helm-template
and for Azure platform setup here:
https://istio.io/docs/setup/kubernetes/platform-setup/azure/
Note I have not personally validated the Azure platform setup instructions.
You can see from my AIO workflow below that a very bare-bones Ubuntu 16.04.4 bare metal system requires 13 GB of RAM for Kubernetes + Istio. Reading the Azure documentation on istio.io, you might try increasing the node count beyond 3 nodes to give the cluster more memory to work with. It also took around 6 minutes to deploy Kubernetes and Istio on my bare metal system (which is a beast of a server). You mentioned you waited a few minutes; this may not be sufficient for Istio to initialize.