istio: Connection problems with large deployments (1200+ pods)
Updated Issue (12/30/18)
(Issue has been updated to reflect the current understanding of this problem, which is much improved since it was originally opened. See 10/1/18 edit in history for original text.)
**Describe the bug**
Deployments become unstable as they grow in size with sidecars deployed. This manifests as one of the following three errors:
```
Readiness probe failed: HTTP probe failed with statuscode: 503
Liveness probe failed: HTTP probe failed with statuscode: 503
Liveness probe failed: Get http://10.41.128.12:9080/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
```
This appears to only happen on very large Deployments (1200+ pods), regardless of the instance count or size available in the cluster. I have tried 15x m4.large, 20x m4.xlarge, 20x r4.xlarge, etc.; the threshold for failure does not change. The other key point here is that the issue does not occur if there is no k8s `Service` on the deployment.
**Expected behavior**
Large Deployments do not have problems handling requests.
**Steps to reproduce the bug**
- Have a cluster that should easily handle 1500 pods (15-20 m4.xlarge is fine)
- Install Istio with the following `values.yaml`:

```yaml
global:
  proxy:
    # Disable Mixer logging every request
    accessLogFile: ""
mixer:
  replicaCount: 3
  autoscaleMin: 3
  autoscaleMax: 10
pilot:
  replicaCount: 3
  autoscaleMin: 3
  autoscaleMax: 10
```
- Label the `default` namespace with `istio-injection=enabled` (e.g. `kubectl label namespace default istio-injection=enabled`)
- Create a simple http-listener deployment/service; I’ll use bookinfo to keep it simple. This should have 100 replicas, liveness/readiness probes, and resource requests set to 0 to make things simple to test:
```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: productpage-v1
spec:
  replicas: 100
  template:
    metadata:
      labels:
        app: productpage
        version: v1
    spec:
      containers:
      - name: productpage
        image: istio/examples-bookinfo-productpage-v1:1.8.0
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            cpu: 0
            memory: 0
        ports:
        - containerPort: 9080
        livenessProbe: &probe
          initialDelaySeconds: 10
          httpGet:
            path: /
            port: 9080
        readinessProbe:
          <<: *probe
---
apiVersion: v1
kind: Service
metadata:
  name: productpage-v1
  labels:
    app: productpage-v1
spec:
  ports:
  - port: 9080
    name: http
  selector:
    app: productpage
    version: v1
```
- Apply the manifest to create the deployment and service
- Wait for the pods to become `Running` and pass health checks
- Increase the deployment's replicas by 100 and apply the change; repeat until 1200-1500 pods, where you will see health checks start to fail on both old and new pods.
I want to go ahead and note here that separating the deployments into 100-200 pod chunks (I went to 9 deployments of 200 each for 1800 pods) seems to address the issue.
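A sketch of what that split might look like, reusing the pod template from the manifest above (the `-a` suffix is illustrative):

```yaml
# Illustrative: the same pod template spread across several smaller
# Deployments instead of one large one (repeat as -b, -c, ... per chunk).
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: productpage-v1-a
spec:
  replicas: 200
  template:
    metadata:
      labels:
        app: productpage
        version: v1
    # ...same container spec as the manifest above...
```

Because every chunk keeps the same labels, the single `productpage-v1` Service still selects all of the pods.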
**Version**
Istio: 1.0.5, Kubernetes: 1.10.11
**Environment**
- New kops-built clusters to validate this issue on AWS (no other traffic) with Weave overlay
- Various instance sizes up to 20x r4.xlarge
- mTLS disabled
- All pods have `httpGet` liveness/readiness probes
**Other Notes**
- Liveness/Readiness probes are a way to see the problem reliably, but I imagine it would be similar for end-users hitting a service
- Removing Istio addresses stability problems
- Removing the `Service` from the deployments addresses stability problems
- Running 18 separate deployments of 100 pods each addresses stability problems (though I didn’t test a shared service on all of them)
- Once a cluster gets into a busted state from a large deployment, scaling down won’t fix it until I restart the masters
- Toying with the mixer HPA numbers does seem to alleviate the issue – perhaps a larger deployment is heavier on mixer? With all the moving parts, I haven’t nailed down a good test here… just speculation.
**About this issue**
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 6
- Comments: 29 (27 by maintainers)
We have experienced a similar issue on our staging environment recently with Istio 1.0.1. We have a fairly large cluster of around 1000 pods and are not running mTLS. A couple of weeks ago we enabled Istio auto injection in the cluster, so as pod deploys occurred the mesh organically grew from a handful of pods with sidecars to nearly all of the pods in the cluster. Everything was running fine until a couple of days ago, when suddenly any new pod deploy would fail its liveness probe with a connection refused, e.g.
```
dial tcp 100.64.10.42:8080: getsockopt: connection refused
```
Once this issue had started to occur we observed the following:

Action taken: we disabled auto injection (`autoInject: disabled`) and recreated the affected pods without Istio sidecars.

On the face of it this seems like some sort of capacity issue, as everything was running smoothly whilst ramping up the number of pods with Istio sidecars until a certain level, at which point everything started to fail. Indeed, now that we are back to the point where only the handful of pods that opt in for an Istio sidecar are injected, they are all running fine with none of these issues.
Last tests from Mandar seem to confirm this is no longer a problem, as long as isolation (the `Sidecar` CRD) is used.
He means the new `Sidecar` resource and the accompanying `export_to` fields, which let you restrict what config gets sent to each set of sidecars (that link is the mesh-wide default; similar fields exist on each resource: VirtualService, DestinationRule, ServiceEntry). They’re being added in 1.1 and are available in the pre-release. See the docs (I’ll check and see why the `export_to` fields are still hidden), but to help your scale problem:

1. Write a `Sidecar` for each namespace in your deployment. In each namespace, include the services (or whole namespaces) you need to call out to in the `egress` clause. `Sidecar`s are opt-in per namespace, so just setting it up for the few large apps may be enough. However, if large sets of services need to communicate there may still be issues.
2. After creating `Sidecar`s, set the `defaultServiceExportTo` field of `MeshConfig` to `"."` to make services namespace-private by default. Then they’ll only be visible to sidecars configured to import them.

Order is important here: sidecars will not be able to route traffic to services that are not visible, so setting the default before creating `Sidecar`s for each namespace will break cross-namespace traffic in apps with Istio sidecars deployed.

@costinm Can you clarify what “sidecar is in” means here? From context, this sounds really promising though!
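For concreteness, a minimal sketch of the per-namespace isolation described above, assuming the 1.1 pre-release schema (the namespace and host entries are illustrative):

```yaml
# Illustrative only: limit the config pushed to sidecars in the "default"
# namespace to services in their own namespace plus the control plane.
apiVersion: networking.istio.io/v1alpha3
kind: Sidecar
metadata:
  name: default
  namespace: default
spec:
  egress:
  - hosts:
    - "./*"            # services in this namespace
    - "istio-system/*" # Istio control-plane services
```

With a `Sidecar` like this in each namespace, the mesh-wide default can then be flipped in `MeshConfig` (`defaultServiceExportTo: ["."]`) to make services namespace-private unless explicitly exported.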
@jaygorrell We’ve been refocused on other projects so we haven’t revisited this in anger. I was initially thinking that we would hold out for the 1.1.0 release, however it looks like 1.0.5 is worth trying. I’ll be sure to update you if we give it a go.