istio: Connection problems with large deployments (1200+ pods)

Updated Issue (12/30/18)

(Issue has been updated to reflect the current understanding of this problem, which is much improved since it was originally opened. See 10/1/18 edit in history for original text.)

Describe the bug

Deployments become unstable as they grow in size with sidecars deployed. This manifests as one of the following three errors:

Readiness probe failed: HTTP probe failed with statuscode: 503
Liveness probe failed: HTTP probe failed with statuscode: 503
Liveness probe failed: Get http://10.41.128.12:9080/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

This appears to only happen on very large Deployments (1200+ pods), regardless of the instance count or size available in the cluster. I have tried 15x m4.large, 20x m4.xlarge, 20x r4.xlarge, etc.; the failure threshold does not change. The other key point here is that the issue does not occur if there is no k8s Service in front of the Deployment.

Expected behavior

Large Deployments do not have problems handling requests.

Steps to reproduce the bug

  1. Have a cluster that should easily handle 1500 pods (15-20 m4.xlarge is fine)
  2. Install istio with the following values.yaml:
global:
  proxy:
    # Disable Mixer logging every request
    accessLogFile: ""
mixer:
  replicaCount: 3
  autoscaleMin: 3
  autoscaleMax: 10
pilot:
  replicaCount: 3
  autoscaleMin: 3
  autoscaleMax: 10
  3. Label the default namespace with istio-injection=enabled
  4. Create a simple http-listener Deployment/Service; I’ll use bookinfo to keep it simple. This should have 100 replicas, liveness/readiness probes, and resource requests set to 0 to make things simple to test:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: productpage-v1
spec:
  replicas: 100
  template:
    metadata:
      labels:
        app: productpage
        version: v1
    spec:
      containers:
      - name: productpage
        image: istio/examples-bookinfo-productpage-v1:1.8.0
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            cpu: 0
            memory: 0
        ports:
        - containerPort: 9080
        livenessProbe: &probe
          initialDelaySeconds: 10
          httpGet:
            path: /
            port: 9080
        readinessProbe:
          <<: *probe
---
apiVersion: v1
kind: Service
metadata:
  name: productpage-v1
  labels:
    app: productpage-v1
spec:
  ports:
  - port: 9080
    name: http
  selector:
    app: productpage
    version: v1
  5. Apply the manifest to create the Deployment and Service
  6. Wait for the pods to become Running and pass their health checks
  7. Increase the replica count by 100 and apply the change; repeat until 1200-1500 replicas, where you will see health checks start to fail on both old and new pods.

I want to go ahead and note here that separating the pods into smaller Deployments of 100-200 pods each (I went to 9 Deployments of 200 pods each, for 1800 total) seems to address the issue.
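In manifest form, that workaround replaces the single large Deployment with several smaller ones that still share the app labels. The names and the extra `chunk` label below are illustrative, not from my actual test:

```yaml
# One of nine 200-replica chunks (productpage-chunk-1 ... productpage-chunk-9).
# Each chunk needs a label of its own (here `chunk`) so the Deployments don't
# select each other's pods, while the shared app/version labels would let a
# single Service target all of them.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: productpage-chunk-1
spec:
  replicas: 200
  selector:
    matchLabels:
      app: productpage
      version: v1
      chunk: "1"
  template:
    metadata:
      labels:
        app: productpage
        version: v1
        chunk: "1"
    spec:
      containers:
      - name: productpage
        image: istio/examples-bookinfo-productpage-v1:1.8.0
        ports:
        - containerPort: 9080
```

As noted above, I did not verify a single shared Service selecting all chunks at once.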

Version

  • Istio: 1.0.5
  • Kubernetes: 1.10.11

Environment

  • New kops-built clusters to validate this issue on AWS (no other traffic) with Weave overlay
  • Various instance sizes up to 20x r4.xlarge
  • mTLS disabled
  • All pods have httpGet liveness/readiness probes

Other Notes

  • Liveness/Readiness probes are a way to see the problem reliably, but I imagine it would be similar for end-users hitting a service
  • Removing Istio addresses stability problems
  • Removing the Service from the deployments addresses stability problems
  • Running 18 separate deployments of 100 pods each addresses stability problems (though I didn’t test a shared service on all of them)
  • Once a cluster gets into a busted state from a large deployment, scaling down won’t fix it until I restart the masters
  • Toying with the mixer HPA numbers does seem to alleviate the issue – perhaps a larger deployment is heavier on mixer? With all the moving parts, I haven’t nailed down a good test here… just speculation.
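For reference, the mixer HPA knobs in question are the ones from the install values.yaml shown earlier; bumping them looks like this (the numbers are only an example, not a verified fix):

```yaml
mixer:
  replicaCount: 5
  autoscaleMin: 5
  autoscaleMax: 15
```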

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 6
  • Comments: 29 (27 by maintainers)

Most upvoted comments

We have experienced a similar issue on our staging environment recently with Istio 1.0.1. We have a fairly large cluster of around 1000 pods and are not running mTLS. A couple of weeks ago we enabled Istio auto injection in the cluster, and therefore as pod deploys occurred the mesh organically grew from a handful of pods with sidecars to nearly all of the pods in the cluster. Everything was running fine until a couple of days ago, when suddenly any newly deployed pod would fail its liveness probe with a connection refused, e.g. dial tcp 100.64.10.42:8080: getsockopt: connection refused

Once this issue had started to occur we observed the following:

  • in all of the cases we saw, all pods being created with an Istio sidecar would fail their liveness check on startup with a connection refused
  • some pods already running with Istio sidecars would also fail their liveness & readiness probes with connection refused errors. Within the set of pods with Istio sidecars, there seemed to be a marked divide between pods that experienced issues (multiple restarts, always failing) and others that experienced zero errors and were unaffected.
  • pods running without Istio sidecars seemed completely unaffected and could be connected to fine.

Action taken:

  • We checked resource consumption and no nodes in the cluster were running hot in any way. Also there didn’t seem to be a pattern between affected pods and where they were located.
  • We removed non-essential Istio components from our cluster (essentially all the metrics stuff) such as mixer, prometheus etc, but this had no effect.
  • As any new deployment would fail we had to act fairly quickly to resolve the issue; as such we put Istio back into an opt-in state (i.e. autoInject: disabled) and recreated the affected pods without Istio sidecars.

On the face of it this seems like some sort of capacity issue: everything was running smoothly whilst ramping up the number of pods with Istio sidecars until a certain level, at which point everything started to fail. Indeed, now that we are back to the point where only the handful of pods that opt in for an Istio sidecar are injected, they are all running fine with none of these issues.

Last tests from Mandar seem to confirm this is no longer a problem, as long as isolation (the Sidecar CRD) is used.

He means the new Sidecar resource, and the accompanying export_to fields, which let you restrict what config gets sent to each set of sidecars (that link is the mesh-wide default; similar fields exist on each resource: VirtualService, DestinationRule, ServiceEntry). They’re being added in 1.1 and are available in the pre-release. See the docs (I’ll check and see why the export_to fields are still hidden), but to help your scale problem:

  • Write a Sidecar for each namespace in your deployment. In each namespace, include services (or whole namespaces) you need to call out to in the egress clause.

    # This restricts the clusters and endpoints sent to Envoys in the `foo` namespace
    # to just other services in the same namespace, every service in the `baz` namespace,
    # and the `bang` service from the `bar` namespace.
    apiVersion: networking.istio.io/v1alpha3
    kind: Sidecar
    metadata:
      name: default
      namespace: foo
    spec:
      egress:
      - hosts:
        - "bar/bang.bar.svc.cluster.local"
        - "baz/*"
    

    Sidecars are opt-in per namespace, so just setting it up for the few large apps may be enough. However, if large sets of services need to communicate there may still be issues.

  • After creating Sidecars, set the defaultServiceExportTo field of MeshConfig to "." to make services namespace-private by default. Then they’ll only be visible where a Sidecar explicitly configures them.

Order is important here: sidecars will not be able to route traffic to services that are not visible, so setting the default before creating Sidecars for each namespace will break cross-namespace traffic in apps with Istio sidecars deployed.
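Assuming the mesh config lives in the istio ConfigMap in the istio-system namespace (as in a default Helm install), the second step would look roughly like this fragment of the mesh config:

```yaml
# Snippet of MeshConfig (the `mesh` key of the istio ConfigMap in istio-system).
# "." means: by default, export each Service only to its own namespace.
defaultServiceExportTo:
- "."
```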

@costinm Can you clarify what “sidecar is in” means here? From context, this sounds really promising though!

@jaygorrell We’ve been refocused on other projects so we haven’t revisited this in anger. I was initially thinking that we would hold out for the 1.1.0 release, however it looks like 1.0.5 is worth trying. I’ll be sure to update you if we give it a go.