istio: Rolling upgrade breaks sidecar-injector webhook on EKS (failed calling admission webhook "sidecar-injector.istio.io": context deadline exceeded)

Bug description

A rolling upgrade of the worker nodes results in a brief outage of application pods. There is no documentation specifying how HA should be configured for Istio, i.e. which pods need HA, etc. We kept the ingress gateway, egress gateway, and sidecar at a minimum of 2 replicas. Do we need Pilot to be HA? Why do the application pods disappear when a worker node is upgraded?

Expected behavior

A rolling upgrade of the worker nodes should not affect the application pods.

Steps to reproduce the bug

The ingress gateway, egress gateway, and sidecar have a minimum of 2 replicas; all other Istio pods have a minimum of 1 replica.
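
For context, a rough sketch of the equivalent Helm values (the value names are assumed from the Istio 1.0.x Helm chart and may differ between chart versions):

gateways:
  istio-ingressgateway:
    replicaCount: 2
  istio-egressgateway:
    replicaCount: 2
pilot:
  replicaCount: 1   # the open question above: should this also be >= 2 for HA?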

Version (include the output of istioctl version --remote and kubectl version)

Istio 1.0.7 and Kubernetes 1.11

How was Istio installed? Helm

Environment where the bug was observed (cloud vendor, OS, etc): EKS (AWS)

Affected product area (please put an X in all that apply)

[X] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ ] Networking
[X] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[X] User Experience

Additionally, please consider attaching a cluster state archive (dump file) to this issue.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 8
  • Comments: 56 (16 by maintainers)

Most upvoted comments

I have raised a support case with AWS, will post the findings here once I have an update.

Update: Still waiting for AWS support. They are trying to reproduce the problem at their end.

This appears to be an issue that is possibly specific to EKS…

I have 3 kubernetes clusters:

  1. EKS 1.13
  2. kops 1.13
  3. kops 1.14

In all 3 clusters, I did the following:

  1. Install istio
  2. Enable automatic proxy injection on namespace ‘A’
  3. Apply some workloads into namespace ‘A’
  4. I verified that the proxy sidecar was successfully injected into all workloads in namespace ‘A’
  5. I reduced the number of istio-sidecar-injector pods to 0 by executing this command: k scale -n istio-system deploy istio-sidecar-injector --replicas 0
  6. I deleted all pods in namespace ‘A’
  7. I ran kubectl get pods -n A, nothing was returned
  8. I ran kubectl get rs -n A, all workloads were returned and all displaying that 0 pods are running
  9. I ran kubectl describe rs -n A myrs and see this error: Error creating: Internal error occurred: failed calling webhook "sidecar-injector.istio.io": Post https://istio-sidecar-injector.istio-system.svc:443/inject?timeout=30s: no endpoints available for service "istio-sidecar-injector"
  10. I scaled back up the number of istio-sidecar-injector pods to 1 by executing this command: k scale -n istio-system deploy istio-sidecar-injector --replicas 1
  11. Here different behavior was experienced for kops vs EKS:
    • Kops (both version 1.13/1.14): After about a minute, all pods in namespace ‘A’ started up successfully and had the proxy sidecar injected.
    • EKS: After waiting more than 30 minutes no pods in namespace ‘A’ have started.

It appears that the admission controller webhook fails to retry in EKS, but not in kops.
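
For anyone trying to confirm the same behaviour, a couple of read-only checks (resource names assumed from the default Helm install) show whether the webhook has endpoints and what its failure policy is:

# Inspect the injector webhook's failure policy (resource name assumed from the default install)
kubectl get mutatingwebhookconfiguration istio-sidecar-injector -o yaml | grep -B2 -A2 failurePolicy

# Check whether the injector service currently has any endpoints backing it
kubectl get endpoints istio-sidecar-injector -n istio-system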

Ran into the same issue with EKS 1.14 and Istio 1.4. Since I’m using the Calico CNI rather than the VPC CNI, this looked like the same problem experienced when deploying the Kubernetes Metrics Server on EKS (link to issue). The fix was to update the istio-sidecar-injector Deployment to include:

spec:
  template:
    spec:
      hostNetwork: true

Once the new Pod is Running, automatic sidecar injection worked for me on a labeled Namespace. Using hostNetwork is not ideal, but it does work around the problem. It comes down to the EKS control plane not being able to communicate with IP addresses outside the VPC, i.e. the IP addresses assigned by the Calico CNI.
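
For reference, the same change can be applied without editing the manifest by hand (a sketch, using the deployment name from above):

kubectl -n istio-system patch deployment istio-sidecar-injector \
  --type merge \
  -p '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'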

Here is what I got from Amazon support regarding this issue:

I’ve done some additional testing with both Istio/Linkerd and EKS/kops and have been able to identify the issue. There is a bug in Kubernetes [1] where the admission controller keeps reusing a single TCP connection (across which all requests are multiplexed) even after that connection is no longer available. If a graceful shutdown takes place where the nodes are cordoned/drained, pods fail over normally. The issue appears to be resolved in Kubernetes 1.17.

I have provided the github issues for linkerd [2] and istio [3] which both mention the same issue.

References: [1] https://github.com/kubernetes/kubernetes/issues/80313 [2] https://github.com/linkerd/linkerd2/issues/3606 [3] https://github.com/istio/istio/issues/13840

The root cause is that golang doesn’t detect a half-closed TCP connection for HTTP/2. https://github.com/golang/go/issues/31643

A workaround could be adding the environment variable GODEBUG=http2server=0 to the sidecar-injector container in the deployment spec.

env:
        - name: GODEBUG
          value: http2server=0

For reference: https://github.com/golang/net/pull/55 https://github.com/kubernetes/kubernetes/pull/82090 https://github.com/kubernetes/kubernetes/issues/80313

To counter this, we split our worker nodes into control-plane and workload node groups. That way, when we do a rolling restart of the control-plane nodes (which host the Istio pods), the workload pods don’t get affected, and when we do a rolling restart of the workload nodes (our apps), Istio is always available.

This has worked pretty well for us
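
A rough sketch of how that split can be expressed on the Istio side (the node label is a placeholder for whatever label distinguishes the two node groups):

# Pin the istio-system deployments to the dedicated "control plane" node group
# (node-group: istio-control is a hypothetical label; substitute your own)
spec:
  template:
    spec:
      nodeSelector:
        node-group: istio-control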

This Istio config appears to have fixed it for me on EKS 1.14:

apiVersion: install.istio.io/v1alpha2
kind: IstioControlPlane
spec:
  autoInjection:
    components:
      injector:
        k8s:
          env:
            - name: GODEBUG
              value: http2server=0
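
That overlay can be applied with istioctl’s manifest-based install in 1.4 (the file name here is just a placeholder):

istioctl manifest apply -f injector-godebug.yaml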

@jwenz723 Will updating to 1.17 help? Have you done it?

As @chadlwilson mentioned, upgrading to 1.17 isn’t an option with EKS. AWS support recommended that I do the same thing that @chadlwilson and @jqmichael are doing with the GODEBUG env var.

Hello, I’ve been struggling with this issue in EKS for ages, and I’ve tried the following options:

  1. Using Weavenet instead of the AWS CNI: Istio works, but the automatic proxy injection does not. I need to create the workloads with manual injection of the proxy.
  2. Tried the customized variant of the AWS CNI as described in this link: https://docs.aws.amazon.com/eks/latest/userguide/cni-custom-network.html

With this configuration, the Istio 1.2.2 installation just hangs, waiting for CRDs to be created…

With option 1 I at least solve the IP address exhaustion and have Istio working. Has anyone tried an Istio installation with the custom AWS CNI?

I can confirm that @jqmichael’s workaround seems to be working well in my tests. I don’t know what the consequence is of only allowing HTTP/1.1 for the webhook, but it does allow our cluster to survive an uncontrolled shutdown of the nodes the sidecar-injector is running on.

If it helps anyone else, we apply this via an automated post-Helm-install script: patch-sidecar-injector.yaml

spec:
  template:
    spec:
      containers:
      - name: sidecar-injector-webhook
        env:
        - name: GODEBUG
          value: http2server=0

and

echo "Patching istio-sidecar-injector..."
kubectl patch deployment istio-sidecar-injector -n istio-system --patch "$(cat patch-sidecar-injector.yaml)"
kubectl rollout status deployment istio-sidecar-injector -n istio-system --timeout=180s
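
Not part of the original script, but a quick way to confirm the env var actually landed on the container:

kubectl -n istio-system get deployment istio-sidecar-injector \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="sidecar-injector-webhook")].env}'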