istio: Rolling upgrade breaks sidecar-injector webhook on EKS (failed calling admission webhook "sidecar-injector.istio.io": context deadline exceeded)
Bug description
A rolling upgrade of the worker nodes results in a brief outage of the application pods. There is no documentation that specifies how HA should be configured for Istio: which pods need HA, and so on. We kept the ingress gateway, egress gateway, and sidecar injector at a minimum of 2 replicas. Do we need Pilot to be HA? Why do the application pods disappear when a worker node upgrade happens?
Expected behavior
Rolling upgrade of the worker nodes should not affect the application pods
Steps to reproduce the bug
The ingress gateway, egress gateway, and sidecar injector have a replica count of 2; all other Istio pods have a minimum of 1 replica.
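For reference, the replica counts described above can be set directly on the Istio deployments; a minimal sketch, assuming the default deployment names from a Helm install of Istio 1.0.x:

```sh
# Scale the gateways and the sidecar injector to 2 replicas each
# (deployment names assume a default Helm install of Istio 1.0.x).
kubectl -n istio-system scale deployment istio-ingressgateway --replicas=2
kubectl -n istio-system scale deployment istio-egressgateway --replicas=2
kubectl -n istio-system scale deployment istio-sidecar-injector --replicas=2
```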
Version (include the output of istioctl version --remote and kubectl version)
Istio 1.0.7 and Kubernetes 1.11
How was Istio installed? Helm
Environment where the bug was observed (cloud vendor, OS, etc): EKS (AWS)
Affected product area (please put an X in all that apply)
[X] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ ] Networking
[X] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[X] User Experience
Additionally, please consider attaching a cluster state archive (dump file) to this issue.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 8
- Comments: 56 (16 by maintainers)
I have raised a support case with AWS, will post the findings here once I have an update.
Update: Still waiting for AWS support. They are trying to reproduce the problem at their end.
This appears to be an issue that is possibly specific to EKS…
I have 3 Kubernetes clusters:
In all 3 clusters, I did the following:
1. Scaled the istio-sidecar-injector pods to 0 by executing: `kubectl scale -n istio-system deploy istio-sidecar-injector --replicas 0`
2. Ran `kubectl get pods -n A`; nothing was returned.
3. Ran `kubectl get rs -n A`; all workloads were returned, all displaying that 0 pods are running.
4. Ran `kubectl describe rs -n A myrs` and saw this error: `Error creating: Internal error occurred: failed calling webhook "sidecar-injector.istio.io": Post https://istio-sidecar-injector.istio-system.svc:443/inject?timeout=30s: no endpoints available for service "istio-sidecar-injector"`
5. Scaled the istio-sidecar-injector pods back to 1 by executing: `kubectl scale -n istio-system deploy istio-sidecar-injector --replicas 1`
It appears that the admission controller webhook fails to retry in EKS, but not in kops.
Ran into the same issue with EKS 1.14 and Istio 1.4. Since I’m using the Calico CNI, and not the VPC CNI, this seemed similar to the problem experienced when deploying the Kubernetes Metrics Server on EKS (link to issue). Update the istio-sidecar-injector Deployment to include the change sketched below:
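The exact snippet from this comment isn’t preserved in the thread; based on the hostNetwork mention below, a minimal sketch of the kind of Deployment fragment being described (an assumption, not the commenter’s original YAML):

```yaml
# Fragment of the istio-sidecar-injector Deployment: run the injector pod on
# the host network so the EKS control plane can reach the webhook even when
# pod IPs are assigned by Calico rather than the VPC CNI.
spec:
  template:
    spec:
      hostNetwork: true
```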
Once the new Pod is Running, automatic sidecar injection worked for me on a labeled Namespace. Using hostNetwork isn’t ideal, but it does work. It comes down to the EKS control plane not being able to communicate with IP addresses outside the VPC, i.e. IP addresses assigned via the Calico CNI.
Here is what I got from Amazon support regarding this issue:
The root cause is that Go doesn’t detect a half-closed TCP connection for HTTP/2. https://github.com/golang/go/issues/31643
A workaround could be adding the environment variable GODEBUG=http2server=0 to the sidecar injector’s Deployment spec (a sketch follows the references below).
For references: https://github.com/golang/net/pull/55 https://github.com/kubernetes/kubernetes/pull/82090 https://github.com/kubernetes/kubernetes/issues/80313
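A minimal sketch of that workaround as a fragment of the istio-sidecar-injector Deployment spec (the container name is an assumption based on the default Helm chart, not something stated in this thread):

```yaml
# Force the Go HTTP server in the injector back to HTTP/1.1 so half-closed
# TCP connections are detected and the webhook recovers after node restarts.
spec:
  template:
    spec:
      containers:
        - name: sidecar-injector-webhook   # assumed default container name
          env:
            - name: GODEBUG
              value: http2server=0
```

Equivalently, `kubectl -n istio-system set env deployment/istio-sidecar-injector GODEBUG=http2server=0` applies the same variable without editing the manifest by hand.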
To counter this we split our worker nodes into control-plane and workload node groups. That way, when we do a rolling restart of the control-plane nodes (which host the Istio pods), the workload pods don’t get affected, and when we do a rolling restart of the workload nodes (our apps), Istio is always available.
This has worked pretty well for us
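A minimal sketch of that split, assuming a hypothetical node label such as node-group=istio on the dedicated Istio nodes (the label name is illustrative; the original comment doesn’t specify how the groups are distinguished):

```yaml
# Fragment of the Istio control-plane Deployments (e.g. istio-sidecar-injector):
# pin them to the dedicated node group so rolling the workload nodes never
# takes the injector or Pilot offline.
spec:
  template:
    spec:
      nodeSelector:
        node-group: istio     # hypothetical label on the dedicated Istio nodes
```

Application Deployments would use a different selector (or none) so they schedule onto the workload node group, letting each group be drained and upgraded independently.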
This Istio config appears to have fixed it for me on EKS 1.14:
As @chadlwilson mentioned, upgrading to 1.17 isn’t an option with EKS. AWS support recommended that I do the same thing that @chadlwilson and @jqmichael are doing with the GODEBUG env var.
Hello, I’ve been struggling with this issue in EKS for ages, and I’ve tried the following options:
1. Using Weave Net instead of the AWS CNI: Istio works, but automatic proxy injection does not; the workload has to be created with manual injection of the proxy (see the sketch after this list).
2. Tried the customized variant of the AWS CNI described here: https://docs.aws.amazon.com/eks/latest/userguide/cni-custom-network.html
With this configuration, the Istio installation (version 1.2.2) just hangs, waiting for CRDs to be created…
With option 1 I at least solve the IP address exhaustion and have Istio working. Has anyone tried an Istio installation with the custom AWS CNI?
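For context, manual injection as mentioned in option 1 is typically done by piping the manifest through istioctl before applying it; a minimal sketch (the manifest file name is illustrative):

```sh
# Render the sidecar into the Deployment at deploy time instead of relying on
# the mutating webhook, then apply the injected manifest.
istioctl kube-inject -f my-app-deployment.yaml | kubectl apply -f -
```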
I can confirm that @jqmichael’s workaround seems to be working well in my tests. I don’t know what the consequence of only allowing HTTP/1.1 for the webhook is, but it does allow our cluster to survive an uncontrolled shutdown of the nodes the sidecar-injector is on.
If it helps anyone else, we apply this via an automated post-Helm-install script: patch-sidecar-injector.yaml
and