istio: Stale endpoints may cause istio-pilot to be stuck
Bug description
We are running 1.4.5-10.
We have a service with 300+ replicas, and during a rolling update we see istio-proxy come up without the proper ingress/egress ports bound as specified in the matching Sidecar object.
Looking at the pilot log, right after the rolling update we start seeing entries like these:
2020-06-29T19:34:26.897583Z warn failed to get pod web/noncanary-69bc8b644d-swk5v from kube-apiserver: pods "noncanary-69bc8b644d-swk5v" not found
2020-06-29T19:34:26.897612Z error Endpoint without pod 10.20.214.31 noncanary.web
2020-06-29T19:34:27.114441Z warn failed to get pod web/noncanary-69bc8b644d-2hs9d from kube-apiserver: pods "noncanary-69bc8b644d-2hs9d" not found
2020-06-29T19:34:27.114470Z error Endpoint without pod 10.20.214.36 noncanary.web
2020-06-29T19:34:27.303719Z warn failed to get pod web/noncanary-69bc8b644d-cvnsx from kube-apiserver: pods "noncanary-69bc8b644d-cvnsx" not found
2020-06-29T19:34:27.303749Z error Endpoint without pod 10.20.214.4 noncanary.web
2020-06-29T19:34:27.497450Z warn failed to get pod web/noncanary-69bc8b644d-bsvmp from kube-apiserver: pods "noncanary-69bc8b644d-bsvmp" not found
2020-06-29T19:34:27.497483Z error Endpoint without pod 10.20.215.123 noncanary.web
2020-06-29T19:34:27.707925Z warn failed to get pod web/noncanary-69bc8b644d-f85fl from kube-apiserver: pods "noncanary-69bc8b644d-f85fl" not found
2020-06-29T19:34:27.707962Z error Endpoint without pod 10.20.215.222 noncanary.web
And these errors repeat for the same endpoint:
2020-06-29T19:35:11.240233Z warn failed to get pod web/noncanary-69bc8b644d-swk5v from kube-apiserver: pods "noncanary-69bc8b644d-swk5v" not found
2020-06-29T19:35:11.240249Z error Endpoint without pod 10.20.214.31 web-sm.web
2020-06-29T19:35:54.905782Z warn failed to get pod web/noncanary-69bc8b644d-swk5v from kube-apiserver: pods "noncanary-69bc8b644d-swk5v" not found
2020-06-29T19:35:54.905810Z error Endpoint without pod 10.20.214.31 noncanary.web
The metric pilot_k8s_endpoints_with_no_pods continues to increase to 10x or more the number of endpoints this deployment has. The metric increments whenever pilot cannot look up a pod, for whatever reason; see https://github.com/istio/istio/blob/a78680cc3f0c811c3f8056fd80e0ba8b9f487f5d/pilot/pkg/serviceregistry/kube/controller/endpoints.go#L219
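For context on where that counter comes from, below is a minimal Go sketch of the lookup pattern involved. It is illustrative only, not the endpoints.go code linked above, and the names podForEndpoint, podByIP and endpointsWithNoPods are invented for this example; the point is that a cache miss turns into a direct, rate-limited call to the API server.

```go
// Simplified sketch (not the actual Istio code) of the lookup pattern behind
// pilot_k8s_endpoints_with_no_pods: consult the informer cache first, then fall
// back to a direct kube-apiserver call; if both miss, count the endpoint.
package registry

import (
	"context"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// endpointsWithNoPods is a stand-in for the real Prometheus counter.
var endpointsWithNoPods int64

// podForEndpoint resolves the pod backing an endpoint address. podByIP is an
// illustrative cache keyed by pod IP (the real controller keeps such an index).
func podForEndpoint(ctx context.Context, client kubernetes.Interface,
	podByIP map[string]*v1.Pod, addr v1.EndpointAddress, ns string) *v1.Pod {

	// Fast path: local cache built from the pod informer.
	if pod, ok := podByIP[addr.IP]; ok {
		return pod
	}

	// Slow path: a direct GET against the API server. Each call consumes the
	// client's shared QPS/Burst budget, which is what this issue suspects is
	// exhausted during a large rolling update.
	if addr.TargetRef != nil {
		pod, err := client.CoreV1().Pods(ns).Get(ctx, addr.TargetRef.Name, metav1.GetOptions{})
		if err == nil {
			return pod
		}
		if apierrors.IsNotFound(err) {
			// The pod is gone (terminated during the rollout) but its endpoint is
			// still being processed: "Endpoint without pod" is logged and the
			// counter grows.
			endpointsWithNoPods++
		}
	}
	return nil
}
```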
We can recover somewhat by restarting all pilot instances, but some istio-proxy sidecars can still be missing ingress ports (egress always seems to be bound after pilot restarts).
I suspect this is caused by the low QPS/Burst limits toward the Kubernetes API server, and possibly an edge case in the handling of stale endpoints that requires one extra lookup before they are removed, so the limited API call quota gets spent on unneeded work.
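For reference, the QPS/Burst limits in question are the standard client-go rest.Config fields. Below is a minimal sketch of raising them on a kubeconfig-based client; this is not Pilot's actual bootstrap code, and the numbers are arbitrary examples.

```go
// A minimal sketch of the client-go knobs in question: QPS and Burst on
// rest.Config bound every call the client makes to the API server, including
// the per-endpoint pod lookups shown in the log above.
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func newRateLimitedClient(kubeconfig string) (kubernetes.Interface, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	// Defaults are roughly QPS=5, Burst=10; a 300+ replica rollout can generate
	// far more lookups per second than that, so calls queue behind the limiter.
	cfg.QPS = 100
	cfg.Burst = 200
	return kubernetes.NewForConfig(cfg)
}
```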
- [ ] Configuration Infrastructure
- [ ] Docs
- [ ] Installation
- [ ] Networking
- [x] Performance and Scalability
- [ ] Policies and Telemetry
- [ ] Security
- [ ] Test and Release
- [ ] User Experience
- [ ] Developer Infrastructure
Expected behavior
istio-proxy of a large deployment should work as expected.
Steps to reproduce the bug
Doing a rolling update of a large deployment (300+ replicas)
Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)
1.4.10
How was Istio installed?
helm/custom
Environment where bug was observed (cloud vendor, OS, etc)
Baremetal on-prem Kubernetes cluster
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 23 (10 by maintainers)
@phsiao I reproduced this in 1.6; I think your hypothesis is correct. Great find!
We have experienced the same issue as reported here with Pilot 1.4.5 and EKS 1.14. It was caused by a sudden traffic spike that up-scaled the Istio Ingress Gateway rapidly (300+ pods) and then down-scaled it rapidly after the spike passed. The issue caused the Istio Ingress Gateway to route requests to destroyed pods, as well as to incorrect services.
We have troubleshot the Pilot source code along with the Kubernetes client-go package. We would like to share our findings and our thoughts on fixing this issue, and to look for a solution that does not require us to maintain a customized Pilot version.
How does Pilot receive and handle events? From troubleshooting the Pilot code on our test cluster, here is how Pilot and its event handling work (a minimal sketch of this informer pattern follows below):
- Pilot uses a SharedIndexInformer (source code) to watch Services, Nodes, Endpoints and Pod events (source code);
- events are stored in the informer's Indexer (source code);
- events are queued in a DeltaFIFO (source code) and processed by HandleDeltas() (source code);
- sharedIndexInformer adds, updates or deletes its own cache before distributing the event to Pilot to be handled. For endpoints events, Pilot then calls updateEDS() and pushes the update to proxies.
How does updateEDS() work out endpoints? Pilot loops through all endpoint IP addresses present in the endpoints event data and tries to work out the correct endpoints before updating the proxies.
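As a concrete illustration of the informer flow described above, here is a minimal client-go sketch that watches Endpoints the same way. It is not Pilot's actual registry code; the 30-second resync period and the handler bodies are assumptions for the example.

```go
// Minimal client-go shared-informer sketch: events flow from the API server
// through the reflector and DeltaFIFO, are processed by HandleDeltas() into the
// local Indexer cache, and are then distributed to registered event handlers.
package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// In-cluster config; Pilot builds its client from its own configuration.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Shared informer factory with an assumed 30s resync period.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	epInformer := factory.Core().V1().Endpoints().Informer()

	epInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			ep := obj.(*v1.Endpoints)
			_ = ep // e.g. trigger an EDS-style recompute for this service
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			// Pilot's handler for endpoints updates eventually calls updateEDS().
		},
		DeleteFunc: func(obj interface{}) {},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, epInformer.HasSynced)
	<-stop // run until stopped
}
```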
When does this issue happen?
Why does this issue cause the Ingress Gateway to route requests to destroyed pods or incorrect services?
What is the impact?
Thoughts on a fix
It is, but the issue is caused by the reconciliation process stalling due to rate limiting: during a large deployment update it may take a while for all those inline API calls to complete, and as a result the system is stale and misbehaving. By making those API calls inline, the system risks being put into this situation. We have tried to tune the QPS as high as possible, but it only relieves the problem and does not eliminate it.
Regarding memory leaks, I had not paid attention, but going back to the metrics I do see a rapid increase in memory usage from the discovery containers during the time of the trouble.
Does it still make sense for us to help investigate this issue in 1.4, or should we focus on upgrading to 1.6 first? If this part of the code did not change much from 1.4 to 1.6, it might be easier for us to focus on what we have installed and port the fix to 1.6 later.