istio: Stale endpoints may cause istio-pilot to be stuck

Bug description

We are running 1.4.5-10.

We have a service with 300+ replicas, and while doing a rolling update we would see istio-proxy come up without the proper ingress/egress ports bound as specified in the matching Sidecar object.

Looking at the pilot log, right after the rolling update we start seeing entries like these:

2020-06-29T19:34:26.897583Z	warn	failed to get pod web/noncanary-69bc8b644d-swk5v from kube-apiserver: pods "noncanary-69bc8b644d-swk5v" not found
2020-06-29T19:34:26.897612Z	error	Endpoint without pod 10.20.214.31 noncanary.web
2020-06-29T19:34:27.114441Z	warn	failed to get pod web/noncanary-69bc8b644d-2hs9d from kube-apiserver: pods "noncanary-69bc8b644d-2hs9d" not found
2020-06-29T19:34:27.114470Z	error	Endpoint without pod 10.20.214.36 noncanary.web
2020-06-29T19:34:27.303719Z	warn	failed to get pod web/noncanary-69bc8b644d-cvnsx from kube-apiserver: pods "noncanary-69bc8b644d-cvnsx" not found
2020-06-29T19:34:27.303749Z	error	Endpoint without pod 10.20.214.4 noncanary.web
2020-06-29T19:34:27.497450Z	warn	failed to get pod web/noncanary-69bc8b644d-bsvmp from kube-apiserver: pods "noncanary-69bc8b644d-bsvmp" not found
2020-06-29T19:34:27.497483Z	error	Endpoint without pod 10.20.215.123 noncanary.web
2020-06-29T19:34:27.707925Z	warn	failed to get pod web/noncanary-69bc8b644d-f85fl from kube-apiserver: pods "noncanary-69bc8b644d-f85fl" not found
2020-06-29T19:34:27.707962Z	error	Endpoint without pod 10.20.215.222 noncanary.web

And these errors repeat for the same endpoint

2020-06-29T19:35:11.240233Z	warn	failed to get pod web/noncanary-69bc8b644d-swk5v from kube-apiserver: pods "noncanary-69bc8b644d-swk5v" not found
2020-06-29T19:35:11.240249Z	error	Endpoint without pod 10.20.214.31 web-sm.web
2020-06-29T19:35:54.905782Z	warn	failed to get pod web/noncanary-69bc8b644d-swk5v from kube-apiserver: pods "noncanary-69bc8b644d-swk5v" not found
2020-06-29T19:35:54.905810Z	error	Endpoint without pod 10.20.214.31 noncanary.web

The metric pilot_k8s_endpoints_with_no_pods would continue to increase to 10x or more the number of endpoints this deployment has. The metric increments whenever pilot cannot look up a pod, for whatever reason; see https://github.com/istio/istio/blob/a78680cc3f0c811c3f8056fd80e0ba8b9f487f5d/pilot/pkg/serviceregistry/kube/controller/endpoints.go#L219

We can recover somewhat by restarting all pilot instances, but some istio-proxy instances can still be missing ingress ports (egress ports always seem to be bound after pilot restarts).

I suspect this is caused by low QPS/Burst to the Kubernetes API server, combined with an edge case in the handling of stale endpoints that requires one lookup per endpoint before removing it, so the limited API call quota is spent on unneeded work.
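For context, the client-side throttle in question lives in client-go's rest.Config. The sketch below is a generic illustration, not how Pilot actually wires its config; the values are chosen purely for illustration, and the defaults (roughly QPS 5 / Burst 10 in that client-go generation) are easily exhausted when one GET is issued per stale endpoint address.

```go
package kubeclient

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newThrottledClient shows where client-go's client-side throttle is
// configured. The field values here are illustrative only; they are not
// taken from Pilot's actual configuration.
func newThrottledClient(cfg *rest.Config) (kubernetes.Interface, error) {
	cfg.QPS = 80    // requests per second before client-side throttling begins
	cfg.Burst = 160 // short burst allowance above the sustained QPS
	return kubernetes.NewForConfig(cfg)
}
```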

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ ] Networking
[x] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior

istio-proxy of a large deployment should work as expected.

Steps to reproduce the bug

Doing a rolling update of a large deployment (300+ replicas)

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

1.4.10

How was Istio installed?

helm/custom

Environment where bug was observed (cloud vendor, OS, etc)

Baremetal on-prem Kubernetes cluster

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 23 (10 by maintainers)

Most upvoted comments

@phsiao I reproduced in 1.6, I think your hypothesis is correct. great find!

We have experienced the same issue reported here with Pilot 1.4.5 and EKS 1.14. It was caused by a sudden traffic spike that up-scaled the Istio Ingress Gateway rapidly (300+ pods) and then down-scaled it rapidly after the spike passed. The issue caused the Istio Ingress Gateway to route requests to destroyed pods, as well as to incorrect services.

We have troubleshot the Pilot source code along with the Kubernetes client-go package. We would like to share our findings and thoughts on fixing this issue here, and we are looking for a solution that does not require us to maintain a customized Pilot version.

How does Pilot receive and handle events? From troubleshooting the Pilot code on our test cluster, here is how Pilot and its event handling work.

  • Pilot creates a separate SharedIndexInformer (source code) to watch Service, Node, Endpoints and Pod events;
  • Each informer has its own cache, defined in an Indexer (source code);
  • When an informer observes a change from the Kubernetes master, it pushes the change into a DeltaFIFO queue (source code), which is processed by HandleDeltas() (source code);
  • sharedIndexInformer adds, updates or deletes entries in its own cache before distributing the event to Pilot to be handled. For an Endpoints event, Pilot will then call updateEDS() and push the update to proxies;
  • Pilot's event handler pushes tasks, each a combination of an event handler and its event data, into a queue;
  • The tasks in the queue are executed sequentially (source code).
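A minimal sketch of that "handler + data" task queue pattern, with illustrative names rather than Pilot's actual types; it only shows why a single slow task holds up everything queued behind it.

```go
package taskqueue

// Task pairs an event handler with the event data it should process,
// mirroring the "handler + data" tasks described above (names are
// illustrative, not Pilot's actual types).
type Task struct {
	Handle func(event interface{}) error
	Event  interface{}
}

// Queue executes tasks one at a time; a single slow task (for example one
// that makes many throttled API calls) delays everything behind it.
type Queue struct {
	tasks chan Task
}

func NewQueue(size int) *Queue {
	return &Queue{tasks: make(chan Task, size)}
}

// Push enqueues a task for sequential execution.
func (q *Queue) Push(t Task) { q.tasks <- t }

// Run drains the queue sequentially until the channel is closed.
func (q *Queue) Run() {
	for t := range q.tasks {
		_ = t.Handle(t.Event) // a real implementation would log or retry errors
	}
}
```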

How does updateEDS() work out endpoints? Pilot loops through all endpoint IP addresses present in the endpoints event data and tries to work out the correct endpoints before updating proxies.

  • Pilot invokes c.pods.getPodByIP(ea.IP) to verify the corresponding Pod is present in the informer's cache. For any event, including a Pod deleted event, the informer's cache is updated before the event data is even distributed to Pilot to handle:
		for _, ss := range ep.Subsets {
			for _, ea := range ss.Addresses {
				pod := c.pods.getPodByIP(ea.IP)
				if pod == nil {
  • If the pod does not exist in the informer's cache, Pilot invokes the Kubernetes API to verify the Pod's existence, as a guard against eventual consistency.
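A hedged sketch of that lookup order, using illustrative names and the pre-context Get signature of that era's client-go; in the real code the pod name comes from the endpoint address's TargetRef, and this is not Pilot's actual implementation.

```go
package endpointlookup

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// resolvePod sketches the lookup order described above: first the informer's
// local cache (keyed by pod IP), then a direct GET against the API server as
// a guard against eventual consistency. Every cache miss costs one throttled
// API call, which is where the quadratic blow-up comes from.
func resolvePod(cache map[string]*corev1.Pod, client kubernetes.Interface,
	namespace, name, ip string) (*corev1.Pod, error) {
	if pod, ok := cache[ip]; ok {
		return pod, nil
	}
	// Cache miss: ask the API server directly. For a pod that has just been
	// deleted this returns a 404, but only after consuming API quota.
	return client.CoreV1().Pods(namespace).Get(name, metav1.GetOptions{})
}
```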

When does this issue happen?

  • When a Pod is created, the pod add event is received by the informer and arrives earlier than the pod's IP address appears in an endpoint update event; this is why up-scaling a large number of Pods works without issue;
  • When a Pod is destroyed, a pod delete event and an endpoint update event are received for each destroyed pod, and the pod delete event arrives later than the corresponding endpoint update event. This is why the issue only occurs when a large number of Pods (150+ in our cluster) are destroyed rapidly.

Why does this issue cause the Ingress Gateway to route requests to destroyed pods or incorrect services?

  • When a large number of Istio Ingress Gateway pods are destroyed by the HPA, the endpoint update events created for each pod are received by Pilot, which creates a corresponding task in its task queue and processes them sequentially;
  • The event processing seems fine until the pod delete events arrive and cause the endpoint update event handler to call the Kubernetes master to verify Pod existence;
  • When the Kubernetes API calls are throttled, each call takes around 200ms, which significantly delays each event's handling. For example, in one of our tests:
    • We down-scaled the Ingress Gateway from 250 pods to 20 pods; 230 endpoint update events were received;
    • Pilot started invoking the Kubernetes master from the endpoint update event that contained 172 endpoint IP addresses;
    • The time required to process that one event is 172 * 200ms ~= 35s, while there were still 171 Ingress Gateway endpoint update events in the queue;
    • The total number of calls to the Kubernetes master for the Ingress Gateway pods is approximately 14,706 ((n^2 - n)/2 with n = 172), which would take approximately 50 minutes;
    • At the same time, other services in the cluster were up-scaling and down-scaling frequently; none of their endpoint update events could be pushed to the Ingress Gateway, and every one of those events also required calling the Kubernetes master for each IP address present in the endpoint update event.
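The arithmetic behind those estimates, using only the numbers stated above (172 stale addresses in the first throttled event, ~200ms per throttled call); the program names are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// Reproduces the back-of-the-envelope estimate above: each successive
// endpoint update event carries roughly one fewer stale address, so the
// total number of per-address API calls is about (n^2 - n)/2 for n stale
// addresses in the first throttled event.
func main() {
	const n = 172                          // stale addresses in the first throttled event
	const perCall = 200 * time.Millisecond // observed latency per throttled API call

	totalCalls := (n*n - n) / 2
	fmt.Printf("first event alone: %v\n", time.Duration(n)*perCall) // ~34s
	fmt.Printf("total calls: %d, total time: %v\n",
		totalCalls, time.Duration(totalCalls)*perCall) // 14706 calls, ~49m
}
```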

What is the impact?

  • Istio Ingress Gateway routes requests to destroyed pods;
  • Istio Ingress Gateway routes requests to the wrong service, because the IP addresses are recycled and reused by newly created pods.

Thoughts on a fix

  • Can endpoint update events for different services be handled concurrently?
  • Dedup endpoint update events? (see the sketch below)
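One way to read the dedup idea is a keyed queue that collapses pending updates per service, similar to what client-go's workqueue already does. The sketch below only makes that reading concrete; it is not Pilot's implementation and the types are invented for illustration.

```go
package dedupqueue

import "sync"

// Queue collapses pending endpoint updates per service key: if an update
// for "noncanary.web" is already waiting, a newer one replaces its payload
// instead of adding another task.
type Queue struct {
	mu      sync.Mutex
	pending map[string]interface{} // latest payload per service key
	order   []string               // keys in arrival order
}

func New() *Queue {
	return &Queue{pending: map[string]interface{}{}}
}

// Push enqueues or replaces the pending update for key.
func (q *Queue) Push(key string, payload interface{}) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if _, ok := q.pending[key]; !ok {
		q.order = append(q.order, key)
	}
	q.pending[key] = payload
}

// Pop returns the oldest pending key with its latest payload.
func (q *Queue) Pop() (key string, payload interface{}, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.order) == 0 {
		return "", nil, false
	}
	key = q.order[0]
	q.order = q.order[1:]
	payload = q.pending[key]
	delete(q.pending, key)
	return key, payload, true
}
```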

What I meant was, notwithstanding rate limiting (which would return a 429 or something), a 404 should always be honored as explicit evidence of non-existence?

It is, but the issue is caused by the reconciliation process stalling due to rate limiting: during a large deployment update it may take a while for all those inline API calls to complete, and as a result the system is stale and misbehaving. By making those API calls, the system risks being put into this situation. We have tried to tune the QPS as high as possible, but that only relieves the problem rather than eliminating it.
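To make the 404-versus-throttling distinction concrete, here is a sketch using apimachinery's error helpers; it is illustrative only and not the fix that eventually landed.

```go
package endpointlookup

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// classifyLookupError sketches the distinction discussed above: a 404 from
// the API server is explicit evidence the pod is gone, so the stale endpoint
// can be dropped and never re-checked, whereas throttling (a delayed call or
// a 429) only means the answer is late.
func classifyLookupError(err error) (podGone, shouldRetry bool) {
	switch {
	case err == nil:
		return false, false
	case apierrors.IsNotFound(err):
		return true, false // honour the 404: stop spending quota on this endpoint
	case apierrors.IsTooManyRequests(err):
		return false, true // 429: back off and retry later
	default:
		return false, true // transient errors: retry
	}
}
```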

Regarding memory leaks, I had not paid attention, but going back to the metrics I do see a rapid increase in memory usage in the discovery containers during the period of trouble.

[Screenshot: memory usage of the discovery containers, captured 2020-06-30]

Does it still make sense for us to help investigate this issue in 1.4, or should we focus on upgrading to 1.6 first? If this part of the code did not change much from 1.4 to 1.6, it might be easier for us to focus on what we have installed and port the fix to 1.6 later.