istio: Istio ingress still forwarding requests to a pod that was marked terminated 9 minutes ago

Bug Description

Hi Team, we are running Istio (1.17.1-distroless) in our EKS cluster. During an event in which a ReplicaSet scaled down rapidly, from 350 pods to 180 in around 10 minutes, we saw a large number of 503s. We believe this is because the Istio ingress gateway was still sending requests to the terminated pod's proxy, even 9 minutes after the Endpoints update. The cluster runs about 3K pods across ~250 nodes, with roughly 200 services.

Here is a chronology of the events:

  1. The pod (*-rbzcp) was marked for deletion by the ReplicaSet controller at 2023-05-17T02:39:02.754863Z
  2. The pod's IP was updated in the Endpoints object at 2023-05-17T02:39:02.891734Z
  3. The pod was deleted by the node's kubelet at 2023-05-17T02:42:08.759850Z
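One way to confirm step 2 independently while reproducing this is to check whether the pod's IP has actually left the service's Endpoints object. A minimal sketch, where SERVICE, NAMESPACE, and POD_IP are placeholders, not names from this issue:

```shell
# Placeholders: replace SERVICE / NAMESPACE / POD_IP with real values.
# List every endpoint address currently registered for the service.
kubectl -n NAMESPACE get endpoints SERVICE \
  -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}'

# If the removal has propagated, the terminated pod's IP should be gone.
kubectl -n NAMESPACE get endpoints SERVICE -o json \
  | grep POD_IP || echo "IP removed from Endpoints"
```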

The istio proxy logs on the same pod:

2023-05-17T02:42:07.910757Z	warn	Aborted proxy instance
2023-05-17T02:42:07.910766Z	warn	Aborting proxy
2023-05-17T02:42:07.910953Z	info	Envoy aborted normally
2023-05-17T02:42:07.910966Z	info	Agent has successfully terminated
2023-05-17T02:42:07.911016Z	error	error serving tap http server: http: Server closed
2023-05-17T02:42:07.911374Z	info	ads	ADS: "@"<REDACTED>-1 terminated
2023-05-17T02:42:07.911388Z	info	ads	ADS: "@" <REDACTED>-2 terminated
2023-05-17T02:42:07.911402Z	info	sds	SDS server for workload certificates started, listening on "./var/run/secrets/workload-spiffe-uds/socket"

There are no logs from the pod in question (*-rbzcp) after this point. The screenshot shows that after the pod was marked terminated, the ingress continued to forward requests to it, returning 503s.
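To see whether the stale address was still present in the ingress gateway's Envoy configuration (as opposed to only in Kubernetes), something like the following can be used; INGRESS_POD and POD_IP below are placeholders:

```shell
# Placeholders: replace INGRESS_POD / POD_IP with real values.
# Dump the endpoints Envoy currently holds for the stale IP; an entry
# here long after the Endpoints update points at EDS propagation lag.
istioctl proxy-config endpoint INGRESS_POD.istio-system --address POD_IP

# Compare each proxy's config-sync status against istiod (SYNCED vs STALE).
istioctl proxy-status
```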

Screenshot 2023-05-18 at 17 57 36

At this time, the pod count of this deployment decreased as shown below (HPA scale-in), from 330 to 180 over 15 minutes. (Times are in EST rather than UTC.) Screenshot 2023-05-18 at 17 45 28

We are running 50 replicas of Istiod, but they appear to have been throttled at that time (note that throttling also occurs on other occasions, without us seeing 503s/outages). Screenshot 2023-05-18 at 17 50 20
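Since the control plane was throttled during the scale-down, istiod's push-throttle and debounce settings may be relevant; they are exposed as environment variables on the istiod deployment. The values below are purely illustrative, not tuning advice from this issue:

```yaml
# Illustrative fragment for the istiod Deployment spec; values are examples only.
env:
  - name: PILOT_PUSH_THROTTLE    # max concurrent XDS pushes (default 100)
    value: "200"
  - name: PILOT_DEBOUNCE_AFTER   # quiet period before a push (default 100ms)
    value: "100ms"
  - name: PILOT_DEBOUNCE_MAX     # max delay before a push is forced (default 10s)
    value: "10s"
```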

Version

istioctl version:
client version: 1.15.3
control plane version: 1.17.1
data plane version: 1.17.1 (59 proxies)

 kubectl version --short
Client Version: v1.23.6
Server Version: v1.23.16-eks-48e63af

helm version --short
v3.6.3+gd506314

Additional Information

Istiod logs for that pod:

2023-05-17T02:42:07.911631Z	info	ads	ADS: "10.64.5.57:36606" <REDACTED>-rbzcp.<REDACTED>-27184 terminated
2023-05-17T02:32:53.743852Z	info	ads	RDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:21 size:1.6MB cached:10/21
2023-05-17T02:32:52.883771Z	info	ads	NDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:1 size:27.2kB
2023-05-17T02:32:51.939697Z	info	ads	LDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:63 size:291.3kB
2023-05-17T02:32:51.929958Z	info	ads	EDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:587 size:1.1MB empty:0 cached:586/587
2023-05-17T02:32:50.755431Z	info	ads	ADS: "10.64.5.57:48574" <REDACTED>-rbzcp.<REDACTED>-27024 terminated
2023-05-17T02:32:51.001296Z	info	ads	CDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:654 size:837.2kB cached:644/650
2023-05-17T02:32:50.995933Z	info	ads	ADS: new connection for node:<REDACTED>-rbzcp.<REDACTED>-27184
2023-05-17T02:23:28.027324Z	info	ads	NDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:1 size:27.2kB
2023-05-17T02:23:28.023420Z	info	ads	RDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:21 size:1.6MB cached:12/21
2023-05-17T02:23:28.001766Z	info	ads	LDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:63 size:291.3kB
2023-05-17T02:23:27.996641Z	info	ads	EDS: PUSH INC for node:<REDACTED>-rbzcp.<REDACTED> resources:2 size:155.6kB empty:0 cached:0/2
2023-05-17T02:23:27.167235Z	info	ads	CDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:654 size:837.2kB cached:650/650
2023-05-17T02:18:49.324106Z	info	ads	NDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:1 size:27.2kB
2023-05-17T02:18:49.320813Z	info	ads	RDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:21 size:1.6MB cached:12/21
2023-05-17T02:18:49.297607Z	info	ads	LDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:63 size:291.3kB
2023-05-17T02:18:49.291275Z	info	ads	EDS: PUSH INC for node:<REDACTED>-rbzcp.<REDACTED> resources:1 size:143.5kB empty:0 cached:0/1
2023-05-17T02:18:48.444070Z	info	ads	CDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:654 size:837.2kB cached:645/650
2023-05-17T02:16:20.119399Z	info	ads	NDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:1 size:27.2kB
2023-05-17T02:16:20.115814Z	info	ads	RDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:21 size:1.6MB cached:12/21
2023-05-17T02:16:20.091425Z	info	ads	LDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:63 size:291.3kB
2023-05-17T02:16:20.084363Z	info	ads	EDS: PUSH INC for node:<REDACTED>-rbzcp.<REDACTED> resources:1 size:190.7kB empty:0 cached:0/1
2023-05-17T02:16:19.217966Z	info	ads	CDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:654 size:837.2kB cached:650/650
2023-05-17T02:02:31.076854Z	info	ads	ADS: "10.64.5.57:34598" <REDACTED>-rbzcp.<REDACTED>-27211 terminated
2023-05-17T02:02:31.207645Z	info	ads	RDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:21 size:1.6MB cached:12/21
2023-05-17T02:02:31.181192Z	info	ads	NDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:1 size:27.2kB
2023-05-17T02:02:31.180526Z	info	ads	LDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:63 size:291.3kB
2023-05-17T02:02:31.171874Z	info	ads	EDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:587 size:1.1MB empty:0 cached:587/587
2023-05-17T02:02:31.167626Z	info	ads	CDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:654 size:837.2kB cached:629/650
2023-05-17T02:02:31.162313Z	info	ads	ADS: new connection for node:<REDACTED>-rbzcp.<REDACTED>-27024
2023-05-17T01:31:21.527483Z	info	ads	RDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:21 size:1.6MB cached:12/21
2023-05-17T01:31:21.482855Z	info	ads	NDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:1 size:27.2kB
2023-05-17T01:31:21.482188Z	info	ads	LDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:63 size:291.3kB
2023-05-17T01:31:21.473726Z	info	ads	EDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:587 size:1.1MB empty:0 cached:587/587
2023-05-17T01:31:21.470405Z	info	ads	CDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:654 size:837.2kB cached:624/650
2023-05-17T01:31:21.465776Z	info	ads	ADS: new connection for node:<REDACTED>-rbzcp.<REDACTED>-27211
2023-05-17T01:31:21.368391Z	info	ads	ADS: "10.64.5.57:33552" <REDACTED>-rbzcp.<REDACTED>-26893 terminated


Affected product area

  • Ambient
  • Docs
  • Installation
  • Networking
  • Performance and Scalability
  • Extensions and Telemetry
  • Security
  • Test and Release
  • User Experience
  • Developer Infrastructure
  • Upgrade
  • Multi Cluster
  • Virtual Machine
  • Control Plane Revisions

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 61 (47 by maintainers)

Commits related to this issue

Most upvoted comments

Some other ideas:

Sorry, the markdown was messed up. I was just commenting that I tried tuning GOMAXPROCS on my own cluster and saw no benefit.