istio: Istio ingress still forwarding request to a pod that got marked terminated 9mins ago
Bug Description
Hi Team, we are running Istio (1.17.1-distroless) in our EKS cluster. During an event where a ReplicaSet scaled down rapidly, from 350 pods to 180 in around 10 minutes, we saw a large number of 503s. We believe this is because the Istio ingress gateway kept sending requests to a terminated pod's proxy, even 9 minutes after the Endpoints update. The EKS cluster runs about 3K pods across ~250 nodes, with about 200 services.
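For a sense of the churn Istiod had to propagate, the figures above imply roughly this removal rate (simple arithmetic on the numbers from the description, nothing measured):

```shell
# 350 -> 180 pods over ~10 minutes: endpoint removals per minute
# that must be pushed out to every connected proxy.
echo $(( (350 - 180) / 10 ))
```

At ~17 terminations per minute, each removal triggers an EDS update that has to reach the ingress gateways quickly for them to stop routing to the dead pod.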
Here is a chronology of the events:
- The pod (*-rbzcp) was marked for deletion by the ReplicaSet controller at 2023-05-17T02:39:02.754863Z
- The pod's IP was removed from the Endpoints object at 2023-05-17T02:39:02.891734Z
- The pod was deleted by the node's kubelet at 2023-05-17T02:42:08.759850Z
The istio-proxy logs from the same pod:
2023-05-17T02:42:07.910757Z warn Aborted proxy instance
2023-05-17T02:42:07.910766Z warn Aborting proxy
2023-05-17T02:42:07.910953Z info Envoy aborted normally
2023-05-17T02:42:07.910966Z info Agent has successfully terminated
2023-05-17T02:42:07.911016Z error error serving tap http server: http: Server closed
2023-05-17T02:42:07.911374Z info ads ADS: "@"<REDACTED>-1 terminated
2023-05-17T02:42:07.911388Z info ads ADS: "@" <REDACTED>-2 terminated
2023-05-17T02:42:07.911402Z info sds SDS server for workload certificates started, listening on "./var/run/secrets/workload-spiffe-uds/socket"
There is no log from the pod in question (*-rbzcp) after this. The attached graph shows that, after the pod was marked terminated, the ingress gateway continued to forward requests to it and received 503s.
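One way to confirm this staleness directly is to check whether the gateway's Envoy still lists the terminated pod's IP in its endpoint table, via `istioctl proxy-config endpoints` against the ingress gateway. A sketch of the check (the service name and the second IP below are hypothetical; 10.64.5.57 is the pod IP from the logs above):

```shell
# Hypothetical capture of:
#   istioctl proxy-config endpoints deploy/istio-ingressgateway -n istio-system
# If the terminated pod's IP still appears here long after the Endpoints
# update, the gateway's EDS view is stale.
cat > endpoints.txt <<'EOF'
10.64.5.57:8080    HEALTHY    OK    outbound|80||myservice.default.svc.cluster.local
10.64.7.12:8080    HEALTHY    OK    outbound|80||myservice.default.svc.cluster.local
EOF
grep -c '^10\.64\.5\.57:' endpoints.txt   # non-zero: stale endpoint still present
```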
At this time, the pod count of this deployment was reduced by HPA scale-in from 330 to 180 over 15 minutes, as shown below. (Times are in EST rather than UTC.)
We are running 50 replicas of Istiod, but they appear to have been throttled at that time (note that throttling also occurs on other occasions, but we do not see 503s/outages then).
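If Istiod push throttling is the bottleneck, the concurrent XDS push limit can be raised via the PILOT_PUSH_THROTTLE environment variable on the istiod Deployment (default 100). A minimal sketch; the value 200 is an illustrative assumption, not a recommendation:

```yaml
# Fragment of the istiod Deployment spec (istio-system namespace):
# raise the number of concurrent XDS pushes Istiod will perform.
env:
  - name: PILOT_PUSH_THROTTLE
    value: "200"
```

Whether raising this helps depends on whether Istiod is CPU-bound or deliberately rate-limited; on large clusters the default has been reported as too small (see the related commits below in this issue).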
Version
istioctl version:
client version: 1.15.3
control plane version: 1.17.1
data plane version: 1.17.1 (59 proxies)
kubectl version --short
Client Version: v1.23.6
Server Version: v1.23.16-eks-48e63af
helm version --short
v3.6.3+gd506314
Additional Information
Istiod logs for that pod:
2023-05-17T02:42:07.911631Z info ads ADS: "10.64.5.57:36606" <REDACTED>-rbzcp.<REDACTED>-27184 terminated
2023-05-17T02:32:53.743852Z info ads RDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:21 size:1.6MB cached:10/21
2023-05-17T02:32:52.883771Z info ads NDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:1 size:27.2kB
2023-05-17T02:32:51.939697Z info ads LDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:63 size:291.3kB
2023-05-17T02:32:51.929958Z info ads EDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:587 size:1.1MB empty:0 cached:586/587
2023-05-17T02:32:50.755431Z info ads ADS: "10.64.5.57:48574" <REDACTED>-rbzcp.<REDACTED>-27024 terminated
2023-05-17T02:32:51.001296Z info ads CDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:654 size:837.2kB cached:644/650
2023-05-17T02:32:50.995933Z info ads ADS: new connection for node:<REDACTED>-rbzcp.<REDACTED>-27184
2023-05-17T02:23:28.027324Z info ads NDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:1 size:27.2kB
2023-05-17T02:23:28.023420Z info ads RDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:21 size:1.6MB cached:12/21
2023-05-17T02:23:28.001766Z info ads LDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:63 size:291.3kB
2023-05-17T02:23:27.996641Z info ads EDS: PUSH INC for node:<REDACTED>-rbzcp.<REDACTED> resources:2 size:155.6kB empty:0 cached:0/2
2023-05-17T02:23:27.167235Z info ads CDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:654 size:837.2kB cached:650/650
2023-05-17T02:18:49.324106Z info ads NDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:1 size:27.2kB
2023-05-17T02:18:49.320813Z info ads RDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:21 size:1.6MB cached:12/21
2023-05-17T02:18:49.297607Z info ads LDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:63 size:291.3kB
2023-05-17T02:18:49.291275Z info ads EDS: PUSH INC for node:<REDACTED>-rbzcp.<REDACTED> resources:1 size:143.5kB empty:0 cached:0/1
2023-05-17T02:18:48.444070Z info ads CDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:654 size:837.2kB cached:645/650
2023-05-17T02:16:20.119399Z info ads NDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:1 size:27.2kB
2023-05-17T02:16:20.115814Z info ads RDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:21 size:1.6MB cached:12/21
2023-05-17T02:16:20.091425Z info ads LDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:63 size:291.3kB
2023-05-17T02:16:20.084363Z info ads EDS: PUSH INC for node:<REDACTED>-rbzcp.<REDACTED> resources:1 size:190.7kB empty:0 cached:0/1
2023-05-17T02:16:19.217966Z info ads CDS: PUSH for node:<REDACTED>-rbzcp.<REDACTED> resources:654 size:837.2kB cached:650/650
2023-05-17T02:02:31.076854Z info ads ADS: "10.64.5.57:34598" <REDACTED>-rbzcp.<REDACTED>-27211 terminated
2023-05-17T02:02:31.207645Z info ads RDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:21 size:1.6MB cached:12/21
2023-05-17T02:02:31.181192Z info ads NDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:1 size:27.2kB
2023-05-17T02:02:31.180526Z info ads LDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:63 size:291.3kB
2023-05-17T02:02:31.171874Z info ads EDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:587 size:1.1MB empty:0 cached:587/587
2023-05-17T02:02:31.167626Z info ads CDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:654 size:837.2kB cached:629/650
2023-05-17T02:02:31.162313Z info ads ADS: new connection for node:<REDACTED>-rbzcp.<REDACTED>-27024
2023-05-17T01:31:21.527483Z info ads RDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:21 size:1.6MB cached:12/21
2023-05-17T01:31:21.482855Z info ads NDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:1 size:27.2kB
2023-05-17T01:31:21.482188Z info ads LDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:63 size:291.3kB
2023-05-17T01:31:21.473726Z info ads EDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:587 size:1.1MB empty:0 cached:587/587
2023-05-17T01:31:21.470405Z info ads CDS: PUSH request for node:<REDACTED>-rbzcp.<REDACTED> resources:654 size:837.2kB cached:624/650
2023-05-17T01:31:21.465776Z info ads ADS: new connection for node:<REDACTED>-rbzcp.<REDACTED>-27211
2023-05-17T01:31:21.368391Z info ads ADS: "10.64.5.57:33552" <REDACTED>-rbzcp.<REDACTED>-26893 terminated
Affected product area
- Ambient
- Docs
- Installation
- Networking
- Performance and Scalability
- Extensions and Telemetry
- Security
- Test and Release
- User Experience
- Developer Infrastructure
- Upgrade
- Multi Cluster
- Virtual Machine
- Control Plane Revisions
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 61 (47 by maintainers)
Commits related to this issue
- eds: simplify and optimize mTLS checker * Remove all caching. This is expensive and offers little benefit; the issue here is the mtlsChecker is built per endpointBuilder, which is per-proxy-cluster... — committed to howardjohn/istio by howardjohn a year ago
- eds: simplify and optimize mTLS checker (#45230) * eds: simplify and optimize mTLS checker * Remove all caching. This is expensive and offers little benefit; the issue here is the mtlsChecker is b... — committed to istio/istio by howardjohn a year ago
- pilot: autoscale concurrency and rate limit on xds This is an attempt to improve the defaults here. 25 for RL is decent, but for giant clusters we have seen folks report its too small. The bigge... — committed to howardjohn/istio by howardjohn 6 months ago
- pilot: autoscale concurrency and rate limit on xds This is an attempt to improve the defaults here. 25 for RL is decent, but for giant clusters we have seen folks report its too small. The bigge... — committed to howardjohn/istio by howardjohn 6 months ago
- pilot: autoscale concurrency and rate limit on xds This is an attempt to improve the defaults here. 25 for RL is decent, but for giant clusters we have seen folks report its too small. The bigge... — committed to howardjohn/istio by howardjohn 6 months ago
- pilot: autoscale concurrency and rate limit on xds This is an attempt to improve the defaults here. 25 for RL is decent, but for giant clusters we have seen folks report its too small. The bigge... — committed to howardjohn/istio by howardjohn 6 months ago
- pilot: autoscale concurrency and rate limit on xds (#48438) This is an attempt to improve the defaults here. 25 for RL is decent, but for giant clusters we have seen folks report its too small. ... — committed to istio/istio by howardjohn 6 months ago
Some other ideas: a queue of events like
[update endpoints to {a,b}, update endpoints to {a}]
can be collapsed down to a single event.

Sorry, the markdown was messed up. I was just commenting that I tried tuning GOMAXPROCS on my own cluster and saw no benefit.
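The collapsing idea above can be sketched simply: when only the latest endpoint state matters, a queued sequence of updates for the same resource reduces to its final entry before pushing. A minimal illustration:

```shell
# Two queued updates for the same endpoint set; intermediate states are
# obsolete, so the queue collapses to its last entry before the push.
printf '%s\n' 'update endpoints to {a,b}' 'update endpoints to {a}' | tail -n 1
```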