istio: CDS/EDC stuck at warming causing Envoy main thread CPU spike
Is this the right place to submit this?
- This is not a security vulnerability or a crashing bug
- This is not a question about how to use Istio
Bug Description
We saw the envoy main thread spike suddenly in all ingressgw pods for a specific environment at around 2023-12-17 12:30:00
{node=“ingressgw nodes”, pid=“126500”, procname=“envoy”, threadname=“envoy”}
According to envoy cpu profiling, it was busy with cds/eds update:
Meanwhile there was warming cluster spike:
The cpu spike happened across all igw pods within few mins after istiod ads timeout writing to igw pods:
2023-12-17 12:29:28.009
2023-12-17T04:29:28.442568Z info ads Timeout writing istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-2362
2023-12-17 12:28:41.007
2023-12-17T04:28:41.383011Z info ads Timeout writing istio-ingressgateway-79cb4b564b-vwcxm.istio-staging-2250
2023-12-17 12:27:54.004
2023-12-17T04:27:54.906386Z info ads Timeout writing istio-ingressgateway-79cb4b564b-dwsk2.istio-staging-2194
2023-12-17 12:26:19.008
2023-12-17T04:26:19.458129Z info ads Timeout writing istio-ingressgateway-79cb4b564b-mzngc.istio-staging-1926
2023-12-17 12:26:11.008
2023-12-17T04:26:11.003366Z info ads Timeout writing istio-ingressgateway-79cb4b564b-gplnk.istio-staging-2001
2023-12-17 12:25:32.007
2023-12-17T04:25:32.501602Z info ads Timeout writing istio-ingressgateway-79cb4b564b-nfncg.istio-staging-1916
2023-12-17 12:24:33.002
2023-12-17T04:24:33.944147Z info ads Timeout writing istio-ingressgateway-79cb4b564b-vwcxm.istio-staging-556
2023-12-17 12:24:33.002
2023-12-17T04:24:33.643084Z info ads Timeout writing istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-1814
2023-12-17 12:21:01.000
2023-12-17T04:21:01.958495Z info ads Timeout writing istio-ingressgateway-79cb4b564b-mzngc.istio-staging-1693
And xds_proxy stuck at downstream [2125] send error: rpc error: code = Unavailable desc = transport is closing
and never recovered:
Correlate istid, xds_proxy, envoy logs:
2023-12-17 11:52:52.001 | 2023-12-17T03:52:52.571932Z info ads CDS: PUSH for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:2391 size:3.6MB cached:2383/2390 |
-- | -- | --
| | 2023-12-17 11:52:52.001 | 2023-12-17T03:52:52.541471Z info ads EDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:2220 size:1.6MB empty:184 cached:9/2220 |
| | 2023-12-17 11:52:52.001 | 2023-12-17T03:52:52.356833Z info ads ADS:EDS: FORCE RESPONSE istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-553 for warming. |
| | 2023-12-17 11:52:52.001 | 2023-12-17T03:52:52.355977Z info ads CDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:2391 size:3.6MB cached:2383/2390 |
| | 2023-12-17 11:52:52.001 | 2023-12-17T03:52:52.340069Z info ads RDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:1409 size:2.3MB cached:0/0 |
| | 2023-12-17 11:52:52.001 | 2023-12-17T03:52:52.247301Z info ads SDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:1140 size:10.0MB cached:90/1148 |
| | 2023-12-17 11:52:17.000 | 2023-12-17T03:52:17.641377Z info ads LDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:1209 size:9.5MB |
| | 2023-12-17 11:52:17.000 | 2023-12-17T03:52:17.161653Z warn buildGatewayListeners: skipping privileged gateway port 22 for node istio-ingressgateway-79cb4b564b-zpwp4.istio-staging as it is an unprivileged pod |
| | 2023-12-17 11:52:17.000 | 2023-12-17T03:52:17.160563Z info ads EDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:2220 size:1.6MB empty:184 cached:9/2220 |
| | 2023-12-17 11:52:16.005 | 2023-12-17T03:52:16.919875Z info xdsproxy connected to upstream XDS server: istiod-staging-117.istio-staging.svc.130.tess.io:15012 |
| | 2023-12-17 11:52:16.000 | 2023-12-17T03:52:16.980845Z info ads ADS: new connection for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-553 |
| | 2023-12-17 11:52:12.005 | 2023-12-17T03:52:12.490649Z warning envoy config external/envoy/source/common/config/grpc_stream.h:163 StreamAggregatedResources gRPC config stream to xds-grpc closed: 13, thread=26 |
| | 2023-12-17 11:50:55.005 | 2023-12-17T03:50:55.351713Z error xdsproxy downstream [2112] dropped xds push to Envoy, connection already closed |
| | 2023-12-17 11:50:55.005 | 2023-12-17T03:50:55.351626Z error xdsproxy downstream [2112] send error: rpc error: code = Unavailable desc = transport is closing |
| | 2023-12-17 11:47:17.005 | 2023-12-17T03:47:17.076961Z info xdsproxy connected to upstream XDS server: istiod-staging-117.istio-staging.svc.130.tess.io:15012 |
| | 2023-12-17 11:47:12.005 | 2023-12-17T03:47:12.581152Z warning envoy config external/envoy/source/common/config/grpc_stream.h:163 StreamAggregatedResources gRPC config stream to xds-grpc closed: 13, thread=26 |
| | 2023-12-17 11:46:32.005 | 2023-12-17T03:46:32.518902Z error xdsproxy downstream [2111] dropped xds push to Envoy, connection already closed |
| | 2023-12-17 11:46:32.005 | 2023-12-17T03:46:32.518836Z error xdsproxy downstream [2111] send error: rpc error: code = Unavailable desc = transport is closing |
| | 2023-12-17 11:46:32.003 | 2023-12-17T03:46:32.519964Z info ads Client closed connection istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-411 |
| | 2023-12-17 11:46:32.003 | 2023-12-17T03:46:32.519862Z info ads ADS: "10.0.0.1:35190" istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-411 terminated |
| | 2023-12-17 11:46:32.003 | 2023-12-17T03:46:32.519861Z warn ads LDS: Send failure for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:1209 size:9.5MB: rpc error: code = DeadlineExceeded desc = timeout sending |
| | 2023-12-17 11:46:32.003 | 2023-12-17T03:46:32.519806Z info ads Timeout writing istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-411 |
| | 2023-12-17 11:46:32.003 | 2023-12-17T03:46:32.519806Z info ads Timeout writing istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-411 |
| | 2023-12-17 11:46:17.002 | 2023-12-17T03:46:17.778559Z warn buildGatewayListeners: skipping privileged gateway port 22 for node istio-ingressgateway-79cb4b564b-zpwp4.istio-staging as it is an unprivileged pod |
| | 2023-12-17 11:46:17.002 | 2023-12-17T03:46:17.776790Z info ads CDS: PUSH for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:2391 size:3.6MB cached:2390/2390 |
| | 2023-12-17 11:46:17.002 | 2023-12-17T03:46:17.747734Z info ads RDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:1409 size:2.3MB cached:0/0 |
| | 2023-12-17 11:46:17.002 | 2023-12-17T03:46:17.647653Z info ads SDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:1140 size:10.0MB cached:49/1148 |
| | 2023-12-17 11:45:53.001 | 2023-12-17T03:45:53.190825Z info ads LDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:1209 size:9.5MB |
| | 2023-12-17 11:45:52.005 | 2023-12-17T03:45:52.513734Z info xdsproxy connected to upstream XDS server: istiod-staging-117.istio-staging.svc.130.tess.io:15012 |
| | 2023-12-17 11:45:52.005 | 2023-12-17T03:45:52.365066Z warning envoy config external/envoy/source/common/config/grpc_stream.h:163 StreamAggregatedResources gRPC config stream to xds-grpc closed: 2, upstream [2110] send error for type url type.googleapis.com/envoy.config.cluster.v3.Cluster: EOF thread=26 |
| | 2023-12-17 11:45:52.005 | 2023-12-17T03:45:52.364619Z warn xdsproxy upstream [2110] terminated with unexpected error upstream [2110] send error for type url type.googleapis.com/envoy.config.cluster.v3.Cluster: EOF |
| | 2023-12-17 11:45:52.001 | 2023-12-17T03:45:52.855763Z warn buildGatewayListeners: skipping privileged gateway port 22 for node istio-ingressgateway-79cb4b564b-zpwp4.istio-staging as it is an unprivileged pod |
| | 2023-12-17 11:45:52.001 | 2023-12-17T03:45:52.853719Z info ads EDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:2220 size:1.6MB empty:184 cached:8/2220 |
| | 2023-12-17 11:45:52.001 | 2023-12-17T03:45:52.614864Z info ads CDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:2391 size:3.6MB cached:2380/2390 |
| | 2023-12-17 11:45:52.001 | 2023-12-17T03:45:52.590630Z info ads ADS: new connection for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-411
Here is the sequence:
- There was a CDS push:
2023-12-17 11:46:17.002 | 2023-12-17T03:46:17.776790Z info ads CDS: PUSH for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:2391 size:3.6MB cached:2390/2390
- Istiod ads timed out since we configured
PILOT_XDS_SEND_TIMEOUT
20s (we plan to remove since this was a legacy configure from 1.10) and this timed out error never recovered:
2023-12-17 11:46:32.003 | 2023-12-17T03:46:32.519806Z info ads Timeout writing istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-411
- xds_proxy got transport close error, at this point the istiod to xds_proxy GRPC connection seems good but xds_proxy to envoy downstream GRPC connection is broken:
2023-12-17 11:46:32.005 | 2023-12-17T03:46:32.518836Z error xdsproxy downstream [2111] send error: rpc error: code = Unavailable desc = transport is closing
5.Istiod keeps EDS pushing to complete CDS warming but never succeeds:
2023-12-17 11:52:52.001 | 2023-12-17T03:52:52.356833Z info ads ADS:EDS: FORCE RESPONSE istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-553 for warming.
The warming cluster/main thread spike didn’t recover until kill one istiod pod to force ingressgw reconnect and full push, it seems to be similar issue as https://github.com/istio/istio/issues/31943, remove the PILOT_XDS_SEND_TIMEOUT and leverage http2 keepalive might help, but there seems to be a bug that clusters stuck at warming, not sure it’s in envoy or istiod/xds_proxy.
The thread spike will cause control plane downgrade and eventually cause data plane, if there is code rollout in workload deploy, it might result in 503 UH or time out since the new endpoints won’t be updated to the ingressgw upstream clusters
Expected behavior: Clusters should not stuck at warming and EDS push should succeed to make clusters active.
Version
istio 1.17
k8s 1.23
Additional Information
No response
About this issue
- Original URL
- State: open
- Created 6 months ago
- Comments: 20 (19 by maintainers)
istioctl pc log
can change it inflight