istio: CDS/EDC stuck at warming causing Envoy main thread CPU spike

Is this the right place to submit this?

  • This is not a security vulnerability or a crashing bug
  • This is not a question about how to use Istio

Bug Description

We saw the envoy main thread spike suddenly in all ingressgw pods for a specific environment at around 2023-12-17 12:30:00 Screenshot 2023-12-22 at 11 31 42

{node=“ingressgw nodes”, pid=“126500”, procname=“envoy”, threadname=“envoy”}

According to envoy cpu profiling, it was busy with cds/eds update: image

Meanwhile there was warming cluster spike: Screenshot 2023-12-22 at 11 36 07

The cpu spike happened across all igw pods within few mins after istiod ads timeout writing to igw pods:

2023-12-17 12:29:28.009	
2023-12-17T04:29:28.442568Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-2362
2023-12-17 12:28:41.007	
2023-12-17T04:28:41.383011Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-vwcxm.istio-staging-2250
2023-12-17 12:27:54.004	
2023-12-17T04:27:54.906386Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-dwsk2.istio-staging-2194
2023-12-17 12:26:19.008	
2023-12-17T04:26:19.458129Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-mzngc.istio-staging-1926
2023-12-17 12:26:11.008	
2023-12-17T04:26:11.003366Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-gplnk.istio-staging-2001
2023-12-17 12:25:32.007	
2023-12-17T04:25:32.501602Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-nfncg.istio-staging-1916
2023-12-17 12:24:33.002	
2023-12-17T04:24:33.944147Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-vwcxm.istio-staging-556
2023-12-17 12:24:33.002	
2023-12-17T04:24:33.643084Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-1814
2023-12-17 12:21:01.000	
2023-12-17T04:21:01.958495Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-mzngc.istio-staging-1693

And xds_proxy stuck at downstream [2125] send error: rpc error: code = Unavailable desc = transport is closing and never recovered: Screenshot 2023-12-22 at 11 40 00 Correlate istid, xds_proxy, envoy logs:


2023-12-17 11:52:52.001 | 2023-12-17T03:52:52.571932Z	info	ads	CDS: PUSH for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:2391 size:3.6MB cached:2383/2390 |  
-- | -- | --
  |   | 2023-12-17 11:52:52.001 | 2023-12-17T03:52:52.541471Z	info	ads	EDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:2220 size:1.6MB empty:184 cached:9/2220 |  
  |   | 2023-12-17 11:52:52.001 | 2023-12-17T03:52:52.356833Z	info	ads	ADS:EDS: FORCE RESPONSE istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-553 for warming. |  
  |   | 2023-12-17 11:52:52.001 | 2023-12-17T03:52:52.355977Z	info	ads	CDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:2391 size:3.6MB cached:2383/2390 |  
  |   | 2023-12-17 11:52:52.001 | 2023-12-17T03:52:52.340069Z	info	ads	RDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:1409 size:2.3MB cached:0/0 |  
  |   | 2023-12-17 11:52:52.001 | 2023-12-17T03:52:52.247301Z	info	ads	SDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:1140 size:10.0MB cached:90/1148 |  
  |   | 2023-12-17 11:52:17.000 | 2023-12-17T03:52:17.641377Z	info	ads	LDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:1209 size:9.5MB |  
  |   | 2023-12-17 11:52:17.000 | 2023-12-17T03:52:17.161653Z	warn	buildGatewayListeners: skipping privileged gateway port 22 for node istio-ingressgateway-79cb4b564b-zpwp4.istio-staging as it is an unprivileged pod |  
  |   | 2023-12-17 11:52:17.000 | 2023-12-17T03:52:17.160563Z	info	ads	EDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:2220 size:1.6MB empty:184 cached:9/2220 |  
  |   | 2023-12-17 11:52:16.005 | 2023-12-17T03:52:16.919875Z	info	xdsproxy	connected to upstream XDS server: istiod-staging-117.istio-staging.svc.130.tess.io:15012 |  
  |   | 2023-12-17 11:52:16.000 | 2023-12-17T03:52:16.980845Z	info	ads	ADS: new connection for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-553 |  
  |   | 2023-12-17 11:52:12.005 | 2023-12-17T03:52:12.490649Z	warning	envoy config external/envoy/source/common/config/grpc_stream.h:163	StreamAggregatedResources gRPC config stream to xds-grpc closed: 13, 	thread=26 |  
  |   | 2023-12-17 11:50:55.005 | 2023-12-17T03:50:55.351713Z	error	xdsproxy	downstream [2112] dropped xds push to Envoy, connection already closed |  
  |   | 2023-12-17 11:50:55.005 | 2023-12-17T03:50:55.351626Z	error	xdsproxy	downstream [2112] send error: rpc error: code = Unavailable desc = transport is closing |  
  |   | 2023-12-17 11:47:17.005 | 2023-12-17T03:47:17.076961Z	info	xdsproxy	connected to upstream XDS server: istiod-staging-117.istio-staging.svc.130.tess.io:15012 |  
  |   | 2023-12-17 11:47:12.005 | 2023-12-17T03:47:12.581152Z	warning	envoy config external/envoy/source/common/config/grpc_stream.h:163	StreamAggregatedResources gRPC config stream to xds-grpc closed: 13, 	thread=26 |  
  |   | 2023-12-17 11:46:32.005 | 2023-12-17T03:46:32.518902Z	error	xdsproxy	downstream [2111] dropped xds push to Envoy, connection already closed |  
  |   | 2023-12-17 11:46:32.005 | 2023-12-17T03:46:32.518836Z	error	xdsproxy	downstream [2111] send error: rpc error: code = Unavailable desc = transport is closing |  
  |   | 2023-12-17 11:46:32.003 | 2023-12-17T03:46:32.519964Z	info	ads	Client closed connection istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-411 |  
  |   | 2023-12-17 11:46:32.003 | 2023-12-17T03:46:32.519862Z	info	ads	ADS: "10.0.0.1:35190" istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-411 terminated |  
  |   | 2023-12-17 11:46:32.003 | 2023-12-17T03:46:32.519861Z	warn	ads	LDS: Send failure for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:1209 size:9.5MB: rpc error: code = DeadlineExceeded desc = timeout sending |  
  |   | 2023-12-17 11:46:32.003 | 2023-12-17T03:46:32.519806Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-411 |  
  |   | 2023-12-17 11:46:32.003 | 2023-12-17T03:46:32.519806Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-411 |  
  |   | 2023-12-17 11:46:17.002 | 2023-12-17T03:46:17.778559Z	warn	buildGatewayListeners: skipping privileged gateway port 22 for node istio-ingressgateway-79cb4b564b-zpwp4.istio-staging as it is an unprivileged pod |  
  |   | 2023-12-17 11:46:17.002 | 2023-12-17T03:46:17.776790Z	info	ads	CDS: PUSH for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:2391 size:3.6MB cached:2390/2390 |  
  |   | 2023-12-17 11:46:17.002 | 2023-12-17T03:46:17.747734Z	info	ads	RDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:1409 size:2.3MB cached:0/0 |  
  |   | 2023-12-17 11:46:17.002 | 2023-12-17T03:46:17.647653Z	info	ads	SDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:1140 size:10.0MB cached:49/1148 |  
  |   | 2023-12-17 11:45:53.001 | 2023-12-17T03:45:53.190825Z	info	ads	LDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:1209 size:9.5MB |  
  |   | 2023-12-17 11:45:52.005 | 2023-12-17T03:45:52.513734Z	info	xdsproxy	connected to upstream XDS server: istiod-staging-117.istio-staging.svc.130.tess.io:15012 |  
  |   | 2023-12-17 11:45:52.005 | 2023-12-17T03:45:52.365066Z	warning	envoy config external/envoy/source/common/config/grpc_stream.h:163	StreamAggregatedResources gRPC config stream to xds-grpc closed: 2, upstream [2110] send error for type url type.googleapis.com/envoy.config.cluster.v3.Cluster: EOF	thread=26 |  
  |   | 2023-12-17 11:45:52.005 | 2023-12-17T03:45:52.364619Z	warn	xdsproxy	upstream [2110] terminated with unexpected error upstream [2110] send error for type url type.googleapis.com/envoy.config.cluster.v3.Cluster: EOF |  
  |   | 2023-12-17 11:45:52.001 | 2023-12-17T03:45:52.855763Z	warn	buildGatewayListeners: skipping privileged gateway port 22 for node istio-ingressgateway-79cb4b564b-zpwp4.istio-staging as it is an unprivileged pod |  
  |   | 2023-12-17 11:45:52.001 | 2023-12-17T03:45:52.853719Z	info	ads	EDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:2220 size:1.6MB empty:184 cached:8/2220 |  
  |   | 2023-12-17 11:45:52.001 | 2023-12-17T03:45:52.614864Z	info	ads	CDS: PUSH request for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:2391 size:3.6MB cached:2380/2390 |  
  |   | 2023-12-17 11:45:52.001 | 2023-12-17T03:45:52.590630Z	info	ads	ADS: new connection for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-411

Here is the sequence:

  1. There was a CDS push:
2023-12-17 11:46:17.002 | 2023-12-17T03:46:17.776790Z	info	ads	CDS: PUSH for node:istio-ingressgateway-79cb4b564b-zpwp4.istio-staging resources:2391 size:3.6MB cached:2390/2390
  1. Istiod ads timed out since we configured PILOT_XDS_SEND_TIMEOUT 20s (we plan to remove since this was a legacy configure from 1.10) and this timed out error never recovered:
2023-12-17 11:46:32.003 | 2023-12-17T03:46:32.519806Z	info	ads	Timeout writing istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-411
  1. xds_proxy got transport close error, at this point the istiod to xds_proxy GRPC connection seems good but xds_proxy to envoy downstream GRPC connection is broken:
2023-12-17 11:46:32.005 | 2023-12-17T03:46:32.518836Z	error	xdsproxy	downstream [2111] send error: rpc error: code = Unavailable desc = transport is closing

5.Istiod keeps EDS pushing to complete CDS warming but never succeeds:

2023-12-17 11:52:52.001 | 2023-12-17T03:52:52.356833Z	info	ads	ADS:EDS: FORCE RESPONSE istio-ingressgateway-79cb4b564b-zpwp4.istio-staging-553 for warming.

The warming cluster/main thread spike didn’t recover until kill one istiod pod to force ingressgw reconnect and full push, it seems to be similar issue as https://github.com/istio/istio/issues/31943, remove the PILOT_XDS_SEND_TIMEOUT and leverage http2 keepalive might help, but there seems to be a bug that clusters stuck at warming, not sure it’s in envoy or istiod/xds_proxy.

The thread spike will cause control plane downgrade and eventually cause data plane, if there is code rollout in workload deploy, it might result in 503 UH or time out since the new endpoints won’t be updated to the ingressgw upstream clusters

Expected behavior: Clusters should not stuck at warming and EDS push should succeed to make clusters active.

Version

istio 1.17
k8s 1.23

Additional Information

No response

About this issue

  • Original URL
  • State: open
  • Created 6 months ago
  • Comments: 20 (19 by maintainers)

Most upvoted comments

istioctl pc log can change it inflight