grpc: grpc servers hanging with a "connection attempt timed out before receiving SETTINGS frame" error

What version of gRPC and what language are you using?

1.62.1, python

What operating system (Linux, Windows,…) and version?

Linux

What runtime / compiler are you using (e.g. python version or version of gcc)?

python 3.11

What did you do?

We run a project that is built on grpc and involves running a grpc server.

In the last few months, several different users have reported an issue that always has the same commonalities:

  • The grpc server is still running, but mysteriously stops serving any requests. All requests start failing with the following identical error message (note connection attempt timed out before receiving SETTINGS frame):
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:<redacted>: connection attempt timed out before receiving SETTINGS frame"
	debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-03-22T18:43:43.405427415+00:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:<redacted>: connection attempt timed out before receiving SETTINGS frame"}"
  • The time before the timeout is fairly consistently 20 seconds. I’ve also seen cases where it was always 7 seconds, but whenever it happens, the timeout is identical every time.

This is different from the error message I’m used to seeing when a gRPC server is totally inaccessible or down. In the hanging case, the process is still running and the threads still appear to be ready to serve requests when inspected via py-spy. I’m more accustomed to an error message like this (Failed to connect to remote host: Connection refused):

<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:4001: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:4001: Failed to connect to remote host: Connection refused", grpc_status:14, created_time:"2024-04-04T11:06:16.272108-05:00"}"
>
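The two failure modes above can also be told apart from outside the gRPC stack. The following is a minimal sketch, assuming the server listens on a plaintext (non-TLS) port: it opens a TCP connection, sends the HTTP/2 client connection preface followed by an empty SETTINGS frame, and waits for the first frame header from the server. The function name `probe_http2` and its return labels are hypothetical, and reading "before receiving SETTINGS frame" as "TCP connection accepted, HTTP/2 handshake never answered" is an interpretation of the error text, not something confirmed in this thread.

```python
import socket

# HTTP/2 client connection preface, followed by an empty SETTINGS frame
# (length 0, type 0x04, flags 0, stream id 0).
HTTP2_PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"
EMPTY_SETTINGS = b"\x00\x00\x00\x04\x00\x00\x00\x00\x00"
SETTINGS_FRAME_TYPE = 0x04
FRAME_HEADER_LEN = 9


def probe_http2(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify how a plaintext HTTP/2 endpoint responds (hypothetical helper).

    Returns one of: "refused", "connect_failed", "no_settings",
    "closed", "settings", "unexpected".
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"  # nothing is listening on the port
    except OSError:
        return "connect_failed"  # unreachable, connect timeout, etc.

    with sock:
        sock.settimeout(timeout)
        try:
            sock.sendall(HTTP2_PREFACE + EMPTY_SETTINGS)
            header = b""
            while len(header) < FRAME_HEADER_LEN:
                chunk = sock.recv(FRAME_HEADER_LEN - len(header))
                if not chunk:
                    return "closed"  # peer closed before sending a full frame header
                header += chunk
        except socket.timeout:
            # TCP connect succeeded, but no frame arrived in time.
            return "no_settings"
        except OSError:
            return "closed"

    # Byte 3 of the 9-byte frame header is the frame type.
    return "settings" if header[3] == SETTINGS_FRAME_TYPE else "unexpected"
```

If this probe reports `"no_settings"` while the server is hanging, the listening socket is still accepting connections but nothing is answering the HTTP/2 handshake, which would be consistent with the fixed-length timeout described above.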

Unfortunately I do not have a simple or reliable repro for this, but I’m wondering if you all have any recommendations for additional debugging flags we could add, or more information that would be helpful to get to the bottom of what might be going on here. Alternatively, does this error message clearly indicate that we are hitting some timeout with a value that we could tune? Thanks in advance for any guidance you can provide.
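On the question of debugging flags: gRPC core reads the `GRPC_VERBOSITY` and `GRPC_TRACE` environment variables, which need to be set before the process initializes gRPC. A minimal sketch of building such an environment for a child process follows. The helper name `grpc_debug_env` is hypothetical, and the choice of the `http` and `tcp` tracers is an assumption; the available tracer names depend on the gRPC version.

```python
import os


def grpc_debug_env(tracers=("http", "tcp"), base_env=None):
    """Return a copy of the environment with gRPC core debug logging enabled.

    `tracers` is an assumed selection; check the tracer list for the
    gRPC version in use.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GRPC_VERBOSITY"] = "DEBUG"
    env["GRPC_TRACE"] = ",".join(tracers)
    return env
```

The result can be passed as `env=grpc_debug_env()` to `subprocess.Popen` when launching the server process, so the variables are in place before `grpc` is imported.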

What did you expect to see?

A running grpc server

What did you see instead?

A “hanging” grpc server that returns “connection attempt timed out before receiving SETTINGS frame”

Anything else we should know about your project / environment?

If it’s helpful context, the way we initialize our grpc server can be found here: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_grpc/server.py#L1184-L1192

  • We create a ThreadPoolExecutor and pass it into a grpc.server object.

About this issue

  • State: closed
  • Created 3 months ago
  • Comments: 35 (4 by maintainers)

Most upvoted comments

Ok, the issue I have been investigating so far appears to match https://github.com/googleapis/python-bigtable/issues/949, which will be fixed in the upcoming release of grpcio.

Any of the following mitigations help:

  • upgrade grpcio to 1.62.2 or above
  • downgrade google-api-core to 2.16.2 or below
  • downgrade grpcio to 1.58.0 or below
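
To see which of these mitigations an environment already satisfies, the version bounds from the list above can be checked against the installed packages. This is a minimal sketch: `parse_version` and `satisfied_mitigations` are hypothetical helper names, the parser only looks at the leading numeric release components (so a pre-release such as `1.62.2rc1` is treated like `1.62.2`), and the bounds are copied from the comment above rather than verified independently.

```python
import re
from importlib import metadata


def parse_version(text):
    """Return the leading numeric release components as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def satisfied_mitigations(grpcio, api_core=None):
    """List which of the three mitigations the given versions already meet."""
    satisfied = []
    grpcio_version = parse_version(grpcio)
    if grpcio_version >= (1, 62, 2):
        satisfied.append("grpcio >= 1.62.2")
    if grpcio_version <= (1, 58, 0):
        satisfied.append("grpcio <= 1.58.0")
    # google-api-core may not be installed at all; skip that check if so.
    if api_core is not None and parse_version(api_core) <= (2, 16, 2):
        satisfied.append("google-api-core <= 2.16.2")
    return satisfied


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    grpcio = installed_version("grpcio")
    if grpcio is None:
        print("grpcio is not installed")
    else:
        print(satisfied_mitigations(grpcio, installed_version("google-api-core")))
```

An empty list means none of the three mitigations is in effect for the versions given, for example with grpcio 1.62.1 (the version in this report) and google-api-core 2.17.0.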