grpc: grpc servers hanging with a "connection attempt timed out before receiving SETTINGS frame" error
What version of gRPC and what language are you using?
1.62.1, Python
What operating system (Linux, Windows,…) and version?
Linux
What runtime / compiler are you using (e.g. python version or version of gcc)
Python 3.11
What did you do?
We maintain a project that is built on gRPC and involves running a gRPC server.
Over the last few months, several different users have reported an issue that always has the same commonalities:
- The gRPC server is still running, but mysteriously stops serving any requests. All requests start failing with the same error message (note the `connection attempt timed out before receiving SETTINGS frame` detail):

```
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:<redacted>: connection attempt timed out before receiving SETTINGS frame"
    debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-03-22T18:43:43.405427415+00:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:<redacted>: connection attempt timed out before receiving SETTINGS frame"}"
>
```
- The time it takes before timing out is fairly consistent: usually 20 seconds, though I’ve seen some cases where it was always 7 seconds. Whenever the problem is happening, the timeout is identical every time. (If this turns out to be a tunable client-side deadline, see the sketch just below this list.)
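To be concrete about what I mean by a tunable deadline: if the consistent figure is the client’s subchannel connect deadline, the knobs I would guess are relevant are the standard reconnect-backoff channel arguments. This is only an assumption on my part; the values and the target address below are illustrative, and I haven’t confirmed that these arguments control this particular timeout.

```python
import grpc

# Hedged sketch, not a confirmed fix: the reconnect-backoff channel args
# influence how long the client waits on a connection attempt. Values are
# in milliseconds and purely illustrative; "localhost:4001" is a placeholder.
channel = grpc.insecure_channel(
    "localhost:4001",
    options=[
        ("grpc.initial_reconnect_backoff_ms", 1000),
        ("grpc.min_reconnect_backoff_ms", 1000),
        ("grpc.max_reconnect_backoff_ms", 10000),
    ],
)
```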
This is different from the error message I’m used to seeing when a gRPC server is totally inaccessible or down; here the process is still running, and the threads still appear ready to serve requests when inspected via py-spy. The error I’m more accustomed to is `Failed to connect to remote host: Connection refused`:

```
<_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:4001: Failed to connect to remote host: Connection refused"
    debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:4001: Failed to connect to remote host: Connection refused", grpc_status:14, created_time:"2024-04-04T11:06:16.272108-05:00"}"
>
```
Unfortunately I do not have a simple or reliable repro for this, but I’m wondering if you have any recommendations for additional debugging flags we could add, or other information that would help get to the bottom of what might be going on here, or whether this error message clearly indicates that we are hitting a timeout with a value we could tune. Thanks in advance for any guidance you can provide.
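For what it’s worth, here is how I would plan to capture verbose gRPC traces on an affected host. `GRPC_VERBOSITY` and `GRPC_TRACE` come from the gRPC C-core environment-variable docs; the particular tracer names below are my guess at a useful starting set, not something the maintainers have recommended for this issue.

```python
import os

# These must be set before the grpc module is first imported (or exported
# in the shell that launches the process). The tracer list is illustrative.
os.environ["GRPC_VERBOSITY"] = "debug"
os.environ["GRPC_TRACE"] = "connectivity_state,http2_stream_state,tcp"

import grpc  # imported after the env vars so the C core picks them up
```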
What did you expect to see?
A running gRPC server that continues serving requests
What did you see instead?
A “hanging” gRPC server whose clients all fail with “connection attempt timed out before receiving SETTINGS frame”
Anything else we should know about your project / environment?
If it’s helpful context, the way we initialize our gRPC server can be found here: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/_grpc/server.py#L1184-L1192 - we create a ThreadPoolExecutor and pass it into a grpc.server object (a minimal sketch of the pattern is below).
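The sketch below shows only the shape of that initialization; the actual code lives in the linked server.py, and the worker count and port here are placeholders I picked for illustration.

```python
from concurrent import futures

import grpc

# Minimal sketch of the pattern described above: build a ThreadPoolExecutor
# and hand it to grpc.server(). max_workers and the port are illustrative,
# not the values our code actually uses.
server = grpc.server(futures.ThreadPoolExecutor(max_workers=16))
server.add_insecure_port("[::]:4001")
server.start()
server.wait_for_termination()
```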
About this issue
- State: closed
- Created 3 months ago
- Comments: 35 (4 by maintainers)
Commits related to this issue
- Add grpcio<1.60.0 pin Summary: While we don't have a conclusive answer to the sporadic reports of grpc server hangs, evidence is mounting to support a pin: - At least one user who was reliably hittin... — committed to dagster-io/dagster by gibsondan 3 months ago
OK, the issue I have been investigating so far appears to match https://github.com/googleapis/python-bigtable/issues/949, which will be fixed in the upcoming release of grpcio.
Any of the following mitigations should help: