grpc: gRPC channels blocking indefinitely and not respecting deadlines on network disconnect
What version of gRPC and what language are you using?
gRPC==1.12.1 python
What operating system (Linux, Windows, …) and version?
Linux Ubuntu “14.04.5 LTS, Trusty Tahr”
What runtime / compiler are you using (e.g. python version or version of gcc)?
gcc==4:4.8.2-1ubuntu6 python==2.7
What did you do?
I consistently reproduced a bug that causes gRPC channels to block indefinitely when sending to a destination that continuously times out.
I slightly modified the greeter_client.py from the python helloworld example: I added an infinite while loop, added a timeout of 0.3 to the SayHello() grpc call, and added exception handling to catch any DEADLINE_EXCEEDED errors.
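A minimal sketch of such a modified client, assuming the stock helloworld protos; the server address, message contents and print statements are placeholders reconstructed for illustration, not the original script:

import grpc
import helloworld_pb2
import helloworld_pb2_grpc

def run():
    # 'server-host' and port 6106 stand in for the real server address.
    channel = grpc.insecure_channel('server-host:6106')
    stub = helloworld_pb2_grpc.GreeterStub(channel)
    msg = 'a'  # raise to 'a' * 1000 to hit the blocking state faster
    while True:
        try:
            response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3)
            print('Greeter client received: ' + response.message)
        except grpc.RpcError as e:
            # Keep looping on deadline expiry; re-raise anything unexpected.
            if e.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
                print('Deadline exceeded, retrying')
            else:
                raise

if __name__ == '__main__':
    run()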
Running this script and an un-modified greeter_server.py (on a different host, listening on port 6106), I was able to observe the two communicating as expected.
To reproduce the bug in question, I add an iptables rule to the server side to drop all traffic arriving at the port in use:
/sbin/iptables -I INPUT 1 -s xx.xxx.xxx.xxx -j DROP
What did you expect to see?
Adding the iptables rule results in lots of DEADLINE_EXCEEDED (timeout) exceptions, as expected:
Traceback (most recent call last):
File "greeter_client.py", line 36, in run
response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3)
File "/root/.venvs/grpc-test/local/lib/python2.7/site-packages/grpc/_channel.py", line 500, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/root/.venvs/grpc-test/local/lib/python2.7/site-packages/grpc/_channel.py", line 434, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
_Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded)>
I expected to see the greeter_client.py continue iterating over the while loop, timing out in exactly 300ms on every call to response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3).
What did you see instead?
If you leave the client running with everything in this state, it eventually blocks indefinitely on the send, response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3), and never times out. This is an example of the bug we’ve been trying to figure out these past few weeks. After some digging we found that the blocking call is here: https://github.com/grpc/grpc/blob/master/src/python/grpcio/grpc/_channel.py#L508
Once the iptables rule is removed from the server side, the client un-blocks and continues to send.
Now what’s interesting is that if you raise the size of the msg variable here, to say… 'a' * 1000, the client breaks much faster (starts to block indefinitely).
On inspection we discovered that when we add the iptables rule to drop the traffic, the send-q for the socket in use by the grpc client continues to grow as the client continues to attempt to send and time out. The send-q grows because grpc is buffering the request for sending but never receives an ack, so the data sits there in the queue, piling up on every attempt, until the send-q seems to reach its maximum threshold, at which point the client exhibits the indefinite blocking behaviour. We believe the kernel refuses to add any more data to the send-q, causing grpc to block/spin deep within the C code.
For every reproduction of this, when sending msg='a', once the send-q hits 33425 the client blocks:
root@client-server ~ # netstat -natup | grep 6106
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp6 0 33425 xx.xxx.xxx.xxx:45048 xxx.xxx.xx.xxx:6106 ESTABLISHED 19834/python
For larger messages the breaking threshold rises slightly; for example, when sending msg='a'*1000, the send-q reaches 46278 every time before blocking:
root@client-server ~ # netstat -natup | grep 6106
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp6 0 46278 xx.xxx.xxx.xxx:36000 xxx.xxx.xx.xxx:6106 ESTABLISHED 27166/python
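For context on the threshold numbers above: on Linux the amount of un-ACKed data that can sit in the send-q is bounded by the socket’s send buffer, which is auto-tuned within the net.ipv4.tcp_wmem limits. A minimal, hedged way to see those limits (added here for context only, not part of the original report):

# Show the kernel's TCP send-buffer bounds (min / default / max, in bytes),
# which cap how much un-ACKed data can pile up in the send-q.
with open('/proc/sys/net/ipv4/tcp_wmem') as f:
    print('net.ipv4.tcp_wmem = ' + f.read().strip())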
Anything else we should know about your project / environment?
No
I did a bit more digging into what exactly happens on the kernel side (even if it’s not immediately helpful for addressing the grpc bug itself, it might help provide some context for other people experiencing a similar issue).
The client doesn’t actually hang indefinitely as originally reported, but it will take roughly 15-20 minutes to time out.
When connectivity is broken between server and client, the client “succeeds” every time it wants to send data (as long as there’s room in the send queue), so it times out normally when it doesn’t get a reply within the deadline. Once the send queue is full, the sendmsg will actually fail (it gets an EAGAIN) and, despite not receiving a response, the client will stay in this state until either the kernel succeeds in sending the data (so there’s again room in the send queue) or the connection gets killed (forcing grpc to create a new connection, so sends will start “succeeding” again).
The number of retries the kernel is willing to do before giving up on a connection is governed by net.ipv4.tcp_retries2 (which seems to default to 15). The timeout between retries is calculated dynamically and is based on the RTT of the connection, but we’re looking at around 15-20 minutes overall for all the retries. Once all the retries have been exhausted the kernel will close the connection.
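For anyone wanting to check this limit on their own hosts, a minimal way to read it in Python (equivalent to running sysctl net.ipv4.tcp_retries2; Linux-specific, and offered only as a convenience, not something from the original comment):

# Read the kernel's TCP retransmission limit discussed above (Linux-specific path).
with open('/proc/sys/net/ipv4/tcp_retries2') as f:
    print('net.ipv4.tcp_retries2 = ' + f.read().strip())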
This will be available in 1.16, which will be out in about 6 weeks (since 1.15 just came out). If you want to try it now you can use the nightly build pkgs from https://packages.grpc.io/.
We are planning on introducing TCP_USER_TIMEOUT to avoid issues where the tcp packets are not ACKed in reasonable time. That should solve all the edge conditions that we have seen with channels getting blocked indefinitely (and keepalive too).
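As a rough illustration of the direction described above, here is a hedged sketch of enabling keepalive on a Python channel. The grpc.keepalive_time_ms and grpc.keepalive_timeout_ms channel arguments exist today; whether and how they map onto TCP_USER_TIMEOUT depends on the gRPC version in use, so treat the values as placeholders:

import grpc

# Hedged sketch: keepalive pings give the client a bounded way to notice that
# traffic is no longer being ACKed, instead of waiting on kernel retransmits.
options = [
    ('grpc.keepalive_time_ms', 10000),    # send a keepalive ping every 10s
    ('grpc.keepalive_timeout_ms', 5000),  # fail the connection if the ping ack takes >5s
]
channel = grpc.insecure_channel('server-host:6106', options=options)  # placeholder address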
1.14 release candidates should be published shortly (this week).
@hcaseyal, repro with C++ client is here.