grpc: gRPC channels blocking indefinitely and not respecting deadlines on network disconnect

What version of gRPC and what language are you using?

gRPC==1.12.1 python

What operating system (Linux, Windows, …) and version?

Linux, Ubuntu 14.04.5 LTS (Trusty Tahr)

What runtime / compiler are you using (e.g. Python version or version of gcc)?

gcc==4:4.8.2-1ubuntu6 python==2.7

What did you do?

I consistently reproduced a bug that causes gRPC channels to block indefinitely when sending to a destination whose RPCs continuously time out.

I slightly modified greeter_client.py from the Python helloworld example: I wrapped the RPC in an infinite while loop, added a timeout of 0.3 seconds to the SayHello() gRPC call, and added exception handling to catch any DEADLINE_EXCEEDED errors.

Modified greeter_client.py
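The full script attached to the issue isn't reproduced here, but a minimal sketch of the modified client, based on the description above, looks roughly like this (the server address, the sleep between iterations, and the print statements are assumptions):

import time

import grpc

import helloworld_pb2
import helloworld_pb2_grpc


def run():
    # The unmodified greeter_server.py runs on another host, listening on port 6106.
    channel = grpc.insecure_channel('xx.xxx.xxx.xxx:6106')
    stub = helloworld_pb2_grpc.GreeterStub(channel)
    msg = 'a'  # raising this to 'a' * 1000 makes the blocking state appear faster
    while True:
        try:
            response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3)
            print('Greeter client received: ' + response.message)
        except grpc.RpcError as e:
            # Expected while the server drops our traffic: StatusCode.DEADLINE_EXCEEDED
            print('RPC failed: %s' % e.code())
        time.sleep(0.1)


if __name__ == '__main__':
    run()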

Running this script against an unmodified greeter_server.py (on a different host, listening on port 6106), I was able to observe the two communicating as expected.

To reproduce the bug in question, I add an iptables rule on the server side to drop all traffic arriving from the client: /sbin/iptables -I INPUT 1 -s xx.xxx.xxx.xxx -j DROP

What did you expect to see?

Adding the iptables rule results in lots of DEADLINE_EXCEEDED (timeout) exceptions as expected.

Traceback (most recent call last):
  File "greeter_client.py", line 36, in run
    response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3)
  File "/root/.venvs/grpc-test/local/lib/python2.7/site-packages/grpc/_channel.py", line 500, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/root/.venvs/grpc-test/local/lib/python2.7/site-packages/grpc/_channel.py", line 434, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
_Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded)>

I expected to see the greeter_client.py continue iterating over the while loop, timing out in exactly 300 ms on every call to response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3).

What did you see instead?

If you leave the client running with everything in this state, it eventually blocks indefinitely on the call response = stub.SayHello(helloworld_pb2.HelloRequest(name=msg), timeout=0.3) and never times out. This is the bug we've been trying to track down for the past few weeks. After some digging we found that the blocking call is here: https://github.com/grpc/grpc/blob/master/src/python/grpcio/grpc/_channel.py#L508

Once the iptables rule is removed from the server side, the client un-blocks and continues to send.
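(For reference, the matching delete is /sbin/iptables -D INPUT -s xx.xxx.xxx.xxx -j DROP, or /sbin/iptables -D INPUT 1 if the DROP rule is still the first rule in the INPUT chain.)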

Now, what's interesting is that if you increase the size of the msg variable, to say 'a' * 1000, the client breaks much faster (starts to block indefinitely).

On inspection we discovered that when we add the iptables rule to drop the traffic, the send-q for the socket in use by the gRPC client continues to grow as the client keeps attempting to send and timing out. The send-q grows because gRPC keeps handing requests to the kernel for sending but nothing is ever ACKed, so the data sits in the queue, piling up on every attempt, until the send-q reaches its maximum threshold, at which point the client exhibits the indefinite blocking behaviour. We believe the kernel refuses to add any more data to the send-q, causing gRPC to block/spin deep within the C code.

For every reproduction of this, when sending msg='a', the client blocks once the send-q hits 33425 bytes:

root@client-server ~ # netstat -natup | grep 6106
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0  33425 xx.xxx.xxx.xxx:45048    xxx.xxx.xx.xxx:6106     ESTABLISHED 19834/python

For larger messages the breaking threshold rises slightly; for example, when sending msg='a'*1000, the send-q reaches 46278 bytes every time before blocking:

root@client-server ~ # netstat -natup | grep 6106
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0  46278 xx.xxx.xxx.xxx:36000    xxx.xxx.xx.xxx:6106     ESTABLISHED 27166/python
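(The two snapshots above are single netstat runs; the growth is easier to see by refreshing continuously with something like watch -n 1 'netstat -natup | grep 6106', which shows the send-q climbing on every failed attempt until it plateaus at the blocking threshold.)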

Anything else we should know about your project / environment?

No

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 1
  • Comments: 34 (21 by maintainers)

Most upvoted comments

I did a bit more digging into what exactly happens on the kernel side (even if it’s not immediately helpful for addressing the grpc bug itself, but might help provide some context for other people experiencing a similar issue).

The client doesn't actually hang indefinitely as originally reported, but it takes roughly 15-20 minutes to time out.

When connectivity between server and client is broken, the client "succeeds" every time it sends data (as long as there's room in the send queue), so it times out normally when it doesn't get a reply within the deadline. Once the send queue is full, sendmsg will actually fail (it gets EAGAIN) and, despite not receiving a response, the client will stay in this state until either the kernel succeeds in sending the data (so there's room in the send queue again) or the connection gets killed (forcing gRPC to create a new connection, so sends start "succeeding" again).

The number of retries the kernel is willing to do before giving up on a connection is governed by net.ipv4.tcp_retries2 (which seems to default to 15). The timeout between retries is calculated dynamically and is based on the RTT of the connection, but we're looking at around 15-20 minutes overall for all the retries. Once all the retries have been exhausted, the kernel will close the connection.
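For anyone who wants to check or shorten this window on their own hosts (whether that is appropriate depends on your environment; it is not something suggested in the issue itself), the current value can be read with sysctl net.ipv4.tcp_retries2 and lowered with, for example, sysctl -w net.ipv4.tcp_retries2=8 so the kernel gives up on the dead connection sooner.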

This will be available in 1.16, which will be out in about 6 weeks (since 1.15 just came out). If you want to try it now, you can use the nightly build packages from https://packages.grpc.io/.

We are planning on introducing TCP_USER_TIMEOUT to avoid issues where TCP packets are not ACKed in a reasonable time. That should solve all of the edge conditions we have seen with channels getting blocked indefinitely (and with keepalive too).
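For completeness, a minimal sketch of what configuring client-side keepalive looks like in Python, assuming a gRPC version new enough to honour these channel arguments; the address and the specific values are placeholders, not recommended defaults:

import grpc

# Keepalive channel arguments; in releases that implement the change described
# above, the keepalive timeout is also used to derive TCP_USER_TIMEOUT on the
# underlying socket, so a dead peer is detected in seconds rather than minutes.
channel = grpc.insecure_channel(
    'xx.xxx.xxx.xxx:6106',
    options=[
        ('grpc.keepalive_time_ms', 10000),           # ping the server every 10s
        ('grpc.keepalive_timeout_ms', 5000),         # fail if the ping is not acked within 5s
        ('grpc.keepalive_permit_without_calls', 1),  # keep pinging even with no active RPCs
        ('grpc.http2.max_pings_without_data', 0),    # allow pings without outstanding data
    ])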

1.14 release candidates should be published shortly (this week).

@hcaseyal, a repro with a C++ client is here.