grpc: Python client hangs on first connection, on Ubuntu Server 18 and Ubuntu Server 20 but not on Ubuntu Server 16.

What version of gRPC and what language are you using?

gRPC: v1.31.0, Python

What operating system (Linux, Windows,…) and version?

Ubuntu Server 16.04, Ubuntu Server 18.04, Ubuntu Server 20.04, Windows 10 Enterprise

What runtime / compiler are you using (e.g. python version or version of gcc)

Ubuntu Server 16.04: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609 - Python 3.5.2
Ubuntu Server 18.04: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 - Python 3.6.9
Ubuntu Server 20.04: gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0 - Python 3.8.2
Windows 10 Enterprise: N/A - Python 3.8.5

What did you do?

I ran the gRPC Python Quick Start tutorial on four separate virtual machines: Ubuntu Server 16.04, 18.04, 20.04, and Windows 10 Enterprise.

What did you expect to see?

I expect a simple hello world client-server example to execute in less than ~1 second.

What did you see instead?

For Ubuntu Server 16 and Windows 10 Enterprise this is the case. However, for Ubuntu Server 18 and 20 the simple grpc/examples/python/helloworld/greeter_client.py takes around 45 seconds to respond.

Given the nature of this problem (no error is encountered and hence no error trace), there is not much in the way of output. The program completes successfully, but takes an unreasonably long time to do so. If you introduce an error via a KeyboardInterrupt while the client is hanging, you will receive:

  File "greeter_client.py", line 37, in <module>
    run()
  File "greeter_client.py", line 31, in run
    response = stub.SayHello(helloworld_pb2.HelloRequest(name='you'))
  File "/home/james/.local/lib/python3.8/site-packages/grpc/_channel.py", line 824, in __call__
    state, call, = self._blocking(request, timeout, metadata, credentials,
  File "/home/james/.local/lib/python3.8/site-packages/grpc/_channel.py", line 813, in _blocking
    event = call.next_event()
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 338, in grpc._cython.cygrpc.SegregatedCall.next_event
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 169, in grpc._cython.cygrpc._next_call_event
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 163, in grpc._cython.cygrpc._next_call_event
  File "src/python/grpcio/grpc/_cython/_cygrpc/completion_queue.pyx.pxi", line 63, in grpc._cython.cygrpc._latent_event
  File "src/python/grpcio/grpc/_cython/_cygrpc/completion_queue.pyx.pxi", line 42, in grpc._cython.cygrpc._next
KeyboardInterrupt
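
One way to make the stall show up as an error rather than interrupting by hand is to pass a per-call timeout (a minimal sketch, assuming the stock helloworld stubs and the example's localhost:50051 address):

import grpc

import helloworld_pb2
import helloworld_pb2_grpc

with grpc.insecure_channel('localhost:50051') as channel:
    stub = helloworld_pb2_grpc.GreeterStub(channel)
    try:
        # 5-second deadline: while the resolver is stalled this is expected
        # to raise with StatusCode.DEADLINE_EXCEEDED instead of blocking.
        response = stub.SayHello(helloworld_pb2.HelloRequest(name='you'), timeout=5)
        print("Greeter client received: " + response.message)
    except grpc.RpcError as rpc_error:
        print("RPC failed:", rpc_error.code(), rpc_error.details())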

Some additional notes:

I have also run the following client/server language combinations:

  • Python client + Go server: issue present
  • Go client + Python server: no issue
  • Python client + JavaScript server: issue present
  • JavaScript client + Python server: no issue

Therefore, the issue appears to lie with the Python client.

I have tested all of these using VirtualBox. I have also run tests locally in a debugger and set breakpoints where the code hangs (see HANGS HERE ASSIGNING `response` OBJECT):

from __future__ import print_function
import logging

import grpc

import helloworld_pb2
import helloworld_pb2_grpc


def run():
    # NOTE(gRPC Python Team): .close() is possible on a channel and should be
    # used in circumstances in which the with statement does not fit the needs
    # of the code.
    with grpc.insecure_channel('localhost:50051') as channel: # added break
        stub = helloworld_pb2_grpc.GreeterStub(channel) # added break
        response = stub.SayHello(helloworld_pb2.HelloRequest(name='you')) # added break - HANGS HERE ASSIGNING `response` OBJECT
    print("Greeter client received: " + response.message) # added break


if __name__ == '__main__':
    logging.basicConfig()
    run()

Something interesting to note above is that adding additional response = stub.SayHello(helloworld_pb2.HelloRequest(name='you')) calls, i.e. calling the server multiple times, makes no noticeable difference to the total execution time. In other words, this seems to be a connection issue: once connected, subsequent calls execute immediately. This behavior can be verified with the route_guide example, where multiple calls are made from route_guide_client.py and it is obvious that only the first call hangs on Ubuntu Server 20. I also followed a random online tutorial which echoes a message, and the client I wrote “from scratch” was still hanging on first connect.
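
A minimal sketch of that timing check (the name='again' payload on the second call is just illustrative):

import time

import grpc

import helloworld_pb2
import helloworld_pb2_grpc

with grpc.insecure_channel('localhost:50051') as channel:
    stub = helloworld_pb2_grpc.GreeterStub(channel)

    # First call: on the affected VMs this is where the ~45 second stall occurs.
    start = time.perf_counter()
    stub.SayHello(helloworld_pb2.HelloRequest(name='you'))
    print("first call:  %.2f s" % (time.perf_counter() - start))

    # Second call on the same channel: returns essentially immediately.
    start = time.perf_counter()
    stub.SayHello(helloworld_pb2.HelloRequest(name='again'))
    print("second call: %.2f s" % (time.perf_counter() - start))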

In my estimation, the root of the issue is (broadly) either a Python problem or a Linux problem, since the gRPC version is never changed from v1.31.0. In an attempt to rule one of these out, I upgraded the Python version on the Ubuntu Server 16 box (recall that the gRPC Python Quick Start works as expected there) to Python 3.7, built from source. There was no issue, which leads me to believe some change introduced in Ubuntu 18, and still present in Ubuntu 20, causes the initial connection to hang instead of connecting in a reasonable time. To complicate matters further, I ran a similar test in which I completed the gRPC Python Quick Start on an Ubuntu 20 VM hosted on GCP, and there was NO issue present. I am nonetheless confident it will be reproducible running VirtualBox on Ubuntu 20 Desktop, since with the right software combination on my local machine I can get the Python client code to either run without hanging or hang upon first connection.

Anything else we should know about your project / environment?

I am running VirtualBox 6.1 on Ubuntu Desktop 20.04. I downloaded all Ubuntu .iso images from Canonical this afternoon (Aug 28, 2020), including Ubuntu 20 Desktop, and performed all of these tests on the host with a “fresh”, minimal install of Ubuntu 20 Desktop. The Windows image is the official developer environment .ova file from Microsoft.

Thanks,

James

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 2
  • Comments: 15 (8 by maintainers)

Most upvoted comments

The noteworthy thing from the log is:

D0831 13:10:45.577299339    6449 grpc_ares_wrapper.cc:285]   (c-ares resolver) request:0x55eaa9d99b80 on_hostbyname_done_locked: C-ares status is not ARES_SUCCESS qtype=AAAA name=localhost is_balancer=0: Timeout while contacting DNS servers

I’m guessing that the IPv4 localhost resolution worked instantaneously because it probably read directly from the local /etc/hosts file. However, the IPv6 answer was not in /etc/hosts, hence the query to the DNS server. The strange thing is why c-ares was unable to receive any responses from the DNS server that it discovered. Likely things got tripped up on either: a) c-ares resolved an incorrect DNS server address from the local configuration, or b) c-ares resolved the correct address of the local DNS server, but the DNS server or the network it’s on is for some reason unavailable.

@jameOne if you have a packet trace, e.g.:

tcpdump -i <name of primary network interface here> 'port 53'

that may also give some insight

It may also be interesting if you could share the contents of your /etc/resolv.conf files.
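
One quick, gRPC-independent check along these lines is to ask the system resolver for both record types (a sketch; note that socket.getaddrinfo goes through the glibc resolver rather than c-ares, so it mainly shows whether an AAAA answer for localhost is available locally, e.g. from an ::1 entry in /etc/hosts):

import socket

# Compare the IPv4 (A) and IPv6 (AAAA) answers for localhost. If ::1 is not
# served locally (e.g. missing from /etc/hosts), the AAAA lookup has to go to
# a DNS server, which would be consistent with the c-ares timeout above.
for family, label in ((socket.AF_INET, "A    (IPv4)"), (socket.AF_INET6, "AAAA (IPv6)")):
    try:
        results = socket.getaddrinfo("localhost", 50051, family, socket.SOCK_STREAM)
        print(label, "->", sorted({r[4][0] for r in results}))
    except socket.gaierror as exc:
        print(label, "-> lookup failed:", exc)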

@jameOne It seems the resolver is having trouble with localhost:50051.

  1. Can you check if localhost is set properly in your environment?
  2. Can you try a different resolver by setting the environment variable GRPC_DNS_RESOLVER=native? (A sketch of this follows the log below.)

D0831 13:09:27.577593604    6445 dns_resolver_ares.cc:184]   (c-ares resolver) resolver:0x55eaa9dbd8a0 AresDnsResolver::StartLocked() is called.
D0831 13:09:27.577596674    6445 grpc_ares_wrapper.cc:645]   (c-ares resolver) request:0x55eaa9d99b80 c-ares grpc_dns_lookup_ares_locked_impl name=localhost:50051, default_port=https
D0831 13:09:27.577653493    6445 grpc_ares_ev_driver.cc:158] (c-ares resolver) request:0x55eaa9d99b80 grpc_ares_ev_driver_create_locked
D0831 13:09:27.577679913    6445 grpc_ares_wrapper.cc:200]   (c-ares resolver) request:0x55eaa9d99b80 create_hostbyname_request_locked host:localhost port:33731 is_balancer:0 qtype:AAAA
D0831 13:09:27.577725082    6445 grpc_ares_wrapper.cc:200]   (c-ares resolver) request:0x55eaa9d99b80 create_hostbyname_request_locked host:localhost port:33731 is_balancer:0 qtype:A
D0831 13:09:27.577736032    6445 grpc_ares_wrapper.cc:227]   (c-ares resolver) request:0x55eaa9d99b80 on_hostbyname_done_locked qtype=A host=localhost ARES_SUCCESS
D0831 13:09:27.577740632    6445 grpc_ares_wrapper.cc:273]   (c-ares resolver) request:0x55eaa9d99b80 c-ares resolver gets a AF_INET result: 
  addr: 127.0.0.1
  port: 50051

D0831 13:09:27.577745522    6445 grpc_ares_ev_driver.cc:392] (c-ares resolver) request:0x55eaa9d99b80 new fd: c-ares fd: 10
D0831 13:09:27.577748092    6445 grpc_ares_ev_driver.cc:98]  (c-ares resolver) request:0x55eaa9d99b80 Ref ev_driver 0x55eaa9cfd900
D0831 13:09:27.577750552    6445 grpc_ares_ev_driver.cc:406] (c-ares resolver) request:0x55eaa9d99b80 notify read on: c-ares fd: 10
D0831 13:09:27.577753342    6445 grpc_ares_ev_driver.cc:463] (c-ares resolver) request:0x55eaa9d99b80 ev_driver=0x55eaa9cfd900 grpc_ares_ev_driver_start_locked. timeout in 120000 ms
D0831 13:09:27.577755682    6445 grpc_ares_ev_driver.cc:98]  (c-ares resolver) request:0x55eaa9d99b80 Ref ev_driver 0x55eaa9cfd900
D0831 13:09:27.577763612    6445 grpc_ares_ev_driver.cc:227] (c-ares resolver) request:0x55eaa9d99b80 ev_driver=0x55eaa9cfd900. next ares process poll time in 1000 ms
D0831 13:09:27.577765852    6445 grpc_ares_ev_driver.cc:98]  (c-ares resolver) request:0x55eaa9d99b80 Ref ev_driver 0x55eaa9cfd900
D0831 13:09:27.577773282    6445 dns_resolver_ares.cc:448]   (c-ares resolver) resolver:0x55eaa9dbd8a0 Started resolving. pending_request_:0x55eaa9d99b80
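
Regarding suggestion 2 above, a minimal sketch of switching the client to the native resolver (the variable just needs to be in the process environment before the channel is created; exporting GRPC_DNS_RESOLVER=native in the shell before running greeter_client.py has the same effect):

import os

# Must be in the environment before the channel is created; set here before
# importing grpc to be safe.
os.environ["GRPC_DNS_RESOLVER"] = "native"

import grpc

import helloworld_pb2
import helloworld_pb2_grpc

with grpc.insecure_channel('localhost:50051') as channel:
    stub = helloworld_pb2_grpc.GreeterStub(channel)
    response = stub.SayHello(helloworld_pb2.HelloRequest(name='you'))
    print("Greeter client received: " + response.message)

If the first call then completes promptly, that would point at the c-ares resolver / local DNS configuration rather than gRPC itself.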