grpc: gRPC Python 1.8.2 Library incompatibility with fork()
This bug is essentially the same as the previously-closed bug 12455. That bug was closed because it was fixed in 1.7.0… except we now have at least two ways to repro it in 1.8.2. 😦
Basic summary of the bug: if a gRPC channel is (1) used at least once and (2) still in scope when you call fork(), and the subprocess tries to open a gRPC channel as well, then all RPCs on that channel will hang. Apparently, some global state is leaking across the fork() boundary in a bad way.
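For reference, here is a sketch of the same failure pattern in raw grpcio, without the datastore client. It assumes a gRPC server is listening on localhost:50051, and it uses grpc.channel_ready_future() just to force the channel to connect; whether connecting alone counts as enough "use" to trigger the hang is an assumption on my part, so the datastore repro below (which makes a real RPC) remains the authoritative one.

```python
import multiprocessing

import grpc


def in_child():
    # Opening and using a fresh channel in the forked child is what hangs.
    channel = grpc.insecure_channel('localhost:50051')
    grpc.channel_ready_future(channel).result(timeout=10)
    print('child connected')  # Never reached when the bug triggers.


if __name__ == '__main__':
    # (1) Use a channel at least once in the parent...
    channel = grpc.insecure_channel('localhost:50051')
    grpc.channel_ready_future(channel).result(timeout=10)

    # (2) ...and fork while it is still in scope.
    multiprocessing.Process(target=in_child).start()
```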
What version of gRPC and what language are you using?
gRPC 1.8.2, Python 3.6
What operating system (Linux, Windows, …) and version?
Repro’ed on OS X 10.13.2
What runtime / compiler are you using (e.g. python version or version of gcc)
Python 3.6.2
What did you do?
Here is a minimal repro using the GCP datastore client:
```python
import multiprocessing

from google.cloud import datastore


def causeTrouble(where: str):
    client = datastore.Client(project='dev-storage-humu', namespace='aquarium')
    client.get(client.key('c', 'aquarium'))
    # The call to get() hangs forever; this line is never reached.
    print('OK')


if __name__ == '__main__':
    # Create a datastore client and do an RPC on it.
    client = datastore.Client(project='dev-storage-humu', namespace='aquarium')
    client.get(client.key('c', 'aquarium'))

    # Kick off a child process while the first client is still in scope.
    process = multiprocessing.Process(target=causeTrouble,
                                      args=['child process'])
    process.start()
```
- If you change this so that it doesn’t call client.get() from main, just creating the client and then forking, it works fine.
- If you change this so that, instead of creating the client and calling client.get() directly, it calls causeTrouble() before forking, it also works fine. (NB: Python has eager GC via reference counting, so the difference between the two is that the client is no longer in scope at fork time if you do it this way.)
- In another server, I worked around this by calling subprocess.run() instead of multiprocessing.Process(); since subprocess.run() does a fork() + exec(), no state survives across the process boundary (see the sketch below). However, the exec() strictly limits communication with the parent process, basically making all of the multiprocessing library unusable.
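A hedged sketch of that fork() + exec() workaround; the child script name and its arguments are hypothetical:

```python
import subprocess
import sys

# run() execs a fresh interpreter, so no gRPC state from this process
# survives into the child. The tradeoff: data can only be passed via
# argv, pipes, or files, not multiprocessing's richer primitives.
result = subprocess.run(
    [sys.executable, 'long_running_job.py', '--where', 'child process'],
    check=True,
)
```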
If you kill it while it’s hanging, you get this stack trace:
```
Traceback (most recent call last):
  File "/Users/zunger/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/Users/zunger/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "minimal_repro.py", line 8, in causeTrouble
    client.get(client.key('c', 'aquarium'))
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/cloud/datastore/client.py", line 309, in get
    deferred=deferred, transaction=transaction)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/cloud/datastore/client.py", line 356, in get_multi
    transaction_id=transaction and transaction.id,
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/cloud/datastore/client.py", line 138, in _extended_lookup
    project, read_options, key_pbs)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/cloud/datastore/_gax.py", line 115, in lookup
    return super(GAPICDatastoreAPI, self).lookup(*args, **kwargs)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/cloud/gapic/datastore/v1/datastore_client.py", line 204, in lookup
    return self._lookup(request, options)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/gax/api_callable.py", line 452, in inner
    return api_caller(api_call, this_settings, request)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/gax/api_callable.py", line 438, in base_caller
    return api_call(*args)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/gax/api_callable.py", line 376, in inner
    return a_func(*args, **kwargs)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/gax/retry.py", line 121, in inner
    return to_call(*args)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/gax/retry.py", line 68, in inner
    return a_func(*updated_args, **kwargs)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/grpc/_channel.py", line 484, in __call__
    credentials)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/grpc/_channel.py", line 478, in _blocking
    _handle_event(completion_queue.poll(), state,
  File "src/python/grpcio/grpc/_cython/_cygrpc/completion_queue.pyx.pxi", line 100, in grpc._cython.cygrpc.CompletionQueue.poll
```
On b/12455, @katbusch reported that simply importing certain libraries prior to the fork (e.g. from google.cloud import bigquery) was sufficient to trigger this bug. Presumably they do enough initialization at import time to trigger it.
That makes workarounds where we pre-fork a pile of worker processes and pass them jobs as needed more difficult, although possible as a short-term measure.
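For completeness, a sketch of that pre-fork worker pattern under the constraint above: the parent never imports anything gRPC-related at top level, the workers are forked first, and each worker builds its own client after the fork. The key names fed to the queue are hypothetical; the project and namespace are copied from the repro.

```python
import multiprocessing


def worker(jobs):
    # Import and construct the client only after the fork, inside the
    # child, so no gRPC state from the parent is inherited.
    from google.cloud import datastore
    client = datastore.Client(project='dev-storage-humu', namespace='aquarium')
    for key_name in iter(jobs.get, None):  # None is the shutdown sentinel.
        print(client.get(client.key('c', key_name)))


if __name__ == '__main__':
    jobs = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(jobs,))
               for _ in range(4)]
    for w in workers:
        w.start()            # Fork happens here, before any gRPC import.
    jobs.put('aquarium')     # Hand jobs to the pool...
    for _ in workers:
        jobs.put(None)       # ...then tell every worker to exit.
    for w in workers:
        w.join()
```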
Anything else we should know about your project / environment?
The use case which necessitates this is that we have a server (which uses GCP features extensively, e.g. for storage) that needs to fork off subprocesses in which to run long-running operations. (It’s basically an analysis pipeline manager.) Since Python is heavily reliant on fork() for parallelization (thanks to the GIL, there’s no real parallelism in its threading), this is the main approach available.
To summarize the issue described here: Creating an RPC channel and then forking will not work. This is a known limitation, and while we would like to support this use case, there are no immediate plans to support this behavior.
Closing this issue as there are no immediate plans to support this.
Oh my god what is this nonsense
Where is this documented?
Does the environment variable work with other versions or only 1.8.1?

Thanks on behalf of all of us watching this ticket, which was closed back in May basically as “wontfix”.
v1.15.0 solves this – updating this ticket to save others three or four clicks
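For anyone landing here via search: the environment variables alluded to above are GRPC_ENABLE_FORK_SUPPORT and GRPC_POLL_STRATEGY. As a hedged sketch (exact values and which releases honor them vary), they have to be set before grpc is first imported anywhere in the process:

```python
import os

# Must run before the first `import grpc`, since the C core reads these
# at initialization; supported values are release-dependent.
os.environ['GRPC_ENABLE_FORK_SUPPORT'] = 'true'
os.environ['GRPC_POLL_STRATEGY'] = 'poll'

import grpc  # noqa: E402  (deliberately imported after setting the vars)
```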
I hope this is fixed soon; not supporting multiprocessing in the Python bindings makes gRPC (and the Google Cloud SDK, which relies on it) basically useless in Python. Multi-core processing is required for a lot of data- and CPU-intensive workloads, e.g. machine learning. A lot of widely used libraries (scikit-learn, pandas, etc.) don’t release the GIL, which makes fork/exec/multiprocessing the only option. @kpayson64 please consider reopening the issue. Thanks.