grpc: gRPC Python 1.8.2 Library incompatibility with fork()
This bug is essentially the same as the previously-closed bug 12455. That bug was closed because it was fixed in 1.7.0… except we now have at least two ways to repro it in 1.8.2. 😦
Basic summary of the bug: if a gRPC channel is (1) used at least once and (2) still in scope when you call fork(), and the subprocess tries to open a gRPC channel as well, then all RPCs on that channel will hang. Apparently, some global state is leaking across the fork() boundary in a bad way.
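For reference, here is a sketch of the same failure pattern in raw grpcio, without the datastore client. It assumes a gRPC server is listening on localhost:50051, and it uses grpc.channel_ready_future() just to force the channel to connect; whether connecting alone counts as enough "use" to trigger the hang is an assumption on my part, so the datastore repro below (which makes a real RPC) remains the authoritative one.

```python
import multiprocessing

import grpc


def in_child():
    # Opening and using a fresh channel in the forked child is what hangs.
    channel = grpc.insecure_channel('localhost:50051')
    grpc.channel_ready_future(channel).result(timeout=10)
    print('child connected')  # Never reached when the bug triggers.


if __name__ == '__main__':
    # (1) Use a channel at least once in the parent...
    channel = grpc.insecure_channel('localhost:50051')
    grpc.channel_ready_future(channel).result(timeout=10)

    # (2) ...and fork while it is still in scope.
    multiprocessing.Process(target=in_child).start()
```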
What version of gRPC and what language are you using?
gRPC 1.8.2, Python 3.6
What operating system (Linux, Windows, …) and version?
Repro’ed on OS X 10.13.2
What runtime / compiler are you using (e.g. python version or version of gcc)
Python 3.6.2
What did you do?
Here is a minimal repro using the GCP datastore client:
```python
import multiprocessing

from google.cloud import datastore


def causeTrouble(where: str):
    client = datastore.Client(project='dev-storage-humu', namespace='aquarium')
    client.get(client.key('c', 'aquarium'))
    # The call to get() hangs forever; this line is never reached.
    print('OK')


if __name__ == '__main__':
    # Create a datastore client and do an RPC on it.
    client = datastore.Client(project='dev-storage-humu', namespace='aquarium')
    client.get(client.key('c', 'aquarium'))

    # Kick off a child process while the first client is still in scope.
    process = multiprocessing.Process(target=causeTrouble,
                                      args=['child process'])
    process.start()
```
- If you change this so that it doesn’t call client.get() from main, just creating the client and then forking, it works fine.
- If you change this so that, instead of creating the client and calling client.get() directly, it calls causeTrouble() before forking, it also works fine. (NB: Python has eager GC via reference counting, so the difference between the two is that the client is no longer in scope at fork time if you do it this way.)
- In another server, I worked around this by calling subprocess.run() instead of multiprocessing.Process(); since subprocess.run() does a fork() + exec(), no state survives across the process boundary (see the sketch below). However, the exec() strictly limits communication with the parent process, basically making all of the multiprocessing library unusable.
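A hedged sketch of that fork() + exec() workaround; the child script name and its arguments are hypothetical:

```python
import subprocess
import sys

# run() execs a fresh interpreter, so no gRPC state from this process
# survives into the child. The tradeoff: data can only be passed via
# argv, pipes, or files, not multiprocessing's richer primitives.
result = subprocess.run(
    [sys.executable, 'long_running_job.py', '--where', 'child process'],
    check=True,
)
```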
If you kill it while it’s hanging, you get this stack trace:
```
Traceback (most recent call last):
  File "/Users/zunger/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/Users/zunger/.pyenv/versions/3.6.2/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "minimal_repro.py", line 8, in causeTrouble
    client.get(client.key('c', 'aquarium'))
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/cloud/datastore/client.py", line 309, in get
    deferred=deferred, transaction=transaction)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/cloud/datastore/client.py", line 356, in get_multi
    transaction_id=transaction and transaction.id,
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/cloud/datastore/client.py", line 138, in _extended_lookup
    project, read_options, key_pbs)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/cloud/datastore/_gax.py", line 115, in lookup
    return super(GAPICDatastoreAPI, self).lookup(*args, **kwargs)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/cloud/gapic/datastore/v1/datastore_client.py", line 204, in lookup
    return self._lookup(request, options)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/gax/api_callable.py", line 452, in inner
    return api_caller(api_call, this_settings, request)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/gax/api_callable.py", line 438, in base_caller
    return api_call(*args)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/gax/api_callable.py", line 376, in inner
    return a_func(*args, **kwargs)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/gax/retry.py", line 121, in inner
    return to_call(*args)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/google/gax/retry.py", line 68, in inner
    return a_func(*updated_args, **kwargs)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/grpc/_channel.py", line 484, in __call__
    credentials)
  File "/Users/zunger/src/humu/servers/squeegee/server/.build/venv/lib/python3.6/site-packages/grpc/_channel.py", line 478, in _blocking
    _handle_event(completion_queue.poll(), state,
  File "src/python/grpcio/grpc/_cython/_cygrpc/completion_queue.pyx.pxi", line 100, in grpc._cython.cygrpc.CompletionQueue.poll
```
On b/12455, @katbusch reported that simply importing certain libraries prior to the fork (e.g. from google.cloud import bigquery) was sufficient to trigger this bug. Presumably they do enough initialization at import time to trigger it.
That makes workarounds where we pre-fork a pile of worker processes and pass them jobs as needed more difficult, although possible as a short-term measure.
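For completeness, a sketch of that pre-fork worker pattern under the constraint above: the parent never imports anything gRPC-related at top level, the workers are forked first, and each worker builds its own client after the fork. The key names fed to the queue are hypothetical; the project and namespace are copied from the repro.

```python
import multiprocessing


def worker(jobs):
    # Import and construct the client only after the fork, inside the
    # child, so no gRPC state from the parent is inherited.
    from google.cloud import datastore
    client = datastore.Client(project='dev-storage-humu', namespace='aquarium')
    for key_name in iter(jobs.get, None):  # None is the shutdown sentinel.
        print(client.get(client.key('c', key_name)))


if __name__ == '__main__':
    jobs = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(jobs,))
               for _ in range(4)]
    for w in workers:
        w.start()            # Fork happens here, before any gRPC import.
    jobs.put('aquarium')     # Hand jobs to the pool...
    for _ in workers:
        jobs.put(None)       # ...then tell every worker to exit.
    for w in workers:
        w.join()
```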
Anything else we should know about your project / environment?
The use case which necessitates this is that we have a server (which uses GCP features extensively, e.g. for storage) that needs to fork off subprocesses in which to run long-running operations. (It’s basically an analysis pipeline manager.) Since Python is heavily reliant on fork() for parallelization (thanks to the GIL, there’s no real parallelism in its threading), this is the main approach available.
To summarize the issue described here: Creating an RPC channel and then forking will not work. This is a known limitation, and while we would like to support this use case, there are no immediate plans to support this behavior.
Closing this issue as there are no immediate plans to support this.
Oh my god what is this nonsense
Where is this documented?
Does the environment variable work with other versions or only 1.8.1?

Thanks on behalf of all of us watching this ticket, which was closed back in May basically as “wontfix”.
v1.15.0 solves this – updating this ticket to save others three or four clicks
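For anyone landing here via search: the environment variables alluded to above are GRPC_ENABLE_FORK_SUPPORT and GRPC_POLL_STRATEGY. As a hedged sketch (exact values and which releases honor them vary), they have to be set before grpc is first imported anywhere in the process:

```python
import os

# Must run before the first `import grpc`, since the C core reads these
# at initialization; supported values are release-dependent.
os.environ['GRPC_ENABLE_FORK_SUPPORT'] = 'true'
os.environ['GRPC_POLL_STRATEGY'] = 'poll'

import grpc  # noqa: E402  (deliberately imported after setting the vars)
```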
I hope this is fixed soon; not supporting multiprocessing in the Python bindings makes gRPC (and the Google Cloud SDK, which relies on it) basically useless in Python. Multi-core processing is required for a lot of data- and CPU-intensive workloads, e.g. machine learning. A lot of widely used libraries (scikit-learn, pandas, etc.) don’t release the GIL, which makes fork/exec/multiprocessing the only option. @kpayson64 please consider reopening the issue. Thanks.