tensorflow: Session hang issue with python multiprocessing
Issue summary
I am having trouble allocating GPU devices for a multiprocessing pool. Please see the short code reproduction below. I would like to understand why I am getting the CUDA_ERROR_NOT_INITIALIZED error in case 4. For this case, the program hangs, and I have to stop my docker container to exit.
Minimal reproducible example
Core code:
import tensorflow as tf

def run_session(device):
    # Restrict this session to a single GPU and let memory grow on demand.
    gpu_options = tf.GPUOptions(allow_growth=True, visible_device_list=device)
    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
    print('Using device #%s' % device)
    a = tf.placeholder(tf.int16, name='a')
    y = tf.identity(a, name='y')
    print(sess.run(y, feed_dict={a: 3}))
    sess.close()
    print('Done.')
Case 1 (this works fine):
run_session('0')
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:08:00.0
Total memory: 5.97GiB
Free memory: 5.86GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:08:00.0)
Using device #0
3
Done.
Case 2 (this works fine):
run_session('0')
run_session('1')
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:08:00.0
Total memory: 5.97GiB
Free memory: 5.86GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:08:00.0)
Using device #0
3
Done.
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x24cbbe0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:84:00.0
Total memory: 5.97GiB
Free memory: 5.86GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 1, name: GeForce GTX 980 Ti, pci bus id: 0000:84:00.0)
Using device #1
3
Done.
Case 3 (this works fine):
import multiprocessing as mp
p = mp.Pool(2)
p.map(run_session, ['0', '1'])
p.close()
p.join()
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:84:00.0
Total memory: 5.97GiB
Free memory: 5.86GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 1, name: GeForce GTX 980 Ti, pci bus id: 0000:84:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:08:00.0
Total memory: 5.97GiB
Free memory: 5.86GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:08:00.0)
Using device #1
Using device #0
3
Done.
3
Done.
Case 4 (here, the program hangs):
import multiprocessing as mp
run_session('0')
p = mp.Pool(2)
p.map(run_session, ['0', '1'])
p.close()
p.join()
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:08:00.0
Total memory: 5.97GiB
Free memory: 5.86GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:08:00.0)
Using device #0
3
Done.
E tensorflow/stream_executor/cuda/cuda_driver.cc:1368] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED
Using device #0
E tensorflow/stream_executor/cuda/cuda_driver.cc:1368] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED
Using device #1
Environment info
Operating System: Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-25-generic x86_64)
Docker container: gcr.io/tensorflow/tensorflow:latest-devel-gpu
CUDA version: 8.0.61
cuDNN version: 5.1.10
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 49
- Comments: 23 (1 by maintainers)
Commits related to this issue
- Demo for multiprocessing with HanLP and TensorFlow https://github.com/tensorflow/tensorflow/issues/8220#issuecomment-302826884 — committed to hankcs/HanLP by hankcs 4 years ago
@suharshs Python multiprocessing works fine with tensorflow. The only thing to watch out for is that tensorflow must be imported independently inside each process (and you must use multiprocessing rather than multithreading, since tensorflow will take over the entire process). Below is how I achieved multi-GPU, multi-process inferencing, and I hope it helps:
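The commenter's original snippet is not reproduced here; the following is a minimal sketch of the pattern it describes, assuming two GPUs and reusing the toy graph from the issue body. TensorFlow is imported only inside the worker, never in the parent:

import multiprocessing as mp

def infer_on_gpu(device):
    # Import TensorFlow inside the worker so the CUDA context is created
    # in the child process rather than inherited from the parent.
    import tensorflow as tf
    gpu_options = tf.GPUOptions(allow_growth=True, visible_device_list=device)
    with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
        a = tf.placeholder(tf.int16, name='a')
        y = tf.identity(a, name='y')
        return sess.run(y, feed_dict={a: 3})

if __name__ == '__main__':
    # The parent never imports tensorflow, so fork() has no CUDA state
    # or TensorFlow threads to hand down to the workers.
    pool = mp.Pool(2)
    print(pool.map(infer_on_gpu, ['0', '1']))
    pool.close()
    pool.join()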
The python multiprocessing package seems to just call fork when creating a child process. This cannot work when the child process calls async code (i.e., TensorFlow is multithreaded). From the POSIX spec for fork: after a multi-threaded process forks, the child may only execute async-signal-safe operations until it calls one of the exec functions.
So long story short, don’t use python multiprocessing for anything non-trivial and expect it to work 😃
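One workaround that follows from this explanation, though it is not spelled out in the thread, is to use the 'spawn' start method (available in Python 3.4+) so that each worker starts a fresh interpreter instead of inheriting the parent's state via fork. A minimal sketch:

import multiprocessing as mp

def run_session(device):
    # Imported here so each freshly spawned interpreter initialises
    # TensorFlow (and CUDA) from scratch.
    import tensorflow as tf
    gpu_options = tf.GPUOptions(allow_growth=True, visible_device_list=device)
    with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
        a = tf.placeholder(tf.int16, name='a')
        y = tf.identity(a, name='y')
        print(sess.run(y, feed_dict={a: 3}))

if __name__ == '__main__':
    # 'spawn' starts a brand-new Python process rather than forking,
    # so no half-initialised CUDA context is carried into the workers.
    mp.set_start_method('spawn')
    p = mp.Pool(2)
    p.map(run_session, ['0', '1'])
    p.close()
    p.join()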
Hi, I had the same issue today, but this problem can be resolved by putting import tensorflow as tf inside your worker function (and the result is well parallelised).
Update: I have looked into this a bit more, and have a couple more interesting repro cases 😃
Works:
Hangs:
It looks like there is some shared python tensorflow state that interferes when a new python process is created (multiprocessing creates a new python process, whose state separation I am not too clear on). I plan to look into it very soon, but just wanted to provide an update in case that gives you any workarounds.
I am also running into this issue. Multiprocessing works unless I first run a session in the parent process. I’ve tried moving the “import tensorflow” statement into the function as @Lancerchiang suggested, with no luck. Below is my minimal repro with 4 test cases.
@breckuh If you really need to run a tensorflow session in your parent process, my advice is to launch explicit child processes, as I did above, instead of using pool mapping, and to import tensorflow in your parent process only after you have done so in your child processes. It would give:
You can see the three tensorflow sessions finished successfully.
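The commenter's exact snippet and its output are not reproduced above; the following is a minimal sketch of the arrangement described, assuming two GPUs and the same toy run_session worker as in the issue body:

import multiprocessing as mp

def run_session(device):
    # Import inside the worker so each child builds its own CUDA context.
    import tensorflow as tf
    gpu_options = tf.GPUOptions(allow_growth=True, visible_device_list=device)
    with tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) as sess:
        a = tf.placeholder(tf.int16, name='a')
        y = tf.identity(a, name='y')
        print(sess.run(y, feed_dict={a: 3}))

if __name__ == '__main__':
    # Fork the explicit child processes before the parent touches tensorflow.
    children = [mp.Process(target=run_session, args=(d,)) for d in ['0', '1']]
    for c in children:
        c.start()

    # Only now does the parent import tensorflow and run its own session.
    run_session('0')

    for c in children:
        c.join()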
I just hit the same issue when using a celery worker to run tensorflow on the GPU. Is this issue solved?
That would be very slow for inference, since importing TensorFlow and loading the model take seconds.
Thanks @Lancerchiang that makes sense. I don’t actually know if we’ll ever have this use case in practice, it only came up because our test suite was failing when certain tests were run in different orders. Then we fell down a rabbit hole isolating this 😃. In the end we just had the workaround where we specifically arranged our suite to run the tests in the child processes first, and then the tests in the parent processes after. Not ideal but good enough for now. What I would like to do is add a line or two to check if this hang might hit and then Throw/Alert the user, so no one is left hanging.
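A rough sketch of such a guard, assuming Python 3 and with all names hypothetical (record_session_created and guarded_pool are not part of TensorFlow or multiprocessing), which fails loudly instead of letting the pool deadlock:

import multiprocessing as mp

# Hypothetical module-level flag; set it right after building a tf.Session
# in the parent process.
_SESSION_CREATED_IN_THIS_PROCESS = False

def record_session_created():
    global _SESSION_CREATED_IN_THIS_PROCESS
    _SESSION_CREATED_IN_THIS_PROCESS = True

def guarded_pool(processes):
    # Forking after CUDA has been initialised in this process is what hangs,
    # so refuse to create the pool and tell the user why.
    if _SESSION_CREATED_IN_THIS_PROCESS and mp.get_start_method() == 'fork':
        raise RuntimeError(
            'A TensorFlow session was already created in this process; '
            'forking a Pool now is likely to hang. Create sessions only in '
            'child processes, or use the "spawn" start method.')
    return mp.Pool(processes)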