tensorflow: Segmentation fault when GPUs are already used
When I set the NVIDIA driver to exclusive compute mode and one of the GPUs is already in use by another process, I get a segmentation fault:
$ python -c "import tensorflow as tf;tf.InteractiveSession()"
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 12
Segmentation fault (core dumped)
If I limit the visible GPUs to only those with nothing running on them, it doesn't segfault:
$ CUDA_VISIBLE_DEVICES=1 python -c "import tensorflow as tf;tf.InteractiveSession()"
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 12
I tensorflow/core/common_runtime/gpu/gpu_init.cc:88] Found device 0 with properties:
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:09:00.0
Total memory: 12.00GiB
Free memory: 11.87GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:122] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:47] Setting region size to 12105628263
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 12
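The same workaround also works from inside a script, provided the variable is set before TensorFlow creates its first session, which is when it enumerates the GPUs. A minimal sketch, with the device index hard-coded purely as an example:

import os
# Must run before the first Session/InteractiveSession, i.e. before
# TensorFlow probes and grabs the GPUs.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # index of an idle GPU on this machine

import tensorflow as tf
sess = tf.InteractiveSession()  # CUDA now only exposes the listed GPU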
Here is my output of nvidia-smi:
$ nvidia-smi
Wed Nov 11 16:48:27 2015
+------------------------------------------------------+
| NVIDIA-SMI 352.39     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750     Off  | 0000:05:00.0      On |                  N/A |
| N/A   48C    P8     0W /  38W |     25MiB /  2047MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:06:00.0     Off |                  N/A |
| 42%   82C    P2   127W / 250W |    262MiB / 12287MiB |     39%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 0000:09:00.0     Off |                  N/A |
| 22%   46C    P8    17W / 250W |     23MiB / 12287MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 0000:0A:00.0     Off |                  N/A |
| 22%   37C    P8    15W / 250W |    361MiB / 12287MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1171    G   /usr/bin/X                                      17MiB |
|    1     32740    C   python                                         209MiB |
|    3      9429    C   python                                         336MiB |
+-----------------------------------------------------------------------------+
I suppose that the code that checks the available GPUs doesn't correctly handle the case where one of the GPUs can't be used.
About this issue
- Original URL
- State: closed
- Created 9 years ago
- Reactions: 6
- Comments: 42 (16 by maintainers)
Commits related to this issue
- Removed extra parens (#152) fixes b/28821889 — committed to tarasglek/tensorflow by itsmeolivia 8 years ago
- [OpenCL] Removes unneeded device to host copy (#152) The SYCL ApplyGradientDescent operation does not need to copy the scalar `lr` back to the host, but can use broadcasting to read the scalar on t... — committed to codeplaysoftware/tensorflow by jwlawson 7 years ago
@nouiz, you are correct in your analysis. TensorFlow tries to grab every GPU it sees on the system that passes its criteria. CUDA_VISIBLE_DEVICES is the solution I would suggest. Please let us know if that is not enough.
@yaroslavvb Looking at the NVIDIA docs more, I think you can get your setup_one_gpu script to work reliably if you set the CUDA_DEVICE_ORDER environment variable to 'PCI_BUS_ID'.
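A rough sketch of how such a helper could combine the two variables; the function below is illustrative rather than the actual setup_one_gpu script, and it assumes the driver's nvidia-smi supports the --query-gpu and --query-compute-apps options. The point is that an index taken from nvidia-smi only names the same device to CUDA once CUDA_DEVICE_ORDER is PCI_BUS_ID, because CUDA's default ordering is FASTEST_FIRST:

import os
import subprocess

def pick_idle_gpu():
    """Return the nvidia-smi index of the first GPU with no compute process."""
    gpus = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,pci.bus_id", "--format=csv,noheader"]
    ).decode().strip().splitlines()
    busy = set(subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=gpu_bus_id", "--format=csv,noheader"]
    ).decode().split())
    for line in gpus:
        index, bus_id = [field.strip() for field in line.split(",")]
        if bus_id not in busy:
            return index
    raise RuntimeError("no idle GPU found")

# PCI_BUS_ID makes CUDA number devices the same way nvidia-smi does,
# so the index picked above refers to the device the user expects.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = pick_idle_gpu()

import tensorflow as tf
tf.InteractiveSession()  # sees only the idle GPU chosen above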
@zheng-xq A cheesy solution that would probably reduce the likelihood of complaints (but create some less-likely new problems) would be to force-set CUDA_DEVICE_ORDER='PCI_BUS_ID' within TensorFlow.
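A sketch of what that could look like at the Python layer; this is purely illustrative (TensorFlow's device setup actually lives in C++) and the function name is hypothetical:

import os

def _default_cuda_device_order():
    # setdefault only fills in a value when the user has not exported one,
    # so an explicit CUDA_DEVICE_ORDER still wins; that confines the
    # "less-likely new problems" to setups that rely on FASTEST_FIRST.
    os.environ.setdefault("CUDA_DEVICE_ORDER", "PCI_BUS_ID")

_default_cuda_device_order()  # would have to run before CUDA is initialized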