tensorflow: Segmentation fault when GPUs are already used
When I set the NVIDIA driver to exclusive compute mode and one of the GPUs is already in use by another process, I get a segmentation fault:
$ python -c "import tensorflow as tf;tf.InteractiveSession()"
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 12
Segmentation fault (core dumped)
If I limit the visible GPUs to only those with nothing running on them, it doesn't segfault:
$ CUDA_VISIBLE_DEVICES=1 python -c "import tensorflow as tf;tf.InteractiveSession()"
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 12
I tensorflow/core/common_runtime/gpu/gpu_init.cc:88] Found device 0 with properties:
name: GeForce GTX TITAN X
major: 5 minor: 2 memoryClockRate (GHz) 1.076
pciBusID 0000:09:00.0
Total memory: 12.00GiB
Free memory: 11.87GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:122] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:09:00.0)
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:47] Setting region size to 12105628263
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 12
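The same workaround also works from inside a script, provided the variable is set before TensorFlow creates its first session, which is when it enumerates the GPUs. A minimal sketch, with the device index hard-coded purely as an example:

import os
# Must run before the first Session/InteractiveSession, i.e. before
# TensorFlow probes and grabs the GPUs.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # index of an idle GPU on this machine

import tensorflow as tf
sess = tf.InteractiveSession()  # CUDA now only exposes the listed GPU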
Here is my output of nvidia-smi:
$ nvidia-smi
Wed Nov 11 16:48:27 2015
+------------------------------------------------------+
| NVIDIA-SMI 352.39     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750     Off  | 0000:05:00.0      On |                  N/A |
| N/A   48C    P8     0W /  38W |     25MiB /  2047MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:06:00.0     Off |                  N/A |
| 42%   82C    P2   127W / 250W |    262MiB / 12287MiB |     39%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 0000:09:00.0     Off |                  N/A |
| 22%   46C    P8    17W / 250W |     23MiB / 12287MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 0000:0A:00.0     Off |                  N/A |
| 22%   37C    P8    15W / 250W |    361MiB / 12287MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1171    G   /usr/bin/X                                      17MiB |
|    1     32740    C   python                                         209MiB |
|    3      9429    C   python                                         336MiB |
+-----------------------------------------------------------------------------+
I suppose that the code that checks the available GPUs doesn't correctly handle the case where one of the GPUs can't be used.
About this issue
- Original URL
- State: closed
- Created 9 years ago
- Reactions: 6
- Comments: 42 (16 by maintainers)
Commits related to this issue
- Removed extra parens (#152) fixes b/28821889 — committed to tarasglek/tensorflow by itsmeolivia 8 years ago
- [OpenCL] Removes unneeded device to host copy (#152) The SYCL ApplyGradientDescent operation does not need to copy the scalar `lr` back to the host, but can use broadcasting to read the scalar on t... — committed to codeplaysoftware/tensorflow by jwlawson 7 years ago
@nouiz, you are correct in your analysis. TensorFlow tries to grab every GPU it sees on the system that passes its criteria. CUDA_VISIBLE_DEVICES is the solution I would suggest. Please let us know if that is not enough.
@yaroslavvb Looking at the NVIDIA docs more, I think you can get your setup_one_gpu script to work reliably if you set the CUDA_DEVICE_ORDER environment variable to 'PCI_BUS_ID'.
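A rough sketch of how such a helper could combine the two variables; the function below is illustrative rather than the actual setup_one_gpu script, and it assumes the driver's nvidia-smi supports the --query-gpu and --query-compute-apps options. The point is that an index taken from nvidia-smi only names the same device to CUDA once CUDA_DEVICE_ORDER is PCI_BUS_ID, because CUDA's default ordering is FASTEST_FIRST:

import os
import subprocess

def pick_idle_gpu():
    """Return the nvidia-smi index of the first GPU with no compute process."""
    gpus = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,pci.bus_id", "--format=csv,noheader"]
    ).decode().strip().splitlines()
    busy = set(subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=gpu_bus_id", "--format=csv,noheader"]
    ).decode().split())
    for line in gpus:
        index, bus_id = [field.strip() for field in line.split(",")]
        if bus_id not in busy:
            return index
    raise RuntimeError("no idle GPU found")

# PCI_BUS_ID makes CUDA number devices the same way nvidia-smi does,
# so the index picked above refers to the device the user expects.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = pick_idle_gpu()

import tensorflow as tf
tf.InteractiveSession()  # sees only the idle GPU chosen above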
@zheng-xq A cheesy solution that would probably reduce the likelihood of complaints (but create some less-likely new problems) would be to force-set CUDA_DEVICE_ORDER='PCI_BUS_ID' within TensorFlow.
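A sketch of what that could look like at the Python layer; this is purely illustrative (TensorFlow's device setup actually lives in C++) and the function name is hypothetical:

import os

def _default_cuda_device_order():
    # setdefault only fills in a value when the user has not exported one,
    # so an explicit CUDA_DEVICE_ORDER still wins; that confines the
    # "less-likely new problems" to setups that rely on FASTEST_FIRST.
    os.environ.setdefault("CUDA_DEVICE_ORDER", "PCI_BUS_ID")

_default_cuda_device_order()  # would have to run before CUDA is initialized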