tensorflow: device_lib.list_local_devices() InvalidArgumentError: Invalid device ordinal value (1). Valid range is [0, 0].
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): gLinux
- TensorFlow installed from (source or binary): binary (pip3)
- TensorFlow version (use command below): 2.0.0-rc1
- Python version: 3.6.2
- CUDA/cuDNN version: 10.0, 7.6.3
- GPU model and memory: Quadro P1000 (4 GB) and TITAN RTX (24 GB)
Output of nvidia-smi from the terminal:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.34 Driver Version: 430.34 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P1000 Off | 00000000:65:00.0 On | N/A |
| 37% 52C P0 N/A / N/A | 1289MiB / 4037MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN RTX Off | 00000000:B3:00.0 Off | N/A |
| 41% 29C P8 14W / 280W | 1155MiB / 24220MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
NOTE: the nvidia-smi output above reports CUDA Version: 10.1, but my LD_LIBRARY_PATH environment variable points to CUDA 10.0.
Snippet of code that causes the problem:
import tensorflow as tf
from tensorflow.python.client import device_lib

# Listing local devices right after import raises the error below.
device_lib.list_local_devices()
Error message:
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-2-b6f1169dc7e5> in <module>
2 from tensorflow.python.client import device_lib
3
---> 4 device_lib.list_local_devices()
~/.local/lib/python3.6/site-packages/tensorflow_core/python/client/device_lib.py in list_local_devices(session_config)
39 return [
40 _convert(s)
---> 41 for s in pywrap_tensorflow.list_devices(session_config=session_config)
42 ]
~/.local/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py in list_devices(session_config)
2247 return ListDevicesWithSessionConfig(session_config.SerializeToString())
2248 else:
-> 2249 return ListDevices()
2250
2251
InvalidArgumentError: Invalid device ordinal value (1). Valid range is [0, 0].
while setting up XLA_GPU_JIT device number 1
Potential cause and current workaround:
In the nvidia-smi output I noticed that the Quadro P1000 in my workstation has only 5 multiprocessors, so by default TF will not use it (the minimum is 8). As a workaround I added the following line to my .bashrc and ran source .bashrc, and the error went away:
export TF_MIN_GPU_MULTIPROCESSOR_COUNT=5
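For reference, the same workaround can also be applied from Python rather than .bashrc. This is just a sketch, under the assumption that the variable is read when TensorFlow initializes its GPU devices, so it has to be set before the first import:

import os

# Assumption: set the variable before the first import of tensorflow,
# so it is already in place when TF initializes its GPU devices.
os.environ["TF_MIN_GPU_MULTIPROCESSOR_COUNT"] = "5"

import tensorflow as tf
from tensorflow.python.client import device_lib

# With the lowered threshold both GPUs should be eligible again.
print(device_lib.list_local_devices())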
Another potential solution, if I don't want to lower the minimum GPU multiprocessor count, is to remove the Quadro P1000 from my workstation. I suspect there is an inconsistency within list_local_devices(): it fetches all GPUs in the workstation but does not apply the minimum-multiprocessor-count rule when doing so. So I ran another experiment to see whether I could still reproduce the error after setting TF_MIN_GPU_MULTIPROCESSOR_COUNT to 5.
The code below reproduces the same error:
import tensorflow as tf
from tensorflow.python.client import device_lib

# Make only the first physical GPU visible to TensorFlow...
gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_visible_devices(gpus[0], 'GPU')

# ...then listing local devices raises the same InvalidArgumentError.
device_lib.list_local_devices()
This produces the same error. However, if we call device_lib.list_local_devices() first, then call tf.config.experimental.set_visible_devices(gpus[0], 'GPU'), and then call device_lib.list_local_devices() again, there is no error. I suspect that setting the visible devices interacts badly with list_local_devices().
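To make the ordering explicit, here is a sketch of the sequence that did not raise for me (with TF_MIN_GPU_MULTIPROCESSOR_COUNT still set to 5):

import tensorflow as tf
from tensorflow.python.client import device_lib

# Listing local devices first does not raise.
device_lib.list_local_devices()

# Then restrict visibility to the first GPU...
gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_visible_devices(gpus[0], 'GPU')

# ...and listing again also did not raise in this ordering,
# unlike the ordering in the snippet above.
device_lib.list_local_devices()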
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 26 (10 by maintainers)
@cheshire Thank you for your reply. How do I run under TF_XLA_FLAGS='--tf_xla_enable_xla_devices=false'? I am not sure where to set this variable.
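For reference, a sketch of one common way to set such a variable (not maintainer guidance): export it in the shell before launching Python, or set it in os.environ before TensorFlow is imported.

import os

# Set the flag in the environment before importing TensorFlow.
# The shell equivalent would be:
#   export TF_XLA_FLAGS=--tf_xla_enable_xla_devices=false
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_xla_devices=false"

import tensorflow as tf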
My problem was solved by trying 0, 1, and 2, which are the indices of the different GPUs. One of my GPUs is not strong enough to train a DL model, so I can choose any GPU except that one.
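Assuming this refers to selecting a GPU by index (the comment does not say which mechanism was used), a hypothetical sketch using CUDA_VISIBLE_DEVICES would be:

import os

# Hypothetical example: expose only the GPU with index 1 to TensorFlow
# (the TITAN RTX in the nvidia-smi output above). The values 0, 1, 2
# mentioned in the comment would go here.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())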