tensorflow: device_lib.list_local_devices() InvalidArgumentError: Invalid device ordinal value (1). Valid range is [0, 0].
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): gLinux
- TensorFlow installed from (source or binary): binary (pip3)
- TensorFlow version (use command below): 2.0.0-rc1
- Python version: 3.6.2
- CUDA/cuDNN version: 10.0, 7.6.3
- GPU model and memory: Quadro P1000 (4 GB) and TITAN RTX (24 GB)
Output of nvidia-smi from the terminal:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.34 Driver Version: 430.34 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P1000 Off | 00000000:65:00.0 On | N/A |
| 37% 52C P0 N/A / N/A | 1289MiB / 4037MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN RTX Off | 00000000:B3:00.0 Off | N/A |
| 41% 29C P8 14W / 280W | 1155MiB / 24220MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
NOTE: the nvidia-smi output above reports CUDA Version: 10.1, but my LD_LIBRARY_PATH environment variable points to CUDA 10.0.
Snippet of code that causes the problem:
import tensorflow as tf
from tensorflow.python.client import device_lib

# Listing local devices right after import raises the error below.
device_lib.list_local_devices()
Error message:
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-2-b6f1169dc7e5> in <module>
2 from tensorflow.python.client import device_lib
3
---> 4 device_lib.list_local_devices()
~/.local/lib/python3.6/site-packages/tensorflow_core/python/client/device_lib.py in list_local_devices(session_config)
39 return [
40 _convert(s)
---> 41 for s in pywrap_tensorflow.list_devices(session_config=session_config)
42 ]
~/.local/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py in list_devices(session_config)
2247 return ListDevicesWithSessionConfig(session_config.SerializeToString())
2248 else:
-> 2249 return ListDevices()
2250
2251
InvalidArgumentError: Invalid device ordinal value (1). Valid range is [0, 0].
while setting up XLA_GPU_JIT device number 1
Potential cause and current workaround:
In the nvidia-smi output I noticed that the Quadro P1000 in my workstation has only 5 multiprocessors, so by default TF will not use it (the minimum is 8). As a workaround I added the following line to my .bashrc and ran source .bashrc, and the error went away:
export TF_MIN_GPU_MULTIPROCESSOR_COUNT=5
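For reference, the same workaround can also be applied from Python rather than .bashrc. This is just a sketch, under the assumption that the variable is read when TensorFlow initializes its GPU devices, so it has to be set before the first import:

import os

# Assumption: set the variable before the first import of tensorflow,
# so it is already in place when TF initializes its GPU devices.
os.environ["TF_MIN_GPU_MULTIPROCESSOR_COUNT"] = "5"

import tensorflow as tf
from tensorflow.python.client import device_lib

# With the lowered threshold both GPUs should be eligible again.
print(device_lib.list_local_devices())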
Another potential solution, if I don't want to lower the minimum GPU multiprocessor count, is to remove the Quadro P1000 from my workstation. I suspect there is an inconsistency within list_local_devices(): it fetches all GPUs in the workstation but does not apply the minimum-multiprocessor-count rule when doing so. So I ran another experiment to see whether I could still reproduce the error after setting TF_MIN_GPU_MULTIPROCESSOR_COUNT to 5.
The code below reproduces the same error:
import tensorflow as tf
from tensorflow.python.client import device_lib

# Make only the first physical GPU visible to TensorFlow...
gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_visible_devices(gpus[0], 'GPU')

# ...then listing local devices raises the same InvalidArgumentError.
device_lib.list_local_devices()
This produces the same error. However, if we call device_lib.list_local_devices() first, then call tf.config.experimental.set_visible_devices(gpus[0], 'GPU'), and then call device_lib.list_local_devices() again, there is no error. I suspect that setting the visible devices interacts badly with list_local_devices().
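To make the ordering explicit, here is a sketch of the sequence that did not raise for me (with TF_MIN_GPU_MULTIPROCESSOR_COUNT still set to 5):

import tensorflow as tf
from tensorflow.python.client import device_lib

# Listing local devices first does not raise.
device_lib.list_local_devices()

# Then restrict visibility to the first GPU...
gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_visible_devices(gpus[0], 'GPU')

# ...and listing again also did not raise in this ordering,
# unlike the ordering in the snippet above.
device_lib.list_local_devices()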
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 26 (10 by maintainers)
@cheshire Thank you for your reply. How do I run under TF_XLA_FLAGS='--tf_xla_enable_xla_devices=false'? I am not sure where to set this variable.
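For reference, a sketch of one common way to set such a variable (not maintainer guidance): export it in the shell before launching Python, or set it in os.environ before TensorFlow is imported.

import os

# Set the flag in the environment before importing TensorFlow.
# The shell equivalent would be:
#   export TF_XLA_FLAGS=--tf_xla_enable_xla_devices=false
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_xla_devices=false"

import tensorflow as tf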
My problem was solved by trying 0, 1, and 2, which are the indices of the different GPUs. One of my GPUs is not strong enough to train a DL model, so I can choose any GPU except that one.
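Assuming this refers to selecting a GPU by index (the comment does not say which mechanism was used), a hypothetical sketch using CUDA_VISIBLE_DEVICES would be:

import os

# Hypothetical example: expose only the GPU with index 1 to TensorFlow
# (the TITAN RTX in the nvidia-smi output above). The values 0, 1, 2
# mentioned in the comment would go here.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())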