tensorflow: CUDA_ERROR_NO_DEVICE inside docker with GTX Titan X
Running b.gcr.io/tensorflow/tensorflow:latest-gpu with CUDA 7.0 installed on the host, creating a session fails with CUDA_ERROR_NO_DEVICE
and "was unable to find libcuda.so DSO loaded into this program".
The strange thing is that when the module is imported, all the CUDA libraries are reported as loaded correctly.
Log:
root@5b1e79697b49:~# python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcublas.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcudnn.so.6.5 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcufft.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcurand.so.7.0 locally
>>>
>>> with tf.Session() as sess:
...   with tf.device("/gpu:0"):
...     matrix1 = tf.constant([[3., 3.]])
...     matrix2 = tf.constant([[2.],[2.]])
...     product = tf.matmul(matrix1, matrix2)
...     sess.run(product)
...
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 8
E tensorflow/stream_executor/cuda/cuda_driver.cc:481] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:114] retrieving CUDA diagnostic information for host: 5b1e79697b49
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:121] hostname: 5b1e79697b49
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:146] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:257] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 358.16 Mon Nov 16 19:25:55 PST 2015
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:150] kernel reported version is: 358.16
I tensorflow/core/common_runtime/gpu/gpu_init.cc:127] DMA:
I tensorflow/core/common_runtime/direct_session.cc:58] Direct session inter op parallelism threads: 8
Traceback (most recent call last):
File "<stdin>", line 6, in <module>
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 368, in run
results = self._do_run(target_list, unique_fetch_targets, feed_dict_string)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 444, in _do_run
e.code)
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'Const_1': Could not satisfy explicit device specification '/gpu:0'
[[Node: Const_1 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [2,1] values: 2 2>, _device="/gpu:0"]()]]
Caused by op u'Const_1', defined at:
File "<stdin>", line 4, in <module>
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/constant_op.py", line 165, in constant
attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1834, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1043, in __init__
self._traceback = _extract_stack()
>>>
Versions:
- OS:
CentOS Linux release 7.2.1511 (Core)
- Kernel:
Linux localhost.localdomain 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
- GPU:
NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
- Docker:
1.9.1
Env:
CUDA_HOME=/usr/local/cuda-7.0
LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib64:
Command:
docker run -it \
  -v /usr/lib/x86_64-linux-gnu/libcudadevrt.a:/usr/lib/x86_64-linux-gnu/libcudadevrt.a \
  -v /usr/lib/x86_64-linux-gnu/libcudart.so:/usr/lib/x86_64-linux-gnu/libcudart.so \
  -v /usr/lib/x86_64-linux-gnu/libcudart.so.7.0:/usr/lib/x86_64-linux-gnu/libcudart.so.7.0 \
  -v /usr/lib/x86_64-linux-gnu/libcudart.so.7.0.28:/usr/lib/x86_64-linux-gnu/libcudart.so.7.0.28 \
  -v /usr/lib/x86_64-linux-gnu/libcudart_static.a:/usr/lib/x86_64-linux-gnu/libcudart_static.a \
  -v /usr/lib/x86_64-linux-gnu/libcuda.so:/usr/lib/x86_64-linux-gnu/libcuda.so \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidiactl:/dev/nvidiactl \
  --device /dev/nvidia-modeset:/dev/nvidia-modeset \
  b.gcr.io/tensorflow/tensorflow:latest-gpu
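One workaround suggested in the comments is to make the CUDA device ordering and visibility explicit before TensorFlow initializes CUDA (these environment variables are read when cuInit is first called, so set them before creating a session):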
import os
# Order devices by PCI bus ID so CUDA's device 0 matches nvidia-smi's device 0.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# Expose only the first GPU to CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
Here are the debugging steps that I recently went through to get TensorFlow working (kind of) with GPU and Docker. (I say "kind of" because there are still some GPU-related bugs in TensorFlow, which caused some test failures and will likely cause some user-code errors as well.)
See: https://github.com/tensorflow/tensorflow/issues/952 https://github.com/tensorflow/tensorflow/issues/953
That said, here are the things you want to check on:
On the host, outside Docker, make sure the NVIDIA driver is installed. If it is, the binaries nvidia-smi and nvidia-debugdump should be available on the host. Make sure that the following two commands list your GPU:
nvidia-smi
nvidia-debugdump -l
On the host, the output of nvidia-smi tells you the version of the NVIDIA driver installed. It needs to be recent enough for your GPU. For example, on my machine, version 340 doesn’t work, but version 352 does.
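The same kernel-module version that TensorFlow's diagnostics print (the cuda_diagnostics.cc:257 line in the log above) can also be read directly from procfs; a minimal sketch in Python:

# Minimal sketch: read the driver version string the same way TensorFlow's
# cuda_diagnostics does, from the kernel module's procfs entry.
with open("/proc/driver/nvidia/version") as f:
    # e.g. "NVRM version: NVIDIA UNIX x86_64 Kernel Module  358.16 ..."
    print(f.read())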
On the host, get the CUDA sample code and compile the deviceQuery binary: http://docs.nvidia.com/cuda/cuda-samples/#axzz3z3C3lhk1
For this you’ll need to install the CUDA toolkit, which includes the nvcc compiler and supporting libraries. https://developer.nvidia.com/cuda-downloads
Once the deviceQuery binary is compiled, try to run it
./deviceQuery
If it fails, don’t panic, just try
sudo ./deviceQuery
There are some file permission issues related to the NVIDIA devices in /dev that require you to use sudo as above after each boot cycle. For a detailed discussion, see: https://devtalk.nvidia.com/default/topic/749939/cuda-is-not-active-unless-i-run-it-with-sudo-privillages-/
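To see whether this applies to you, check the permission bits on the device nodes; a minimal sketch, assuming the usual /dev/nvidia* naming:

import glob
import os
import stat

# List each NVIDIA device node with its permission bits; CUDA needs
# read/write access to /dev/nvidia0 and /dev/nvidiactl.
for dev in sorted(glob.glob("/dev/nvidia*")):
    mode = stat.S_IMODE(os.stat(dev).st_mode)
    accessible = os.access(dev, os.R_OK | os.W_OK)
    print("%-20s %s accessible=%s" % (dev, oct(mode), accessible))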
After this, make sure that the following files are present inside your Docker container, as (the current version of) TensorFlow dlopens them at import time; they are the libraries the dso_loader reports opening in the log above: libcublas.so.7.0, libcudnn.so.6.5, libcufft.so.7.0, libcurand.so.7.0, and libcuda.so.
It is okay for the NVIDIA driver itself to be unavailable inside the Docker container, even though the CUDA Toolkit is required inside it.
Also, map the CUDA library files into the container (the -v mounts in the docker run command above), to make sure they are visible inside it; the sketch below shows a quick way to check.
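A quick way to reproduce the exact failure from inside the container is to dlopen libcuda.so and call cuInit directly; a minimal sketch using ctypes (CUDA_ERROR_NO_DEVICE is status code 100 in cuda.h):

import ctypes

# Load the driver API library the same way TensorFlow's dso_loader does.
try:
    libcuda = ctypes.CDLL("libcuda.so")
except OSError as e:
    raise SystemExit("libcuda.so is not on the loader path: %s" % e)

# cuInit(0) is the call that fails with CUDA_ERROR_NO_DEVICE in the log above.
status = libcuda.cuInit(0)
print("cuInit returned %d (0 = CUDA_SUCCESS, 100 = CUDA_ERROR_NO_DEVICE)" % status)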
Now the NVIDIA docker container should be ready to run TensorFlow with GPU.
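To verify, re-run the matmul example from the log; with the device visible it should print [[ 12.]] instead of raising InvalidArgumentError:

import tensorflow as tf

# Same example as in the log above, pinned to the first GPU.
with tf.device("/gpu:0"):
    matrix1 = tf.constant([[3., 3.]])
    matrix2 = tf.constant([[2.], [2.]])
    product = tf.matmul(matrix1, matrix2)

with tf.Session() as sess:
    print(sess.run(product))  # expect [[ 12.]]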