tensorflow: CUDA_ERROR_NO_DEVICE inside docker with GTX Titan X

Running b.gcr.io/tensorflow/tensorflow:latest-gpu with CUDA 7.0 installed on the host, creating a session fails with CUDA_ERROR_NO_DEVICE and "was unable to find libcuda.so DSO loaded into this program". The strange thing is that when the module is imported, all the CUDA libraries are reported as loaded correctly.

Log:

root@5b1e79697b49:~# python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcublas.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcudnn.so.6.5 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcufft.so.7.0 locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcurand.so.7.0 locally
>>> 
>>> with tf.Session() as sess:
...   with tf.device("/gpu:0"):
...     matrix1 = tf.constant([[3., 3.]])
...     matrix2 = tf.constant([[2.],[2.]])
...     product = tf.matmul(matrix1, matrix2)
...     sess.run(product)
... 
I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 8
E tensorflow/stream_executor/cuda/cuda_driver.cc:481] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:114] retrieving CUDA diagnostic information for host: 5b1e79697b49
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:121] hostname: 5b1e79697b49
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:146] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:257] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  358.16  Mon Nov 16 19:25:55 PST 2015
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) 
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:150] kernel reported version is: 358.16
I tensorflow/core/common_runtime/gpu/gpu_init.cc:127] DMA: 
I tensorflow/core/common_runtime/direct_session.cc:58] Direct session inter op parallelism threads: 8
Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 368, in run
    results = self._do_run(target_list, unique_fetch_targets, feed_dict_string)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 444, in _do_run
    e.code)
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'Const_1': Could not satisfy explicit device specification '/gpu:0'
         [[Node: Const_1 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [2,1] values: 2 2>, _device="/gpu:0"]()]]
Caused by op u'Const_1', defined at:
  File "<stdin>", line 4, in <module>
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/constant_op.py", line 165, in constant
    attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1834, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1043, in __init__
    self._traceback = _extract_stack()

>>> 

Versions:

  • OS: CentOS Linux release 7.2.1511 (Core)
  • Kernel: Linux localhost.localdomain 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
  • GPU: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
  • Docker: 1.9.1

Env:

CUDA_HOME=/usr/local/cuda-7.0
LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib64:

Command:

docker run -it \
  -v /usr/lib/x86_64-linux-gnu/libcudadevrt.a:/usr/lib/x86_64-linux-gnu/libcudadevrt.a \
  -v /usr/lib/x86_64-linux-gnu/libcudart.so:/usr/lib/x86_64-linux-gnu/libcudart.so \
  -v /usr/lib/x86_64-linux-gnu/libcudart.so.7.0:/usr/lib/x86_64-linux-gnu/libcudart.so.7.0 \
  -v /usr/lib/x86_64-linux-gnu/libcudart.so.7.0.28:/usr/lib/x86_64-linux-gnu/libcudart.so.7.0.28 \
  -v /usr/lib/x86_64-linux-gnu/libcudart_static.a:/usr/lib/x86_64-linux-gnu/libcudart_static.a \
  -v /usr/lib/x86_64-linux-gnu/libcuda.so:/usr/lib/x86_64-linux-gnu/libcuda.so \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidiactl:/dev/nvidiactl \
  --device /dev/nvidia-modeset:/dev/nvidia-modeset \
  b.gcr.io/tensorflow/tensorflow:latest-gpu

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 1
  • Comments: 28 (13 by maintainers)

Most upvoted comments

import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
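(These environment variables take effect only if they are set before TensorFlow initializes CUDA, in practice before the first import tensorflow. CUDA_DEVICE_ORDER=PCI_BUS_ID makes CUDA enumerate devices in the same order as nvidia-smi, and CUDA_VISIBLE_DEVICES=0 exposes only the first GPU to the process.)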

Here are the debugging steps that I recently went through to get TensorFlow working (kind of) with GPU and Docker. (I say "kind of" because there are still some GPU-related bugs in TensorFlow, which caused some test failures and will likely cause some user-code errors as well.)

See: https://github.com/tensorflow/tensorflow/issues/952 https://github.com/tensorflow/tensorflow/issues/953

That said, here are the things you want to check on:

  1. On the host, outside Docker, make sure the NVIDIA driver is installed; if it is, the binaries "nvidia-smi" and "nvidia-debugdump" ought to be available. Make sure that the following two commands list your GPU:

nvidia-smi
nvidia-debugdump -l
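An equivalent scripted check, as a minimal sketch (it assumes both binaries are on the host's PATH):

import subprocess

# Both commands should list the GPU; a CalledProcessError or empty
# output points at a driver installation problem on the host.
for cmd in (["nvidia-smi"], ["nvidia-debugdump", "-l"]):
    print(subprocess.check_output(cmd).decode())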

  2. On the host, the output of nvidia-smi tells you the version of the NVIDIA driver installed. It needs to be recent enough for your GPU. For example, on my machine, version 340 doesn’t work, but version 352 does.

  3. On the host, get the CUDA sample code and compile the deviceQuery binary: http://docs.nvidia.com/cuda/cuda-samples/#axzz3z3C3lhk1

For this you’ll need to install the CUDA toolkit, which includes the nvcc compiler and supporting libraries. https://developer.nvidia.com/cuda-downloads

Once the deviceQuery binary is compiled, try to run it:

./deviceQuery

If it fails, don't panic; just try:

sudo ./deviceQuery

There are file-permission issues related to the NVIDIA devices in /dev that require you to use sudo like the above after each boot cycle. For a detailed discussion, see: https://devtalk.nvidia.com/default/topic/749939/cuda-is-not-active-unless-i-run-it-with-sudo-privillages-/
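A quick way to spot this permission problem from Python, as a sketch (device paths taken from the docker command above; /dev/nvidia-modeset does not exist on every setup):

import os

# If a device node is missing, or not readable/writable by the current
# user, CUDA programs will need sudo (or a permissions fix) to run.
for dev in ("/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-modeset"):
    exists = os.path.exists(dev)
    usable = exists and os.access(dev, os.R_OK | os.W_OK)
    print("%-20s exists=%s usable=%s" % (dev, exists, usable))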

  4. After step 3, the CUDA GPU is all set on the host. Let's now look inside Docker. Make sure that you install the same CUDA Toolkit as listed in step 3. TensorFlow additionally requires the CUDA DNN (cuDNN) libraries, which you can also get from the CUDA website after a somewhat time-consuming user approval process.

After this, make sure that the following files are present inside your Docker container, as they will be used by (the current version of) TensorFlow:

/usr/local/cuda/include/cudnn.h
/usr/local/cuda/lib64/libcudnn_static.a
/usr/local/cuda/lib64/libcudnn.so.6.5.48
/usr/local/cuda/lib64/libcudnn.so.6.5 -> libcudnn.so.6.5.48
/usr/local/cuda/lib64/libcudnn.so -> libcudnn.so.6.5

It is okay for the NVIDIA driver to be unavailable inside the Docker container, even though the CUDA Toolkit is required inside it.
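To double-check, here is a small sketch that verifies the files listed above are present inside the container:

import os.path

# Paths from the list above. os.path.exists follows symlinks, so a
# dangling libcudnn.so symlink will also show up as MISSING.
expected = [
    "/usr/local/cuda/include/cudnn.h",
    "/usr/local/cuda/lib64/libcudnn_static.a",
    "/usr/local/cuda/lib64/libcudnn.so.6.5.48",
    "/usr/local/cuda/lib64/libcudnn.so.6.5",
    "/usr/local/cuda/lib64/libcudnn.so",
]
for path in expected:
    print("%-45s %s" % (path, "ok" if os.path.exists(path) else "MISSING"))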

  5. Before you can start your Docker container, make sure you use Docker flags to map a few devices so that the NVIDIA devices are available under /dev inside the container:

--device /dev/nvidia0:/dev/nvidia0
--device /dev/nvidiactl:/dev/nvidiactl

Also, map a number of library files to make sure that the CUDA library files are visible inside the container:

"-v /usr/lib/x86_64-linux-gnu/libcudadevrt.a:/usr/lib/x86_64-linux-gnu/libcudadevrt.a" "-v /usr/lib/x86_64-linux-gnu/libcudart.so:/usr/lib/x86_64-linux-gnu/libcudart.so" "-v /usr/lib/x86_64-linux-gnu/libcudart.so.5.5:/usr/lib/x86_64-linux-gnu/libcudart.so.5.5" "-v /usr/lib/x86_64-linux-gnu/libcudart.so.5.5.22:/usr/lib/x86_64-linux-gnu/libcudart.so.5.5.22" "-v /usr/lib/x86_64-linux-gnu/libcudart_static.a:/usr/lib/x86_64-linux-gnu/libcudart_static.a" "-v /usr/lib/x86_64-linux-gnu/libcuda.so:/usr/lib/x86_64-linux-gnu/libcuda.so" "-v /usr/lib/x86_64-linux-gnu/libcuda.so.1:/usr/lib/x86_64-linux-gnu/libcuda.so.1" "-v /usr/lib/x86_64-linux-gnu/libcuda.so.352.63:/usr/lib/x86_64-linux-gnu/libcuda.so.352.63"

Now the NVIDIA docker container should be ready to run TensorFlow with GPU.