tensorflow: Docker with GPU 2.3rc0 CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

It seems that the Docker image tensorflow/tensorflow:2.3.0rc0-gpu won’t work with my GPU, but the image tensorflow/tensorflow:2.2.0rc0-gpu works fine.

In other words, the workaround for this issue was to “downgrade” to tensorflow/tensorflow:2.2.0rc0-gpu. Note that tensorflow/tensorflow:2.3.0rc0-gpu also works fine on CPU only.
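For anyone who wants to stay on the 2.3.0rc0 image in the meantime, the GPU can be hidden from TensorFlow entirely so everything runs on CPU. A minimal sketch — the key point is that the environment variable must be set before TensorFlow first touches the CUDA runtime:

```python
import os

# Hide all CUDA devices from this process. TensorFlow creates its CUDA
# context lazily on first use, so this must be set before the first
# `import tensorflow` (or at least before any op runs).
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# import tensorflow as tf                     # now sees no GPUs
# tf.config.list_physical_devices("GPU")      # -> []
```

The same effect can be had from the shell by adding `-e CUDA_VISIBLE_DEVICES=-1` to the `docker run` line.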

System information

  • Ubuntu 20.04
  • TensorFlow through Docker
  • TensorFlow version: 2.3.0rc0 (from the Docker image tag)
  • GPU model and memory: Geforce GTX 960M, coreClock: 1.176GHz coreCount: 5 deviceMemorySize: 1.96GiB deviceMemoryBandwidth: 74.65GiB/s
  • GPU drivers: 440.100

How to reproduce

> docker run -it --rm --gpus all  --entrypoint bash tensorflow/tensorflow:2.3.0rc0-gpu
> python
>>> import tensorflow as tf
>>> inputs = tf.keras.layers.Input(shape=(None,), name="input")
>>> embedded = tf.keras.layers.Embedding(100, 16)(inputs)

Full stack trace:

2020-07-06 18:46:55.604377: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-07-06 18:46:55.608404: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-06 18:46:55.608911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 960M computeCapability: 5.0
coreClock: 1.176GHz coreCount: 5 deviceMemorySize: 1.96GiB deviceMemoryBandwidth: 74.65GiB/s
2020-07-06 18:46:55.608943: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-07-06 18:46:55.610544: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-07-06 18:46:55.611696: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-07-06 18:46:55.611988: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-07-06 18:46:55.613589: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-07-06 18:46:55.614478: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-07-06 18:46:55.618025: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-07-06 18:46:55.618159: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-06 18:46:55.618734: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-06 18:46:55.619206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-07-06 18:46:55.619480: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-07-06 18:46:55.643133: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2693910000 Hz
2020-07-06 18:46:55.643781: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x44161a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-06 18:46:55.643809: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-07-06 18:46:55.725002: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-06 18:46:55.725324: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x44aa610 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-07-06 18:46:55.725349: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 960M, Compute Capability 5.0
2020-07-06 18:46:55.725532: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-06 18:46:55.725767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 960M computeCapability: 5.0
coreClock: 1.176GHz coreCount: 5 deviceMemorySize: 1.96GiB deviceMemoryBandwidth: 74.65GiB/s
2020-07-06 18:46:55.725796: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-07-06 18:46:55.725828: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-07-06 18:46:55.725854: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-07-06 18:46:55.725882: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-07-06 18:46:55.725908: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-07-06 18:46:55.725938: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-07-06 18:46:55.725988: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-07-06 18:46:55.726091: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-06 18:46:55.726485: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-06 18:46:55.726724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-07-06 18:46:55.726756: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 926, in __call__
    input_list)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 1098, in _functional_construction_call
    self._maybe_build(inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 2643, in _maybe_build
    self.build(input_shapes)  # pylint:disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/utils/tf_utils.py", line 323, in wrapper
    output_shape = fn(instance, input_shape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/layers/embeddings.py", line 135, in build
    if (context.executing_eagerly() and context.context().num_gpus() and
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 1082, in num_gpus
    self.ensure_initialized()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 539, in ensure_initialized
    context_handle = pywrap_tfe.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 58 (18 by maintainers)

Most upvoted comments

For an Nvidia 3090 on Ubuntu 20.04, with CUDA 10.1, cuDNN 7.6, and Nvidia GPU driver 455, I have the same issue.

Apologies for adding more activity to this issue @av8ramit but we wanted to find out if there was going to be a point release of the TensorFlow C library v2.3 that has been patched with the correct CUDA capabilities? I only ask because v2.3 is the current stable version, it works with the standard CUDA version in Ubuntu 20.04, and when installing tensorflow through python for training with keras it also uses the same version.

We removed PTX for all but sm_70 from TF builds in cf1b6b3dfe9ba82e805fddf7f4462b2d92fe550a. We never shipped with kernels for sm_50, only sm_52. Apparently the driver was able to compile PTX for sm_52 to sm_50, even though it’s not officially supported.

If you want to run on a sm_50 card, it would be best to build TF from source.
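The dispatch rule behind the error can be sketched as a small model (illustrative only, not TensorFlow internals): a CUDA fatbinary runs natively if it ships SASS (a cubin) for the device’s exact compute capability, the driver can JIT shipped PTX for a newer-or-equal device, and otherwise CUDA reports “device kernel image is invalid”. The target sets below are hypothetical, chosen to mirror the description above.

```python
def kernel_image_status(device_cc, sass_targets, ptx_targets):
    """Illustrative model of CUDA fatbinary dispatch (not TF internals).

    device_cc:    (major, minor) compute capability of the GPU
    sass_targets: capabilities with precompiled machine code (cubins)
    ptx_targets:  capabilities with embedded PTX the driver may JIT
    """
    if device_cc in sass_targets:
        return "native SASS"
    if any(device_cc >= p for p in ptx_targets):
        return "driver JITs PTX (slow first start)"
    return "device kernel image is invalid"

# Hypothetical targets mirroring the comment above: SASS for sm_52 and
# sm_70, PTX kept only for compute_70 after the referenced commit.
SASS, PTX = {(5, 2), (7, 0)}, {(7, 0)}

print(kernel_image_status((5, 0), SASS, PTX))  # GTX 960M: no SASS, no usable PTX
print(kernel_image_status((5, 2), SASS, PTX))  # sm_52: runs natively
print(kernel_image_status((8, 6), SASS, PTX))  # newer card: JIT of compute_70 PTX
```

This is why the 960M (sm_50) worked while builds still carried PTX the driver could retarget, and stopped working once only sm_52 SASS and compute_70 PTX were shipped.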

Apologies. It seems our CI uploaded the wrong package under the new name after we refactored parts of the CI. I think it should be fixed now; can you give it a try, please?

mihaimaruseac@ankh:/tmp$ sha256sum libtensorflow-gpu-linux-x86_64-2.3.*
5e4d934fd7602b6d002a89b972371151aa478ea99bf1a70987833e11c34c8875  libtensorflow-gpu-linux-x86_64-2.3.0.tar.gz
bdfb52677cf9793dcd7b66254b647c885c417967629f58af4a19b386fa7e7e0f  libtensorflow-gpu-linux-x86_64-2.3.1.tar.gz
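To verify a downloaded archive against the digests above, a small helper (hypothetical, equivalent to running `sha256sum` on the file) can be used:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large tarballs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

# Digest quoted from the listing above:
# sha256_of("libtensorflow-gpu-linux-x86_64-2.3.1.tar.gz")
#   == "bdfb52677cf9793dcd7b66254b647c885c417967629f58af4a19b386fa7e7e0f"
```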

Looping in the release manager. @geetachavan1 would we be able to patch the fix for libtensorflow and release new binaries with the correct CUDA capabilities. Happy to help get this done internally.

Glad I could help a little, @motrek. You should be able to link against Python 3.8 if you have the libpython3.8-dev package installed.

For linking to the _pywrap_tensorflow library I just created a symlink to it in /usr/local/lib

lrwxrwxrwx 1 root root    93 Sep 16 11:16  lib_pywrap_tensorflow.so -> /home/dan/.local/lib/python3.8/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so

and then ran ldconfig. At which point it can be linked and also found at runtime.

There’s a slightly cleaner solution to setting the “allow growth” option by including the experimental header

#include <tensorflow/c/c_api_experimental.h>

and then using the TF_CreateConfig helper.

TF_Status* status = TF_NewStatus();
auto* options = TF_NewSessionOptions();
// Args: enable_xla_compilation, gpu_memory_allow_growth, num_cpu_devices.
auto* config = TF_CreateConfig(true, true, 8);

TF_SetConfig(options, config->data, config->length, status);
TF_DeleteBuffer(config);

Use the session options as normal.

No, the driver will be able to JIT compute_70 PTX and use it for any compute capability 7.x. The startup may be slow, but it will work.