tensorflow: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid?

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 20.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v2.3.0-rc2-23-gb36436b087 2.3.0
  • Python version: 3.8.2
  • CUDA/cuDNN version: Cuda 10.1/ cuDNN 7.6.5
  • GPU model and memory: Nvidia GTX 750Ti
  • Exact command to reproduce:
# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras

fashion_mnist = keras.datasets.fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
train_images = train_images / 255.0
test_images = test_images / 255.0

# At this step I was getting the error which I've posted below in the terminal.

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10)
])

Describe the problem

I’ve recently installed Ubuntu 20.04 LTS, which comes with Python 3.8. I installed nvidia-cuda-toolkit and the NVIDIA drivers, and I can confirm they are working fine.

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
$ nvidia-smi
Mon Aug  3 02:56:11 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  Off  | 00000000:01:00.0  On |                  N/A |
| 27%   38C    P0     1W /  38W |    245MiB /  1997MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       979      G   /usr/lib/xorg/Xorg                            20MiB |
+-----------------------------------------------------------------------------+

Now, when I try to build a small sequential model, I get an error that says InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

I don’t know what is causing the issue. This Ubuntu install is brand new, and as far as I can tell everything is installed correctly.

Source code / logs

2020-08-03 02:48:40.720575: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-08-03 02:48:40.750630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 750 Ti computeCapability: 5.0
coreClock: 1.137GHz coreCount: 5 deviceMemorySize: 1.95GiB deviceMemoryBandwidth: 80.47GiB/s
2020-08-03 02:48:40.750735: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-08-03 02:48:40.791690: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-08-03 02:48:40.815993: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-08-03 02:48:40.821924: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-08-03 02:48:40.863910: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-08-03 02:48:40.870559: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-08-03 02:48:40.945916: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-08-03 02:48:40.947130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-08-03 02:48:40.979471: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3311130000 Hz
2020-08-03 02:48:40.980123: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4aa6700 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-08-03 02:48:40.980190: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-08-03 02:48:41.121266: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x49375f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-08-03 02:48:41.121357: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 750 Ti, Compute Capability 5.0
2020-08-03 02:48:41.122574: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 750 Ti computeCapability: 5.0
coreClock: 1.137GHz coreCount: 5 deviceMemorySize: 1.95GiB deviceMemoryBandwidth: 80.47GiB/s
2020-08-03 02:48:41.122676: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-08-03 02:48:41.122762: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-08-03 02:48:41.122830: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-08-03 02:48:41.122898: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-08-03 02:48:41.122963: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-08-03 02:48:41.123029: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-08-03 02:48:41.123145: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-08-03 02:48:41.124618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-08-03 02:48:41.124716: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1

---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
<ipython-input-4-ac4dc71cdd20> in <module>
----> 1 model = keras.Sequential([
      2     keras.layers.Flatten(input_shape=(28, 28)),
      3     keras.layers.Dense(128, activation='relu'),
      4     keras.layers.Dense(10)
      5 ])

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py in _method_wrapper(self, *args, **kwargs)
    455     self._self_setattr_tracking = False  # pylint: disable=protected-access
    456     try:
--> 457       result = method(self, *args, **kwargs)
    458     finally:
    459       self._self_setattr_tracking = previous_value  # pylint: disable=protected-access

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/keras/engine/sequential.py in __init__(self, layers, name)
    114     """
    115     # Skip the init in FunctionalModel since model doesn't have input/output yet
--> 116     super(functional.Functional, self).__init__(  # pylint: disable=bad-super-call
    117         name=name, autocast=False)
    118     self.supports_masking = True

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py in _method_wrapper(self, *args, **kwargs)
    455     self._self_setattr_tracking = False  # pylint: disable=protected-access
    456     try:
--> 457       result = method(self, *args, **kwargs)
    458     finally:
    459       self._self_setattr_tracking = previous_value  # pylint: disable=protected-access

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py in __init__(self, *args, **kwargs)
    306     self._steps_per_execution = None
    307 
--> 308     self._init_batch_counters()
    309     self._base_model_initialized = True
    310     _keras_api_gauge.get_cell('model').set(True)

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py in _method_wrapper(self, *args, **kwargs)
    455     self._self_setattr_tracking = False  # pylint: disable=protected-access
    456     try:
--> 457       result = method(self, *args, **kwargs)
    458     finally:
    459       self._self_setattr_tracking = previous_value  # pylint: disable=protected-access

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py in _init_batch_counters(self)
    315     # `evaluate`, and `predict`.
    316     agg = variables.VariableAggregationV2.ONLY_FIRST_REPLICA
--> 317     self._train_counter = variables.Variable(0, dtype='int64', aggregation=agg)
    318     self._test_counter = variables.Variable(0, dtype='int64', aggregation=agg)
    319     self._predict_counter = variables.Variable(

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs)
    260       return cls._variable_v1_call(*args, **kwargs)
    261     elif cls is Variable:
--> 262       return cls._variable_v2_call(*args, **kwargs)
    263     else:
    264       return super(VariableMetaclass, cls).__call__(*args, **kwargs)

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/ops/variables.py in _variable_v2_call(cls, initial_value, trainable, validate_shape, caching_device, name, variable_def, dtype, import_scope, constraint, synchronization, aggregation, shape)
    242     if aggregation is None:
    243       aggregation = VariableAggregation.NONE
--> 244     return previous_getter(
    245         initial_value=initial_value,
    246         trainable=trainable,

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/ops/variables.py in <lambda>(**kws)
    235                         shape=None):
    236     """Call on Variable class. Useful to force the signature."""
--> 237     previous_getter = lambda **kws: default_variable_creator_v2(None, **kws)
    238     for _, getter in ops.get_default_graph()._variable_creator_stack:  # pylint: disable=protected-access
    239       previous_getter = _make_getter(getter, previous_getter)

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/ops/variable_scope.py in default_variable_creator_v2(next_creator, **kwargs)
   2631   shape = kwargs.get("shape", None)
   2632 
-> 2633   return resource_variable_ops.ResourceVariable(
   2634       initial_value=initial_value,
   2635       trainable=trainable,

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs)
    262       return cls._variable_v2_call(*args, **kwargs)
    263     else:
--> 264       return super(VariableMetaclass, cls).__call__(*args, **kwargs)
    265 
    266 

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py in __init__(self, initial_value, trainable, collections, validate_shape, caching_device, name, dtype, variable_def, import_scope, constraint, distribute_strategy, synchronization, aggregation, shape)
   1505       self._init_from_proto(variable_def, import_scope=import_scope)
   1506     else:
-> 1507       self._init_from_args(
   1508           initial_value=initial_value,
   1509           trainable=trainable,

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py in _init_from_args(self, initial_value, trainable, collections, caching_device, name, dtype, constraint, synchronization, aggregation, distribute_strategy, shape)
   1648         with ops.get_default_graph()._attr_scope({"_class": attr}):
   1649           with ops.name_scope("Initializer"), device_context_manager(None):
-> 1650             initial_value = ops.convert_to_tensor(
   1651                 initial_value() if init_from_fn else initial_value,
   1652                 name="initial_value", dtype=dtype)

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
   1497 
   1498     if ret is None:
-> 1499       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1500 
   1501     if ret is NotImplemented:

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/framework/tensor_conversion_registry.py in _default_conversion_function(***failed resolving arguments***)
     50 def _default_conversion_function(value, dtype, name, as_ref):
     51   del as_ref  # Unused.
---> 52   return constant_op.constant(value, dtype, name=name)
     53 
     54 

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
    261     ValueError: if called on a symbolic tensor.
    262   """
--> 263   return _constant_impl(value, dtype, shape, name, verify_shape=False,
    264                         allow_broadcast=True)
    265 

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    273       with trace.Trace("tf.constant"):
    274         return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
--> 275     return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    276 
    277   g = ops.get_default_graph()

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py in _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    298 def _constant_eager_impl(ctx, value, dtype, shape, verify_shape):
    299   """Implementation of eager constant."""
--> 300   t = convert_to_eager_tensor(value, ctx, dtype)
    301   if shape is None:
    302     return t

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
     95     except AttributeError:
     96       dtype = dtypes.as_dtype(dtype).as_datatype_enum
---> 97   ctx.ensure_initialized()
     98   return ops.EagerTensor(value, ctx.device_name, dtype)
     99 

/mnt/Work/work_env/lib/python3.8/site-packages/tensorflow/python/eager/context.py in ensure_initialized(self)
    537         if self._use_tfrt is not None:
    538           pywrap_tfe.TFE_ContextOptionsSetTfrt(opts, self._use_tfrt)
--> 539         context_handle = pywrap_tfe.TFE_NewContext(opts)
    540       finally:
    541         pywrap_tfe.TFE_DeleteContextOptions(opts)

InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 25
  • Comments: 41 (11 by maintainers)

Most upvoted comments

Yep, my read is that some summer intern thought it was a good idea to not support old hardware anymore to reduce the pip binary file size and to make some metrics like tensorflow average startup time go down (TF2 is a lot slower than TF1 on those old GPUs).

In https://github.com/tensorflow/tensorflow/releases/tag/v2.3.0

GPU TF 2.3 includes PTX kernels only for compute capability 7.0 to reduce the TF pip binary size. Earlier releases included PTX for a variety of older compute capabilities.

Now it’s September, holidays should be over even though work from home is probably still in place, and a lot of people will update and discover that this screws them over one way or another. Hopefully someone at tensorflow will take the helm back and turn the flag back on. It’s incredibly short-sighted to make a non-backward-compatible breaking change that prevents a significant fraction of users from using tensorflow at all.

Staying with an old tensorflow version is a no-go because you can’t use the latest algorithms, as @elvis1020 is showing.

I understand that some operations may benefit from the newer compute capabilities, but that shouldn’t prevent glorified matrix multiplications from running.

I use my laptop as my development machine because the machines with powerful GPUs are already running. If I can’t have the same version of tensorflow on development and production, it’s a deal breaker for using tensorflow.

Also, my old laptops are reused as robot brains/passive monitoring tools, so if I can’t run tensorflow on them they become useless; another deal breaker for using tensorflow.

For the record, this is kind of critical for me and will make me migrate all my code to torch within a month if nothing changes.

TF 2.3 doesn’t work with my laptop’s GPU “GeForce GTX 960M”, which is compute capability 5.0. TF 2.2 works, though. I’m not compiling from source. Guess it’s time to move to torch.

Adding the following:

import os
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

Solved the problem for me
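
For reference, a minimal sketch of the same workaround through the public API (with the caveat that allow-growth mainly stops TF from reserving all GPU memory up front, so it is more likely to help with memory-related failures than with missing kernels for an old compute capability):

import os
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'  # must be set before TF initializes the GPU

import tensorflow as tf

# Equivalent call through the public API, applied to every visible GPU.
# It has to run before any op touches the GPU; afterwards it raises an error.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)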

Solved this error on an RTX 3070 with the following specs: CUDA 11.0, cuDNN 8, TensorFlow 2.4.

I started having this same issue today. Getting the same error using Keras with an Nvidia RTX 2080 Super on Ubuntu 20.04.

Hi everyone,

Starting from the latest nightly build, TF nightly should now work on GPUs with compute capability 5.0 like the GeForce GTX 960M. Please give it a try and let us know how it goes.

@abhipn Here are the installation instructions on the TF website. Thanks!
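
If you try the nightly, a quick sanity check looks roughly like this (a sketch; the package name is tf-nightly and the exact dev version depends on the day you install):

# pip install tf-nightly
import tensorflow as tf

print(tf.__version__)                          # e.g. 2.4.0-devYYYYMMDD
print(tf.config.list_physical_devices('GPU'))  # should list the compute-capability-5.0 card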

I just closed the terminal and opened it again, and it works for me.

Hello, I am facing the same issue. I am running tf 2.2.0 in an NVIDIA container; it comes with CUDA 11 by default, but the GPU is not being recognized in my code. Now it gives me the following error (when it tries to load mtcnn): tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory

@unrealwill if you’re running on linux you can use TF docker images to avoid having to deal with CUDA and cuDNN versions.
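
For example, something along these lines (a sketch; it assumes the NVIDIA Container Toolkit is installed so Docker can pass the GPU through):

docker pull tensorflow/tensorflow:latest-gpu
docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu \
  python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"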

I have a Quadro M1200 which has compute capability 5 and I am still getting this same error.

Hi @abhipn,

You can enter the compute capability when the configure script says:

Please note that each additional compute capability significantly increases your build time and binary size, and that TensorFlow only supports compute capabilities >= 3.5 [Default is: 3.5,7.0]: (Enter 5.0, and possibly others if needed.)

As for compile time, I don’t think it will build quickly (although targeting just 5.0 should cut down on the compile time); I recommend building it on a beefy GCP VM for a quicker turnaround.

@abhipn we no longer ship PTX for some older GPUs, which includes the one you’re using. The easiest solution for you is probably to build the pip package from source with support for your GPU (sm_50) included. See https://www.tensorflow.org/install/gpu#hardware_requirements.
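
For anyone attempting the source build, the rough shape is below. This is only a sketch under assumptions (a checkout of the matching release branch, CUDA/cuDNN and bazel already set up); the authoritative steps are at https://www.tensorflow.org/install/source.

# from a checkout of the matching release branch (e.g. r2.3)
export TF_CUDA_COMPUTE_CAPABILITIES=5.0   # pre-seed the compute-capability answer for ./configure
./configure                               # enable CUDA support when prompted
bazel build --config=cuda //tensorflow/tools/pip_package:build_pip_package
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-*.whl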

@abhipn Looks like the GPU was detected, but some libraries (like libcudart.so.10.1) are loaded from CUDA 10.1 while other libraries (for example libcublas.so.10, libcusolver.so.10, etc.) are loaded from CUDA 10.0. Please check your error trace.

Can you please uninstall the CUDA toolkits (10.1 and 10.0), uninstall TF, restart, install CUDA 10.1, and finally reinstall TF? Please let me know how it progresses. Thanks!
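
A quick way to check which CUDA/cuDNN versions the installed wheel was built against (available from TF 2.3 onwards; a diagnostic sketch, not a fix):

import tensorflow as tf

build = tf.sysconfig.get_build_info()
print(build.get('cuda_version'), build.get('cudnn_version'))  # CUDA/cuDNN the wheel expects
# Compare this against the libcud*.so versions shown in the dso_loader lines of the log above.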

I have the same error with a 3070 and the same TF version.

I have recently purchased a 3090 and built tensorflow==2.3.0 from source with CUDA 11.1 and cuDNN 8.0.4, and I don’t have any issues. If you don’t want to build from source, maybe you can try the TensorFlow 2.4.0 nightly and see if it works?

Running the above command today after a fresh docker pull makes the bug disappear

version: 2.4.0-dev20200904 … 2020-09-05 11:58:03.649241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/device:GPU:0 with 1494 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0)

Thanks

Addendum: I couldn’t help but notice the time is not my local time, and my usual trick docker -e TZ=Europe/Paris ... doesn’t seem to have an effect.

My tensorflow was working very well. I updated all of my Python packages and after that I am getting the same issue. What should I do to fix it?

I have the same issue today and have found nothing that solves it. Have you solved it yet?