tensorflow: TF 2.4.0 build from source gets InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid.

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|

I am running Ubuntu 20.04 and followed the "Build from source" instructions.

After I compiled TF inside the container, I committed and saved it.

I run the following commands to load the image and start Jupyter Notebook:

docker run --gpus all --ipc="host" -it -w /tensorflow -v $PWD:/mnt -p 8888:8888 -e HOST_PERMS="$(id -u):$(id -g)" tensorflow/tensorflow:from-src2 bash

export LD_LIBRARY_PATH="/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/include/x64_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64"
pip install jupyter
pip install jupyter_http_over_ws
jupyter serverextension enable --py jupyter_http_over_ws
jupyter notebook --no-browser --notebook-dir=/mnt/notebooks --ip=0.0.0.0 --debug --NotebookApp.allow_origin='https://www.example.com' --NotebookApp.allow_remote_access=True --allow-root
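The LD_LIBRARY_PATH above lists several directories, and it can be worth confirming that each one actually exists inside the container. The following is a minimal sketch of such a check; the helper name missing_library_dirs is hypothetical and not part of any TensorFlow or CUDA tooling.

```python
import os


def missing_library_dirs(ld_library_path):
    """Return the entries of a colon-separated path list that are not existing directories."""
    # Skip empty entries, which appear when the variable has a leading or trailing colon.
    entries = [p for p in ld_library_path.split(":") if p]
    return [p for p in entries if not os.path.isdir(p)]


if __name__ == "__main__":
    # Inspect the value that the current shell exported (empty string if unset).
    for entry in missing_library_dirs(os.environ.get("LD_LIBRARY_PATH", "")):
        print("not a directory:", entry)
```

Run inside the container after the export, it prints one line per entry that does not resolve to a directory. A missing entry is not necessarily the cause of the error reported here; it only narrows down what the dynamic loader can see.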

This gets me a running notebook server. I then try to run the tensorflow-tutorials/text_classification.ipynb notebook.

When I run the following cell:

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

I get this in the Jupyter notebook:

TypeError: Could not build a TypeSpec for ['aclImdb/train/neg/4932_4.txt', [many pages of similar text follow]…

Then I get the following:

with type list

During handling of the above exception, another exception occurred:

InternalError                             Traceback (most recent call last)
<ipython-input-10-09c13e5c92d7> in <module>
      7     validation_split=0.2,
      8     subset='training',
----> 9     seed=seed)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/preprocessing/text_dataset.py in text_dataset_from_directory(directory, labels, label_mode, class_names, batch_size, max_length, shuffle, seed, validation_split, subset, follow_links)
    159       label_mode=label_mode,
    160       num_classes=len(class_names),
--> 161       max_length=max_length)
    162   if shuffle:
    163     # Shuffle locally at each iteration

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/preprocessing/text_dataset.py in paths_and_labels_to_dataset(file_paths, labels, label_mode, num_classes, max_length)
    175                                 max_length):
    176   """Constructs a dataset of text strings and labels."""
--> 177   path_ds = dataset_ops.Dataset.from_tensor_slices(file_paths)
    178   string_ds = path_ds.map(
    179       lambda x: path_to_string_content(x, max_length))

/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py in from_tensor_slices(tensors)
    680       Dataset: A `Dataset`.
    681     """
--> 682     return TensorSliceDataset(tensors)
    683
    684   class _GeneratorState(object):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py in __init__(self, element)
   2999   def __init__(self, element):
   3000     """See `Dataset.from_tensor_slices()` for details."""
-> 3001     element = structure.normalize_element(element)
   3002     batched_spec = structure.type_spec_from_value(element)
   3003     self._tensors = structure.to_batched_tensor_list(batched_spec, element)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/util/structure.py in normalize_element(element)
     96         # the value. As a fallback try converting the value to a tensor.
     97         normalized_components.append(
---> 98             ops.convert_to_tensor(t, name="component_%d" % i))
     99       else:
    100         if isinstance(spec, sparse_tensor.SparseTensorSpec):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
   1524
   1525     if ret is None:
-> 1526       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1527
   1528     if ret is NotImplemented:

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    337                                          as_ref=False):
    338   _ = as_ref
--> 339   return constant(v, dtype=dtype, name=name)
    340
    341

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
    263   """
    264   return _constant_impl(value, dtype, shape, name, verify_shape=False,
--> 265                         allow_broadcast=True)
    266
    267

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    274       with trace.Trace("tf.constant"):
    275         return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
--> 276     return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    277
    278   g = ops.get_default_graph()

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    299 def _constant_eager_impl(ctx, value, dtype, shape, verify_shape):
    300   """Implementation of eager constant."""
--> 301   t = convert_to_eager_tensor(value, ctx, dtype)
    302   if shape is None:
    303     return t

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
     95     except AttributeError:
     96       dtype = dtypes.as_dtype(dtype).as_datatype_enum
---> 97   ctx.ensure_initialized()
     98   return ops.EagerTensor(value, ctx.device_name, dtype)
     99

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py in ensure_initialized(self)
    547         if self._use_tfrt is not None:
    548           pywrap_tfe.TFE_ContextOptionsSetTfrt(opts, self._use_tfrt)
--> 549         context_handle = pywrap_tfe.TFE_NewContext(opts)
    550       finally:
    551         pywrap_tfe.TFE_DeleteContextOptions(opts)

InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid.
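The last frames of the traceback suggest that the failure happens while the eager context is being initialized (ctx.ensure_initialized()), not in the dataset code itself. If so, a much smaller script should hit the same error without the notebook or the IMDB data. The following is a minimal sketch under that assumption; the helper name try_gpu_init is hypothetical, and it assumes a TensorFlow 2.x install where tf.config.list_physical_devices is available.

```python
def try_gpu_init():
    """Try to initialize the TensorFlow eager context and return a short status string."""
    try:
        import tensorflow as tf
    except ImportError:
        return "tensorflow is not installed"

    # Listing physical devices does not create the eager context yet.
    gpus = tf.config.list_physical_devices("GPU")
    try:
        # Creating any eager tensor forces the context (and the CUDA runtime) to initialize.
        tf.constant(0)
    except tf.errors.InternalError as err:
        return "context initialization failed: %s" % err
    return "ok, %d GPU(s) visible" % len(gpus)


print(try_gpu_init())
```

If this reproduces the "device kernel image is invalid" message, the text classification tutorial can be ruled out as a factor.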

This is from /tensorflow_src/.bazelrc:

release_gpu_common --action_env=TF_CUDA_COMPUTE_CAPABILITIES="sm_35,sm_37,sm_52,sm_60,sm_61,compute_70"

I believe the GeForce GTX 1070 has compute capability 6.1 (sm_61).
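To make the reasoning about that list explicit, here is a minimal sketch of how a compute-capability list can be checked against a device. It assumes the usual CUDA compatibility rules: an sm_XY entry is a compiled binary that runs on devices with the same major version and an equal or higher minor version, and a compute_XY entry is PTX that can be JIT-compiled for any device of capability X.Y or newer. The function names are hypothetical and are not part of TensorFlow or Bazel.

```python
def parse_capability(token):
    """Split a token such as 'sm_61' or 'compute_70' into (kind, (major, minor))."""
    kind, _, digits = token.strip().partition("_")
    if kind not in ("sm", "compute") or not digits.isdigit() or len(digits) < 2:
        raise ValueError("unrecognized capability token: %r" % token)
    # The last digit is the minor version; everything before it is the major version.
    return kind, (int(digits[:-1]), int(digits[-1]))


def device_is_covered(build_list, device_cc):
    """Return True if a device with capability (major, minor) can run code built for build_list."""
    major, minor = device_cc
    for token in build_list.split(","):
        kind, (t_major, t_minor) = parse_capability(token)
        if kind == "sm":
            # Binary code: same major version, minor version not newer than the device.
            if t_major == major and t_minor <= minor:
                return True
        elif (t_major, t_minor) <= (major, minor):
            # PTX: forward-compatible with any equal or newer capability.
            return True
    return False


BUILD_LIST = "sm_35,sm_37,sm_52,sm_60,sm_61,compute_70"
print(device_is_covered(BUILD_LIST, (6, 1)))
```

Under these assumptions a 6.1 device is covered by the list quoted above, so it may be worth confirming that the build actually used this configuration rather than a different set of capabilities.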

Some software versions:

gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Python 3.6.9
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243


I will attach the full .bazelrc file and the piped output of the build from source once I figure out how to do that. I'm on an iPad right now: I can copy and paste, but I can't work out how to copy a file to the iPad and then upload it to a GitHub issue…

About this issue

Original URL
State: closed
Created 4 years ago
Comments: 30 (3 by maintainers)

Commits related to this issue

Revert "update to tf 2.3" This reverts commit 43e9ccd79deb3b533d9cc50cd7a7febd42bc34a8. Reason for revert: TensorFlow 2.3 don't work on Linux GPU https://github.com/tensorflow/tensorflow/issues/4197... — committed to deepjavalibrary/djl by stu1130 4 years ago

Most upvoted comments

@RayLucchesi “git checkout v2.3.0”, for example - see https://www.tensorflow.org/install/source#download_the_tensorflow_source_code.

MikhailStartsev on Aug 6, 2020