tensorflow: TF 2.4.0 build from source gets InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid.
Output of nvidia-smi, run from inside the container:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: ERR!     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   33C    P8     6W / 180W |    193MiB /  8117MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
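The `CUDA Version: ERR!` field suggests that nvidia-smi could not obtain a version from the CUDA driver library inside the container. One way to check this directly is the minimal sketch below. It assumes the driver library is visible under the name `libcuda.so.1` (the usual name on Linux, but an assumption about this container), and the helper names are hypothetical.

```python
import ctypes


def decode_cuda_version(raw):
    """Split the driver API's integer version (1000*major + 10*minor)."""
    return raw // 1000, (raw % 1000) // 10


def query_driver_version(libname="libcuda.so.1"):
    """Return (major, minor) reported by the CUDA driver, or None on failure.

    libname is an assumption; adjust it if the library lives elsewhere.
    """
    try:
        libcuda = ctypes.CDLL(libname)
    except OSError:
        return None  # library not visible in this environment
    if libcuda.cuInit(0) != 0:  # 0 is CUDA_SUCCESS
        return None
    version = ctypes.c_int(0)
    if libcuda.cuDriverGetVersion(ctypes.byref(version)) != 0:
        return None
    return decode_cuda_version(version.value)


print(query_driver_version())
```

If this prints `None` inside the container but a version tuple on the host, the driver library is probably not being mounted into the container correctly.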
I am running Ubuntu 20.04. I followed the "Build from source" instructions.
After I compiled TF inside the container, I committed and saved it.
I run the following commands to load the image and start Jupyter Notebook:

docker run --gpus all --ipc="host" -it -w /tensorflow -v $PWD:/mnt -p 8888:8888 -e HOST_PERMS="$(id -u):$(id -g)" tensorflow/tensorflow:from-src2 bash
export LD_LIBRARY_PATH="/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/include/x64_64-linux-gnu:/usr/local/nvidia/lib:/usr/local/nvidia/lib64"
pip install jupyter
pip install jupyter_http_over_ws
jupyter serverextension enable --py jupyter_http_over_ws
jupyter notebook --no-browser --notebook-dir=/mnt/notebooks --ip=0.0.0.0 --debug --NotebookApp.allow_origin='https://www.example.com' --NotebookApp.allow_remote_access=True --allow-root

This gets me a running notebook server. I then try to run the tensorflow-tutorials/text_classification.ipynb notebook.
When I run this cell:

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)
In the Jupyter notebook, I get:

TypeError: Could not build a TypeSpec for ['aclImdb/train/neg/4932_4.txt', [there follows many pages of text similar to the above]… with type list

Then I get the following:

During handling of the above exception, another exception occurred:
InternalError                             Traceback (most recent call last)
<ipython-input-10-09c13e5c92d7> in <module>
      7     validation_split=0.2,
      8     subset='training',
----> 9     seed=seed)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/preprocessing/text_dataset.py in text_dataset_from_directory(directory, labels, label_mode, class_names, batch_size, max_length, shuffle, seed, validation_split, subset, follow_links)
    159       label_mode=label_mode,
    160       num_classes=len(class_names),
--> 161       max_length=max_length)
    162   if shuffle:
    163     # Shuffle locally at each iteration

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/preprocessing/text_dataset.py in paths_and_labels_to_dataset(file_paths, labels, label_mode, num_classes, max_length)
    175                                 max_length):
    176   """Constructs a dataset of text strings and labels."""
--> 177   path_ds = dataset_ops.Dataset.from_tensor_slices(file_paths)
    178   string_ds = path_ds.map(
    179       lambda x: path_to_string_content(x, max_length))

/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py in from_tensor_slices(tensors)
    680       Dataset: A `Dataset`.
    681     """
--> 682     return TensorSliceDataset(tensors)
    683
    684   class _GeneratorState(object):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py in __init__(self, element)
   2999   def __init__(self, element):
   3000     """See `Dataset.from_tensor_slices()` for details."""
-> 3001     element = structure.normalize_element(element)
   3002     batched_spec = structure.type_spec_from_value(element)
   3003     self._tensors = structure.to_batched_tensor_list(batched_spec, element)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/util/structure.py in normalize_element(element)
     96         # the value. As a fallback try converting the value to a tensor.
     97         normalized_components.append(
---> 98             ops.convert_to_tensor(t, name="component_%d" % i))
     99       else:
    100         if isinstance(spec, sparse_tensor.SparseTensorSpec):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, dtype_hint, ctx, accepted_result_types)
   1524
   1525     if ret is None:
-> 1526       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1527
   1528     if ret is NotImplemented:

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    337                                          as_ref=False):
    338   _ = as_ref
--> 339   return constant(v, dtype=dtype, name=name)
    340
    341

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
    263   """
    264   return _constant_impl(value, dtype, shape, name, verify_shape=False,
--> 265                         allow_broadcast=True)
    266
    267

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    274       with trace.Trace("tf.constant"):
    275         return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
--> 276     return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    277
    278   g = ops.get_default_graph()

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
    299 def _constant_eager_impl(ctx, value, dtype, shape, verify_shape):
    300   """Implementation of eager constant."""
--> 301   t = convert_to_eager_tensor(value, ctx, dtype)
    302   if shape is None:
    303     return t

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
     95     except AttributeError:
     96       dtype = dtypes.as_dtype(dtype).as_datatype_enum
---> 97   ctx.ensure_initialized()
     98   return ops.EagerTensor(value, ctx.device_name, dtype)
     99

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py in ensure_initialized(self)
    547         if self._use_tfrt is not None:
    548           pywrap_tfe.TFE_ContextOptionsSetTfrt(opts, self._use_tfrt)
--> 549         context_handle = pywrap_tfe.TFE_NewContext(opts)
    550       finally:
    551         pywrap_tfe.TFE_DeleteContextOptions(opts)
InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid.
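Since the traceback ends in `ctx.ensure_initialized()`, the failure is probably not specific to `text_dataset_from_directory`: creating any eager tensor should hit the same code path. A minimal sketch of a standalone check that can be run outside the notebook (the function name `probe_gpu_init` and the report keys are hypothetical):

```python
def probe_gpu_init():
    """Try to initialize TensorFlow's eager context and report what happened."""
    report = {"imported": False, "gpus": None, "error": None}
    try:
        import tensorflow as tf
    except Exception as exc:  # a broken install can raise more than ImportError
        report["error"] = repr(exc)
        return report
    report["imported"] = True
    try:
        # Enumerate the GPUs TensorFlow can see.
        report["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
        # Creating any eager tensor calls ctx.ensure_initialized(), the frame
        # where the traceback above ends.
        tf.constant(1)
    except Exception as exc:
        report["error"] = repr(exc)
    return report


print(probe_gpu_init())
```

If the `error` entry shows the same `InternalError`, the problem can be reproduced with a plain `python` session in the container, without Jupyter or the tutorial data.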
This is from /tensorflow_src/.bazelrc:

release_gpu_common --action_env=TF_CUDA_COMPUTE_CAPABILITIES="sm_35,sm_37,sm_52,sm_60,sm_61,compute_70"

I believe the GeForce GTX 1070 has compute capability 6.1 (sm_61).
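To sanity-check that belief against the capability list, here is a minimal sketch. It assumes the usual CUDA compatibility rules: an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version, and `compute_XY` PTX can be JIT-compiled for any equal or newer capability. The function name `covers` is hypothetical.

```python
def covers(capabilities, device):
    """Check whether a TF_CUDA_COMPUTE_CAPABILITIES list serves a device.

    capabilities: comma-separated string such as "sm_60,sm_61,compute_70".
    device: (major, minor) compute capability of the GPU, e.g. (6, 1).
    """
    major, minor = device
    for entry in capabilities.split(","):
        kind, _, digits = entry.strip().partition("_")
        if len(digits) < 2 or not digits.isdigit():
            continue  # skip anything that is not sm_XY / compute_XY
        e_major, e_minor = int(digits[:-1]), int(digits[-1])
        if kind == "sm":
            # Binary code: same major version, built minor <= device minor.
            if e_major == major and e_minor <= minor:
                return True
        elif kind == "compute":
            # PTX: usable on any equal or newer capability.
            if (e_major, e_minor) <= (major, minor):
                return True
    return False


bazelrc_list = "sm_35,sm_37,sm_52,sm_60,sm_61,compute_70"
print(covers(bazelrc_list, (6, 1)))
```

Under these rules the list covers a compute capability 6.1 device. That points away from the capability list as the cause, provided the installed build actually used this configuration.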
Some software versions:

gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Python 3.6.9
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
I will attach the full .bazelrc file and a piped output of the build from source once I figure out how to do that. I'm on an iPad right now; I can copy and paste, but I can't work out how to copy a file to the iPad and then upload it to a GitHub issue.
About this issue
- State: closed
- Created 4 years ago
- Comments: 30 (3 by maintainers)
@RayLucchesi "git checkout v2.3.0", for example. See https://www.tensorflow.org/install/source#download_the_tensorflow_source_code.