tensorflow: CUDA_ERROR_ILLEGAL_ADDRESS

Issue Type

Bug

Have you reproduced the bug with TF nightly?

No

Source

binary

Tensorflow Version

2.11

Custom Code

No

OS Platform and Distribution

Official docker image via apptainer

Mobile device

No response

Python version

3.8

Bazel version

No response

GCC/Compiler version

No response

CUDA/cuDNN version

11.2/8.1

GPU model and memory

Tesla V100-SXM2-32GB

Current Behaviour?

Using the official Docker image, the provided code results in a CUDA_ERROR_ILLEGAL_ADDRESS. Note that while the error is fully consistent (it happens every time, at the first epoch), it is highly sensitive: if a different seed or model size is used, the error will not reproduce.

I have tried enabling memory growth, as suggested elsewhere. However, this only delays the issue until epoch 18.
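
For reference, memory growth was enabled with the standard TensorFlow API, along these lines (a minimal sketch, not the exact code from the reproduction script):

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving all of it up front.
# This must run before the GPU is initialized, i.e. before any ops are placed on it.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```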

The error does not reproduce on Google Colab or when using the nightly Docker image. However, given how sensitive the issue is, it is unclear what exactly this means. The Colab result could simply be due to a different GPU: in my environment the GPU is a Tesla V100-SXM2-32GB, not a T4.
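
For completeness, which GPU a given environment exposes can be confirmed with a small generic check like the following (not part of the reproduction script):

```python
import tensorflow as tf

# Print the detected GPU(s) so that runs on Colab (T4) and on the cluster
# (Tesla V100-SXM2-32GB) can be compared directly.
for device in tf.config.list_physical_devices("GPU"):
    details = tf.config.experimental.get_device_details(device)
    print(device.name, details.get("device_name"), details.get("compute_capability"))
```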

Standalone code to reproduce the issue

This uses transformers == 4.26.0 and a synthetic dataset file (based on QQP) that is downloaded automatically. The code is available here: https://gist.github.com/AndreasMadsen/2bf669a3cd4c4a8ba964561b9e72279e

Note: I also attempted to reproduce the issue on Colab: https://colab.research.google.com/drive/1zVbNNfz1lZ6xgyoZ7dKWnKzj6poM5XWC?usp=sharing

However, the error does not reproduce there.
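
For readers who do not want to open the gist, the setup is roughly the following condensed sketch. It mirrors the configuration printed in the log below (seed=0, batch_size=16, model=roberta-base, jit_compile=True, precision=mixed_float16), but the dataset here is a trivial stand-in and the optimizer settings are simplified; the gist remains the authoritative reproduction.

```python
import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Configuration mirroring the log below.
tf.keras.utils.set_random_seed(0)
tf.keras.mixed_precision.set_global_policy("mixed_float16")

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = TFAutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Trivial stand-in for the synthetic QQP-based dataset used in the gist.
texts = ["is this question a duplicate of that question?"] * 64
enc = tokenizer(texts, padding="max_length", max_length=128,
                truncation=True, return_tensors="np")
features = dict(enc)
features["labels"] = np.array([0, 1] * 32)
dataset_train_batched = tf.data.Dataset.from_tensor_slices(features).batch(16)

# XLA-compile the training step; the model's built-in loss is used.
model.compile(optimizer=tf.keras.optimizers.Adam(), jit_compile=True)
model.fit(dataset_train_batched, verbose=1, epochs=1)
```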

Relevant log output

2023-02-06 14:07:37.519351: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-06 14:07:37.759155: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-06 14:07:38.624178: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-02-06 14:07:38.624677: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-02-06 14:07:38.624725: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Configuration:
  persistent_dir = /scratch/anmadc/tf_memory_issue
  seed = 0
  batch_size = 16
  model = roberta-base
  jit_compile = True
  precision = mixed_float16

2023-02-06 14:07:40.255950: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-06 14:07:41.029530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30971 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:af:00.0, compute capability: 7.0
All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
WARNING:tensorflow:From /localscratch/anmadc.58777029.0/env/lib/python3.10/site-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.
Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089
2023-02-06 14:08:24.250220: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x2db1cb30 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-02-06 14:08:24.264540: I tensorflow/compiler/xla/service/service.cc:181]   StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2023-02-06 14:08:25.275879: W tensorflow/compiler/tf2xla/kernels/assert_op.cc:38] Ignoring Assert operator tf_roberta_for_sequence_classification/roberta/embeddings/assert_less/Assert/Assert
2023-02-06 14:08:25.309985: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-02-06 14:08:25.321306: W tensorflow/compiler/tf2xla/kernels/random_ops.cc:57] Warning: Using tf.random.uniform with XLA compilation will ignore seeds; consider using tf.random.stateless_uniform instead if reproducible behavior is desired. tf_roberta_for_sequence_classification/roberta/embeddings/dropout_36/dropout/random_uniform/RandomUniform
2023-02-06 14:09:41.534987: I tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:325] ptxas warning : Registers are spilled to local memory in function 'sort_19'
ptxas warning : Registers are spilled to local memory in function 'sort_17'

2023-02-06 14:09:41.931415: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
2023-02-06 14:09:42.331286: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1159] failed to enqueue async memcpy from device to host: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; host dst: 0x2b36f334aeeb; GPU src: 0x2b385f628600; size: 1=0x1
2023-02-06 14:09:42.331472: I tensorflow/compiler/xla/stream_executor/stream.cc:2485] INTERNAL: Unknown error
2023-02-06 14:09:42.331557: I tensorflow/compiler/xla/stream_executor/stream.cc:2489] [stream=0x1b700120,impl=0x52a9880] INTERNAL: stream did not block host until done; was already in an error state
Traceback (most recent call last):
  File "/scratch/anmadc/workspace/economical-roar/reproduce.py", line 150, in <module>
    model.fit(dataset_train_batched, verbose=1, epochs=1)
  File "/localscratch/anmadc.58777029.0/env/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/localscratch/anmadc.58777029.0/env/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Detected at node 'StatefulPartitionedCall' defined at (most recent call last):
    File "/scratch/anmadc/workspace/economical-roar/reproduce.py", line 150, in <module>
      model.fit(dataset_train_batched, verbose=1, epochs=1)
    File "/localscratch/anmadc.58777029.0/env/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/localscratch/anmadc.58777029.0/env/lib/python3.10/site-packages/keras/engine/training.py", line 1650, in fit
      tmp_logs = self.train_function(iterator)
    File "/localscratch/anmadc.58777029.0/env/lib/python3.10/site-packages/keras/engine/training.py", line 1249, in train_function
      return step_function(self, iterator)
    File "/localscratch/anmadc.58777029.0/env/lib/python3.10/site-packages/keras/engine/training.py", line 1233, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
Node: 'StatefulPartitionedCall'
Failed to retrieve branch_index value on stream 0x1b700120: stream did not block host until done; was already in an error state.
	 [[{{node StatefulPartitionedCall}}]] [Op:__inference_train_function_42578]

About this issue

  • State: closed
  • Created a year ago
  • Comments: 17 (16 by maintainers)

Most upvoted comments

Hi, @sachinprasadhs

Could you please look into this issue? Thank you!