tensorflow: CUDA_ERROR_ILLEGAL_ADDRESS
Issue Type
Bug
Have you reproduced the bug with TF nightly?
No
Source
binary
Tensorflow Version
2.11
Custom Code
No
OS Platform and Distribution
Official docker image via apptainer
Mobile device
No response
Python version
3.8
Bazel version
No response
GCC/Compiler version
No response
CUDA/cuDNN version
11.2/8.1
GPU model and memory
Tesla V100-SXM2-32GB
Current Behaviour?
Using the official docker image, the provided code results in a CUDA_ERROR_ILLEGAL_ADDRESS. Note that while the error is fully consistent (it happens every time, at the first epoch), it is highly sensitive: if a different seed or model size is used, the error does not reproduce.
I have tried enabling memory growth as suggested elsewhere (see the sketch below). However, this only delays the issue to epoch 18.
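For reference, memory growth was enabled roughly as follows. This is a minimal sketch; the exact placement of the snippet in my script is not shown here, but it runs before any GPU memory is allocated.

```python
import tensorflow as tf

# Enable memory growth on every visible GPU so TensorFlow grows its
# allocation on demand instead of reserving (nearly) all GPU memory upfront.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```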
The error does not reproduce on Google Colab or when using the nightly docker image. However, given how sensitive the issue is, it is unclear what exactly this means. The difference on Google Colab could simply be down to the GPU: my own environment uses a Tesla V100-SXM2-32GB, while Colab provides a T4.
Standalone code to reproduce the issue
This uses transformers == 4.26.0 and a synthetic dataset file (based on QQP) that is downloaded automatically. The code is available here: https://gist.github.com/AndreasMadsen/2bf669a3cd4c4a8ba964561b9e72279e
Note: I did attempt to reproduce it on Colab: https://colab.research.google.com/drive/1zVbNNfz1lZ6xgyoZ7dKWnKzj6poM5XWC?usp=sharing
However, the error does not reproduce there.
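For orientation, the training setup matches the configuration printed in the log below (seed = 0, batch_size = 16, model = roberta-base, jit_compile = True, precision = mixed_float16). The following is only a minimal sketch of that setup; the optimizer, learning rate, and the `dataset_train_batched` variable are assumptions for illustration, and the full script is in the gist above.

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Configuration from the log: seed = 0, precision = mixed_float16
tf.keras.utils.set_random_seed(0)
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# model = roberta-base, fine-tuned for sequence classification
model = TFAutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# jit_compile = True enables XLA compilation of the train step.
# No loss is passed, so the model's internal loss is used (labels come from the dataset).
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5), jit_compile=True)

# dataset_train_batched: a tf.data.Dataset of tokenized QQP-style examples, batched with batch_size = 16
# model.fit(dataset_train_batched, verbose=1, epochs=1)  # fails with CUDA_ERROR_ILLEGAL_ADDRESS
```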
Relevant log output
2023-02-06 14:07:37.519351: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-06 14:07:37.759155: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-06 14:07:38.624178: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-02-06 14:07:38.624677: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-02-06 14:07:38.624725: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Configuration:
persistent_dir = /scratch/anmadc/tf_memory_issue
seed = 0
batch_size = 16
model = roberta-base
jit_compile = True
precision = mixed_float16
2023-02-06 14:07:40.255950: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-06 14:07:41.029530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30971 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:af:00.0, compute capability: 7.0
All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.
Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
WARNING:tensorflow:From /localscratch/anmadc.58777029.0/env/lib/python3.10/site-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.
Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089
2023-02-06 14:08:24.250220: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x2db1cb30 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-02-06 14:08:24.264540: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2023-02-06 14:08:25.275879: W tensorflow/compiler/tf2xla/kernels/assert_op.cc:38] Ignoring Assert operator tf_roberta_for_sequence_classification/roberta/embeddings/assert_less/Assert/Assert
2023-02-06 14:08:25.309985: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-02-06 14:08:25.321306: W tensorflow/compiler/tf2xla/kernels/random_ops.cc:57] Warning: Using tf.random.uniform with XLA compilation will ignore seeds; consider using tf.random.stateless_uniform instead if reproducible behavior is desired. tf_roberta_for_sequence_classification/roberta/embeddings/dropout_36/dropout/random_uniform/RandomUniform
2023-02-06 14:09:41.534987: I tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:325] ptxas warning : Registers are spilled to local memory in function 'sort_19'
ptxas warning : Registers are spilled to local memory in function 'sort_17'
2023-02-06 14:09:41.931415: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
2023-02-06 14:09:42.331286: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1159] failed to enqueue async memcpy from device to host: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; host dst: 0x2b36f334aeeb; GPU src: 0x2b385f628600; size: 1=0x1
2023-02-06 14:09:42.331472: I tensorflow/compiler/xla/stream_executor/stream.cc:2485] INTERNAL: Unknown error
2023-02-06 14:09:42.331557: I tensorflow/compiler/xla/stream_executor/stream.cc:2489] [stream=0x1b700120,impl=0x52a9880] INTERNAL: stream did not block host until done; was already in an error state
Traceback (most recent call last):
File "/scratch/anmadc/workspace/economical-roar/reproduce.py", line 150, in <module>
model.fit(dataset_train_batched, verbose=1, epochs=1)
File "/localscratch/anmadc.58777029.0/env/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/localscratch/anmadc.58777029.0/env/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node 'StatefulPartitionedCall' defined at (most recent call last):
File "/scratch/anmadc/workspace/economical-roar/reproduce.py", line 150, in <module>
model.fit(dataset_train_batched, verbose=1, epochs=1)
File "/localscratch/anmadc.58777029.0/env/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/localscratch/anmadc.58777029.0/env/lib/python3.10/site-packages/keras/engine/training.py", line 1650, in fit
tmp_logs = self.train_function(iterator)
File "/localscratch/anmadc.58777029.0/env/lib/python3.10/site-packages/keras/engine/training.py", line 1249, in train_function
return step_function(self, iterator)
File "/localscratch/anmadc.58777029.0/env/lib/python3.10/site-packages/keras/engine/training.py", line 1233, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
Node: 'StatefulPartitionedCall'
Failed to retrieve branch_index value on stream 0x1b700120: stream did not block host until done; was already in an error state.
[[{{node StatefulPartitionedCall}}]] [Op:__inference_train_function_42578]
Hi, @sachinprasadhs
Could you please look into this issue? Thank you!