tensorflow: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered.


This is an urgent issue! It has caused a production problem!

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux (Amazon Sagemaker)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): latest: 2.5
  • Python version: 3.8
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 11.020/8.202
  • GPU model and memory: Tesla T4/13.8 GB

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:

  1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
  2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior

I updated my model, and when I train it I get the following error:

` 2021-07-11T23:08:22.514-04:00 2021-07-12 03:08:22.053193: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla T4" frequency: 1590 num_cores: 40 environment { key: "architecture" value: "7.5" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 4194304 shared_memory_size_per_multiprocessor: 65536 memory_size: 14474280960 bandwidth: 320064000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }

2021-07-11T23:08:22.514-04:00 2021-07-12 03:08:22.053330: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla T4" frequency: 1590 num_cores: 40 environment { key: "architecture" value: "7.5" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 4194304 shared_memory_size_per_multiprocessor: 65536 memory_size: 14474280960 bandwidth: 320064000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }

2021-07-11T23:08:32.516-04:00 2021-07-12 03:08:32.060022: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8

2021-07-11T23:08:34.517-04:00 2021-07-12 03:08:34.183639: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8202

2021-07-11T23:08:36.517-04:00 2021-07-12 03:08:36.256607: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11

2021-07-11T23:08:38.518-04:00 2021-07-12 03:08:38.161603: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11

2021-07-11T23:08:50.608-04:00 2021-07-12 03:08:50.058288: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

2021-07-12 03:08:50.058288: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered`

Describe the expected behavior

Contributing

  • Do you want to contribute a PR? (yes/no):
  • Briefly describe your candidate solution(if contributing):

Standalone code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.

Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

LOGS: logs.txt

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 29 (5 by maintainers)

Most upvoted comments

@sanjoy The same thing happened to us as well. I launched with CUDA_LAUNCH_BLOCKING, but then no errors occurred. FYI, we started running into this problem while trying to use the cuDNN LSTM/GRU layer in our model. Earlier the input was left-padded, which is why the cuDNN kernel was not being used; we just switched the padding and this issue started happening.

Config:

  • OS: Ubuntu 20.04.3 LTS
  • Docker: 20.10.8
  • Image: tensorflow/tensorflow:2.6.0-gpu-jupyter
  • GPU: Tesla P40 (Azure VM)
  • Drivers: 470.57.02
  • CUDA: 11.4
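For anyone trying to confirm whether their batches hit the same code path: the Keras documentation lists strictly right-padded inputs as one of the conditions for using the cuDNN kernel with masking, which matches the padding switch described in this comment. Below is a minimal sketch of such a check in plain Python. The function name `is_right_padded` is hypothetical and is not part of TensorFlow; it only illustrates the idea on a list-of-lists mask.

```python
def is_right_padded(mask):
    """Return True if every row has all valid steps before any padded step.

    `mask` is a list of rows; each row is a sequence of truthy (valid)
    or falsy (padded) entries, e.g. the output of `sequence != 0`.
    Hypothetical helper for illustration only, not a TensorFlow API.
    """
    for row in mask:
        seen_padding = False
        for valid in row:
            if not valid:
                seen_padding = True
            elif seen_padding:
                # A valid step after a padded one: left- or mid-padded row.
                return False
    return True


right_padded = [[1, 1, 0, 0], [1, 1, 1, 1]]
left_padded = [[0, 0, 1, 1], [1, 1, 1, 1]]

print(is_right_padded(right_padded))  # True
print(is_right_padded(left_padded))   # False
```

If the check returns True for your masked batches on an affected cuDNN version, you are likely on the cuDNN path discussed in this thread; a False result suggests the generic (non-cuDNN) kernel is being used instead.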

Hi all, the original problem that caused the cuDNN implementation of masked LSTM/GRU to fail with CUDA_ERROR_ILLEGAL_ADDRESS on TF 2.5+ has been fixed in cuDNN 8.9.2. If you upgrade cuDNN to this version, it should be fine; see #60192. You can keep using the official TF build even though it is built for cuDNN 8.2, because CUDA/cuDNN are backward compatible within the same major version.
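To check which cuDNN version is actually loaded at runtime, one option is to read it from the "Loaded cuDNN version ..." line that TensorFlow prints, as in the log at the top of this issue. The sketch below assumes the cuDNN 8.x integer encoding (major * 1000 + minor * 100 + patch), so 8202 means 8.2.2. The helper names are hypothetical and not part of TensorFlow or cuDNN.

```python
import re

# First cuDNN release reported in this thread to contain the fix.
FIXED_IN = (8, 9, 2)


def decode_cudnn_version(code):
    """Decode a cuDNN 8.x integer version, e.g. 8202 -> (8, 2, 2).

    Assumes the 8.x scheme: major * 1000 + minor * 100 + patch.
    """
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return (major, minor, patch)


def loaded_cudnn_version(log_line):
    """Extract the version from a 'Loaded cuDNN version NNNN' log line."""
    match = re.search(r"Loaded cuDNN version (\d+)", log_line)
    return decode_cudnn_version(int(match.group(1))) if match else None


def has_masked_rnn_fix(version):
    """True if the decoded version is at least 8.9.2."""
    return version >= FIXED_IN


line = ("I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] "
        "Loaded cuDNN version 8202")
version = loaded_cudnn_version(line)
print(version)                      # (8, 2, 2)
print(has_masked_rnn_fix(version))  # False
```

For the log in the original report this gives 8.2.2, which is older than 8.9.2, consistent with the failure described there.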