tensorflow: CUDA illegal memory access error when running distributed mixed precision
Whenever I try to train a model using MirroredStrategy and mixed precision, I get the following error at an unpredictable point during training:
./tensorflow/core/kernels/conv_2d_gpu.h:970] Non-OK-status: GpuLaunchKernel( SwapDimension1And2InTensor3UsingTiles<T, kNumThreads, kTileSize, kTileSize, conjugate>, total_tiles_count, kNumThreads, 0, d.stream(), input, input_dims, output) status: Internal: an illegal memory access was encountered
2020-06-25 00:45:27.788127: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-06-25 00:45:27.788208: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
Unfortunately, I don’t have a simple example to reproduce this and can’t include my entire code, but maybe other people are having similar issues and can produce a better example.
I’m running TensorFlow 2.2.0 on Ubuntu 18.04 with CUDA 10.1.243 and cuDNN 7.6.5, using two RTX 2080 Ti cards. I get the same error on a V100.
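Roughly, the setup is just the standard TF 2.2 pattern sketched below (a simplified sketch with a toy placeholder model, not my actual training code):

```python
import tensorflow as tf
from tensorflow.keras.mixed_precision import experimental as mixed_precision

# Enable the global mixed_float16 policy (TF 2.2 experimental API).
mixed_precision.set_policy(mixed_precision.Policy('mixed_float16'))

# Build and compile the model inside a MirroredStrategy scope so that
# variables are mirrored across the available GPUs.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(64, 64, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        # Keep the final layer in float32 for numeric stability under mixed precision.
        tf.keras.layers.Dense(10, dtype='float32'),
    ])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```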
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 23 (5 by maintainers)
Ok guys, I think I’ve found a solution, which seems to work for me.
I followed the instructions here: https://github.com/NVIDIA/framework-determinism - I enabled
os.environ['TF_CUDNN_DETERMINISTIC']='1'
Then I fixed all the random seeds:
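Roughly like this (a sketch, not my exact snippet; the seed value is arbitrary, and both environment-variable spellings are shown since the two sets of instructions I mention below differ a bit):

```python
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary value; any fixed seed works

# Determinism switch: TF_CUDNN_DETERMINISTIC is what I used; the NVIDIA
# framework-determinism repo describes TF_DETERMINISTIC_OPS for newer TF versions.
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
# os.environ['TF_DETERMINISTIC_OPS'] = '1'

# Fix all the random seeds: Python, NumPy, and TensorFlow.
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
```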
The model has been running without a hitch for many epochs now. It seems that the non-determinism of some operations might be causing these multi-GPU issues. Keep in mind that I don’t fully understand WHY this works; I just know that it worked for a similar problem. Do let me know if this helps.
Also, keep in mind that the instructions here: https://github.com/NVIDIA/framework-determinism are a bit different from the ones I originally used (here: https://stackoverflow.com/questions/50744565/how-to-handle-non-determinism-when-training-on-a-gpu/62712389#62712389). It might be worth trying both sets.
@gowthamkpr, I have a reproducible example. This will crash if I run:
TF_FORCE_GPU_ALLOW_GROWTH=true python fail.py
where fail.py is as below. Interestingly, it does not crash if I don’t set TF_FORCE_GPU_ALLOW_GROWTH to true. I’m running TensorFlow 2.2.0 on Ubuntu 18.04 with CUDA 10.1.243 and cuDNN 7.6.5, using two RTX 2080 Ti cards. I get the same error on a V100. This only happens if I enable mixed precision and MirroredStrategy. The error is as follows: