tensorflow: Failed to synchronize the stop event
System information
-
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
-
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
-
TensorFlow installed from (source or binary): source
-
TensorFlow version (use command below): b’v1.4.0-0-gd752244’ 1.4.0
-
Python version: 3.5.2
-
Bazel version (if compiling from source): 0.7.0
-
GCC/Compiler version (if compiling from source): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609
-
CUDA/cuDNN version: 9.0/7.0
-
GPU model and memory: Tesla V100-SXM2-16GB
-
Exact command to reproduce:
git clone https://github.com/ljanyst/image-segmentation-fcn.git
cd image-segmentation-fcn
wget http://www.cvlibs.net/download.php?file=data_road.zip
unzip data_road.zip
./train.py --data-dir data_road
Describe the problem
It seems like I am hitting some sort of a CUDA/cuDNN synchronization/race issue. Please see the snippet in the next section for the exact error message. The problem only happens with the KITTI dataset. The exact same TensorFlow code works fine for the Cityscapes dataset. Also, the problem only happens on Tesla V100. I tested the same exact software configuration on Tesla K80 and GeForce GTX1080 Ti as well, and things work fine.
Source code / logs
2017-11-08 12:24:52.838039: E tensorflow/stream_executor/cuda/cuda_driver.cc:1080] failed to synchronize the stop event: CUDA_ERROR_ILLEGAL_ADDRESS
2017-11-08 12:24:52.838090: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Internal: error destroying CUDA event in context 0x51f18f0: CUDA_ERROR_ILLEGAL_ADDRESS
2017-11-08 12:24:52.838106: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Internal: error destroying CUDA event in context 0x51f18f0: CUDA_ERROR_ILLEGAL_ADDRESS
2017-11-08 12:24:52.838137: F tensorflow/stream_executor/cuda/cuda_dnn.cc:3218] failed to set stream for cudnn handle: CUDNN_STATUS_MAPPING_ERROR
zsh: abort (core dumped) ./train.py --data-dir data_road
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 5
- Comments: 34 (6 by maintainers)
@zheng-xq @ljanyst We have a repro and a fix. Roll out is planned in cuDNN 7.0.5 mid-December.
I still encounter the same problem as others reported, with CUDA 9.0 and cudnn 7.0.2.
If I tried cudnn 7.1.2, I got a different error:
These happen for multiple versions of tensorflow I tried, from 1.5 to 1.7
Came here to report the exact same thing with our Volta, using the Tensorflow container on NVIDIA GPU Cloud. We will be happy to test the fix with cuDNN 7.0.5 and follow-up. Please let us know if there are any other updates on this issue or if more information is needed.
The synchronization error is only what finds out the issue. The root cause is some GPU kernels had an illegal address access.
If someone wants to root cause this, first it is needed to find the offending kernel. In our past experience, it could be either a kernel bug, or a degenerate data entry.
I have this problem on CUDA 10.0. I’m using TF 1.10.0, keras 2.2.2, Window 10, GPU Nvidia mx150. Some NNs work with no problem, some fail.
Things work for me too now. Thanks @juliebernauer !
@juliebernauer it works after updating. Thank you a lot!