tensorflow: Failed to synchronize the stop event

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
TensorFlow installed from (source or binary): source
TensorFlow version (use command below): b'v1.4.0-0-gd752244' 1.4.0
Python version: 3.5.2
Bazel version (if compiling from source): 0.7.0
GCC/Compiler version (if compiling from source): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609
CUDA/cuDNN version: 9.0/7.0
GPU model and memory: Tesla V100-SXM2-16GB
Exact command to reproduce:

git clone https://github.com/ljanyst/image-segmentation-fcn.git
cd image-segmentation-fcn
wget http://www.cvlibs.net/download.php?file=data_road.zip
unzip data_road.zip
./train.py --data-dir data_road

Describe the problem

It looks like I am hitting some sort of CUDA/cuDNN synchronization or race issue. Please see the snippet in the next section for the exact error message. The problem only happens with the KITTI dataset; the exact same TensorFlow code works fine with the Cityscapes dataset. It also only happens on a Tesla V100: I tested the exact same software configuration on a Tesla K80 and a GeForce GTX 1080 Ti, and things work fine there.
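
Because the crash shows up with only one dataset, one way to narrow it down is to run each input in its own process and note which ones make the process die. The sketch below is only an illustration under assumptions: `find_failing_inputs` and the `make_cmd` callback are hypothetical names, and `train.py` in this report is not known to accept a single-sample option, so the command line would have to be adapted.

```python
import subprocess
import sys


def find_failing_inputs(items, make_cmd, timeout=None):
    """Run one subprocess per item and return the items whose process failed.

    make_cmd(item) must return the argv list to execute for that item.
    A process killed by a signal (for example SIGABRT after a fatal
    TensorFlow log line) reports a negative return code on POSIX, so any
    nonzero code is treated as a failure.
    """
    failing = []
    for item in items:
        result = subprocess.run(make_cmd(item), timeout=timeout)
        if result.returncode != 0:
            failing.append(item)
    return failing


if __name__ == "__main__":
    # Stand-in command (hypothetical): exits with status 1 for the item "bad".
    def demo_cmd(item):
        code = "import sys; sys.exit(1 if sys.argv[1] == 'bad' else 0)"
        return [sys.executable, "-c", code, item]

    print(find_failing_inputs(["a", "bad", "c"], demo_cmd))
```

Running each item in a separate process matters here because a fatal error aborts the whole interpreter, so an in-process `try`/`except` would never get the chance to record which input was being handled.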

Source code / logs

2017-11-08 12:24:52.838039: E tensorflow/stream_executor/cuda/cuda_driver.cc:1080] failed to synchronize the stop event: CUDA_ERROR_ILLEGAL_ADDRESS
2017-11-08 12:24:52.838090: E tensorflow/stream_executor/cuda/cuda_timer.cc:54] Internal: error destroying CUDA event in context 0x51f18f0: CUDA_ERROR_ILLEGAL_ADDRESS
2017-11-08 12:24:52.838106: E tensorflow/stream_executor/cuda/cuda_timer.cc:59] Internal: error destroying CUDA event in context 0x51f18f0: CUDA_ERROR_ILLEGAL_ADDRESS
2017-11-08 12:24:52.838137: F tensorflow/stream_executor/cuda/cuda_dnn.cc:3218] failed to set stream for cudnn handle: CUDNN_STATUS_MAPPING_ERROR
zsh: abort (core dumped)  ./train.py --data-dir data_road

About this issue

State: closed
Created 7 years ago
Reactions: 5
Comments: 34 (6 by maintainers)

Most upvoted comments

@zheng-xq @ljanyst We have a repro and a fix. Roll out is planned in cuDNN 7.0.5 mid-December.

juliebernauer on Nov 16, 2017
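
To tell whether a machine already has the release mentioned above, the cuDNN version that the dynamic loader picks up can be queried through `cudnnGetVersion()`. This is a minimal sketch, assuming a Linux install where the library is named `libcudnn.so.7`; the helper names are made up for illustration, and the integer decoding applies to cuDNN 7.x.

```python
import ctypes


def decode_cudnn_version(raw):
    """Split a cuDNN 7.x version integer into (major, minor, patch).

    cuDNN 7 encodes its version as major * 1000 + minor * 100 + patch,
    so 7.0.5 is reported as 7005.
    """
    major, rest = divmod(raw, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def loaded_cudnn_version(libname="libcudnn.so.7"):
    """Return the cuDNN version as a tuple, or None if the library is not found.

    The default library name is an assumption for a Linux cuDNN 7 install.
    """
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return decode_cudnn_version(lib.cudnnGetVersion())


def has_fix(version, fixed=(7, 0, 5)):
    """True if the version tuple is at or above the release said to carry the fix."""
    return version is not None and version >= fixed


if __name__ == "__main__":
    version = loaded_cudnn_version()
    print("cuDNN version:", version, "| at or above 7.0.5:", has_fix(version))
```

Note that this reports the library found by the loader, which may differ from the one TensorFlow was built against.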

I still encounter the same problem that others reported, with CUDA 9.0 and cuDNN 7.0.2.

When I tried cuDNN 7.1.2, I got a different error:

/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: cudnn PoolForward launch failed
         [[Node: AvgPool3D_15 = AvgPool3D[T=DT_FLOAT, data_format="NDHWC", ksize=[1, 2, 2, 2, 1], padding="SAME", strides=[1, 2, 2, 2, 1], _device="/job:localhost/replica:0/task:0/device:GPU:1"](ExpandDims_1)]]
         [[Node: mul_29/_23 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_47_mul_29", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'AvgPool3D_15', defined at:
 [lines related to customer code hidden]
  File "/home/xxx/.conda/envs/tf2/lib/python3.5/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 177, in avg_pool3d
    padding=padding, data_format=data_format, name=name)
  File "/home/xxx/.conda/envs/tf2/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/xxx/.conda/envs/tf2/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
    op_def=op_def)
  File "/home/xxx/.conda/envs/tf2/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): cudnn PoolForward launch failed
         [[Node: AvgPool3D_15 = AvgPool3D[T=DT_FLOAT, data_format="NDHWC", ksize=[1, 2, 2, 2, 1], padding="SAME", strides=[1, 2, 2, 2, 1], _device="/job:localhost/replica:0/task:0/device:GPU:1"](ExpandDims_1)]]
         [[Node: mul_29/_23 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_47_mul_29", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

This happens with all of the TensorFlow versions I tried, from 1.5 to 1.7.

lcnature on Jun 6, 2018

Came here to report the exact same thing with our Volta, using the TensorFlow container on NVIDIA GPU Cloud. We will be happy to test the fix with cuDNN 7.0.5 and follow up. Please let us know if there are any other updates on this issue or if more information is needed.

mholt on Nov 29, 2017

The synchronization error is only where the issue gets detected. The root cause is that some GPU kernel performed an illegal address access.

If someone wants to root-cause this, the first step is to find the offending kernel. In our past experience, it could be either a kernel bug or a degenerate data entry.

zheng-xq on Nov 8, 2017
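
CUDA kernel launches are asynchronous, which is why an illegal access tends to be reported by a later call, such as the event synchronization in the log. One common way to localize the offending kernel is to set the CUDA environment variable `CUDA_LAUNCH_BLOCKING=1` before CUDA is initialized, so that each launch completes before the next call and the error is reported closer to the op that caused it. A minimal sketch follows; the helper name is made up, and the TensorFlow import is left commented out so the snippet runs without it.

```python
import os


def force_synchronous_launches(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 so kernel launches block until completion.

    This must run before the process initializes CUDA (in practice,
    before importing TensorFlow); otherwise the setting has no effect.
    """
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


force_synchronous_launches()
# import tensorflow as tf  # import only after the variable is set
```

The same effect can be had without touching the script by prefixing the command from the report: `CUDA_LAUNCH_BLOCKING=1 ./train.py --data-dir data_road`. Expect training to run noticeably slower in this mode, so it is best used only while debugging.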

I have this problem on CUDA 10.0. I’m using TF 1.10.0, Keras 2.2.2, Windows 10, and an NVIDIA MX150 GPU. Some NNs work with no problem, some fail.

majthehero on Oct 4, 2018

Things work for me too now. Thanks @juliebernauer !

ljanyst on Dec 14, 2017

@juliebernauer it works after updating. Thanks a lot!

RerRayne on Dec 14, 2017