tensorflow-upstream: Seemingly random shape error during gradient calculation

edit: An important point I forgot to mention: I did not encounter this issue with the CUDA backend.


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Mint 19.1
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): binary (pypi)
  • TensorFlow version (use command below): v1.12.0-871-gf480b4a 1.12.0
  • Python version: 3.6.7
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • ROCm/MIOpen version: ROCm 2.1.96, MIOpen 1.7.1 (both installed through apt)
  • GPU model and memory: Radeon VII, 16GB (gfx906)


Describe the current behavior After training a model for a variable number of epochs, the program throws an exception because of incompatible shapes during gradient calculation for a tile op inside a tf.while_loop. The exception occurs inside the _TileGrad function, which interleaves the multiples and the input shape of the original tile op by stacking, transposing and reshaping. From what I could see by printing the input tensors and intermediate steps in _TileGrad, something goes wrong during this interleaving: the interleaved shape at times ends up as nonsense like [949434578 -1198049073 1 16 1 25], while something like [50 1 1 21 1 25] would be expected.
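
For reference, here is a minimal NumPy sketch of the interleaving that _TileGrad performs (it mirrors the stack/transpose/reshape sequence in array_grad.py as of TF 1.12; the multiples and input shape below correspond to the tile op in the reproduction code further down, with a sequence length of 21 assumed purely for illustration):

import numpy as np

# The Tile op in the loop body tiles a [1, T, 25] tensor by [batch, 1, 1],
# with batch = 50.  T = 21 is assumed here only for illustration.
multiples   = np.array([50, 1, 1], dtype=np.int32)
input_shape = np.array([1, 21, 25], dtype=np.int32)

# _TileGrad builds split_shape by interleaving multiples and input_shape:
# stack -> shape [2, rank], transpose -> [rank, 2], reshape(-1) -> interleaved.
split_shape = np.transpose(np.stack([multiples, input_shape])).reshape(-1)
print(split_shape)  # [50  1  1 21  1 25] -- the expected, sane interleaving

# The incoming gradient is then reshaped to split_shape and reduce-summed
# over the even axes to undo the tiling.
axes = np.arange(0, split_shape.size, 2)  # [0, 2, 4]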

The output of the transpose at one of these exceptions was:

 [[1036548730 1061580315]
 [-1110934980 -1085778476]
 [-1085903306 1061705196]]

resulting in the following interleaved shape: [1036548730 1061580315 -1110934980 -1085778476 -1085903306 1061705196]
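
Purely as an illustration (this reinterpretation is my own guess and is not verified anywhere in this issue): values like these can be reinterpreted bit-for-bit as float32 to see whether they look like ordinary tensor data rather than shape entries:

import numpy as np

# Hypothetical check: view the garbage int32 shape entries as float32
# bit patterns.  If they decode to unremarkable small floats, that would
# hint that float tensor data ended up where the int32 shape was expected.
garbage = np.array([1036548730, 1061580315, -1110934980,
                    -1085778476, -1085903306, 1061705196], dtype=np.int32)
print(garbage.view(np.float32))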

I wasn’t able to find the related stack output or input shapes, so I can’t tell if the shape error is caused by something further upstream. My reply to this issue includes an example with parallel_iterations=1, including all the steps.

A full stacktrace can be found at the bottom of this issue.

The error is somewhat hard to reproduce and seems to happen at random. I don’t believe it is directly related to tf.while_loop itself, as the exception never occurred in an RNN layer, which also relies on tf.while_loop internally.

Describe the expected behavior No InvalidArgumentError during gradient calculation.

Code to reproduce the issue I ran this code for about 25 minutes before the exception happened. It might not be the minimal code required to reproduce the error, but since it’s not reliably reproducible I can’t narrow it down easily.

import tensorflow as tf
import numpy as np

def loop_cond_dist(i, _l, hs, __ow, _dist):
    return tf.less(i, tf.shape(hs)[1])


def loop_body_dist(i, l, hs, out_weights, dist_lookup):
    # Look up distance embeddings for offsets 1 .. (timesteps - i), clipped to 50.
    dists = tf.nn.embedding_lookup(dist_lookup, tf.clip_by_value(tf.range(1, limit=tf.shape(hs)[1] - i + 1), 0, 50))
    dists = tf.expand_dims(dists, axis=0)            # [1, timesteps - i, 25]
    dists = tf.tile(dists, [tf.shape(hs)[0], 1, 1])  # [batch, timesteps - i, 25]; error seems to happen in gradients for this op
    cur = tf.einsum('ijk,kl -> ijl', dists, out_weights, name="out_mul")  # [batch, timesteps - i, 2]
    # Zero-pad so that cur lines up with the correct slice of the accumulator l.
    pre_pad = tf.zeros([tf.shape(l)[0], tf.shape(l)[1] - tf.reduce_sum(tf.range(tf.shape(hs)[1] - i + 1)), 2])
    post_pad = tf.zeros([tf.shape(l)[0], tf.reduce_sum(tf.range(tf.shape(hs)[1] - i)), 2])
    cur = tf.concat([pre_pad, cur, post_pad], axis=1)
    i += 1
    return i, tf.add(l, cur), hs, out_weights, dist_lookup

def build():
    dist_lookup = tf.get_variable('distance_embeds', dtype=tf.float32, shape=[51, 25])
    hs = tf.placeholder(dtype=tf.float32, shape=[None, None, 50])
    out_weights = tf.get_variable('out_weights', dtype=tf.float32, shape=[25, 2])
    logits = tf.zeros([50, tf.cast(((tf.shape(hs)[1] * tf.shape(hs)[1]) - tf.shape(hs)[1]) / 2, dtype=tf.float32), 2])
    loop_vars = [1, logits, hs, out_weights, dist_lookup]
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, name='clause_logits')[1]

    targets = tf.placeholder(tf.int32)

    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits)
    train = tf.train.AdamOptimizer(0.005).minimize(loss)
    return train, targets, hs

if __name__ == "__main__":
    with tf.Session() as sess:
        train, y, hs = build()
        sess.run([tf.global_variables_initializer()])
        while True:
            timesteps = np.random.randint(low=1, high=150)
            targets = np.random.randint(low=0, high=2, size=[50, int((timesteps*timesteps-timesteps)/2)])
            rand_hs = np.random.rand(50, timesteps, 50)
            _ = sess.run([train], {y: targets, hs: rand_hs})


Other info / logs

--------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1333     try:
-> 1334       return fn(*args)
   1335     except errors.OpError as e:

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1318       return self._call_tf_sessionrun(
-> 1319           options, feed_dict, fetch_list, target_list, run_metadata)
   1320 

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1406         self._session, options, feed_dict, fetch_list, target_list,
-> 1407         run_metadata)
   1408 

InvalidArgumentError: Size 2 must be non-negative, not -1110934980
	 [[{{node gradients/clause_logits/Tile_grad/Reshape_1}} = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
	 [[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]

During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
~/.cargo/toponn/python/bug.py in <module>
     45             targets = np.random.randint(low=0, high=2, size=[50, int((timesteps*timesteps-timesteps)/2)])
     46             rand_hs = np.random.rand(50, timesteps, 50)
---> 47             _ = sess.run([train], {y: targets, hs: rand_hs})
     48 

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    927     try:
    928       result = self._run(None, fetches, feed_dict, options_ptr,
--> 929                          run_metadata_ptr)
    930       if run_metadata:
    931         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1150     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1151       results = self._do_run(handle, final_targets, final_fetches,
-> 1152                              feed_dict_tensor, options, run_metadata)
   1153     else:
   1154       results = []

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1326     if handle is None:
   1327       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1328                            run_metadata)
   1329     else:
   1330       return self._do_call(_prun_fn, handle, feeds, fetches)

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1346           pass
   1347       message = error_interpolation.interpolate(message, self._graph)
-> 1348       raise type(e)(node_def, op, message)
   1349 
   1350   def _extend_graph(self):

InvalidArgumentError: Size 2 must be non-negative, not -1110934980
	 [[node gradients/clause_logits/Tile_grad/Reshape_1 (defined at /home/seb/.cargo/toponn/python/bug.py:34)  = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
	 [[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]

Caused by op 'gradients/clause_logits/Tile_grad/Reshape_1', defined at:
  File "/home/seb/.pyenv/versions/3.6.7/bin/ipython", line 10, in <module>
    sys.exit(start_ipython())
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/__init__.py", line 125, in start_ipython
    return launch_new_instance(argv=argv, **kwargs)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "</home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/decorator.py:decorator-gen-112>", line 2, in initialize
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/terminal/ipapp.py", line 323, in initialize
    self.init_code()
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 288, in init_code
    self._run_cmd_line_code()
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 408, in _run_cmd_line_code
    self._exec_file(fname, shell_futures=True)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 340, in _exec_file
    raise_exceptions=True)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2683, in safe_execfile
    self.compile if shell_futures else None)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/utils/py3compat.py", line 188, in execfile
    exec(compiler(f.read(), fname, 'exec'), glob, loc)

  File "/home/seb/.cargo/toponn/python/bug.py", line 39, in <module>
    train, y, hs = build()
  File "/home/seb/.cargo/toponn/python/bug.py", line 34, in build
    train = tf.train.AdamOptimizer(0.005).minimize(loss)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 400, in minimize
    grad_loss=grad_loss)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 519, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 674, in gradients
    unconnected_gradients)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 864, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 409, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 864, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/array_grad.py", line 599, in _TileGrad
    input_grad = math_ops.reduce_sum(array_ops.reshape(grad, split_shape), axes)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6482, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'clause_logits/Tile', defined at:
  File "/home/seb/.pyenv/versions/3.6.7/bin/ipython", line 10, in <module>
    sys.exit(start_ipython())
[elided 10 identical lines from previous traceback]
  File "/home/seb/.cargo/toponn/python/bug.py", line 39, in <module>
    train, y, hs = build()
  File "/home/seb/.cargo/toponn/python/bug.py", line 29, in build
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, name='clause_logits', parallel_iterations=250)[1]
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3295, in while_loop
    return_same_structure)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3007, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2942, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/seb/.cargo/toponn/python/bug.py", line 13, in loop_body_dist
    dists = tf.tile(dists, [tf.shape(hs)[0], 1, 1])
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 8805, in tile
    "Tile", input=input, multiples=multiples, name=name)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Size 2 must be non-negative, not -1110934980
	 [[node gradients/clause_logits/Tile_grad/Reshape_1 (defined at /home/seb/.cargo/toponn/python/bug.py:34)  = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
	 [[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]


About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 130

Most upvoted comments

/cc @dagamayank for awareness.

@sebpuetz I’m very sorry that this issue has lingered much longer than anyone would like. We’ve studied issues ranging from Python API implementation to LLVM code generation to Linux OS signal pool limitations to HBM memory timing, but I have to admit this particular ticket is a beast beyond anything @sunway513 and I have encountered before.

I’ll work with @dagamayank to see if we can enlist a fresh set of eyes and minds to help investigate this issue.

I think I’m getting close.

I see two groups of operations.

The first group allocates a temporary tensor, does some processing and then deallocates it (or, more precisely, returns the memory into the pool):

> 2019-10-20 00:01:41.596335: I tensorflow/core/common_runtime/executor.cc:1756] Process node: 2733 step 290 {{node transformer/parallel_0_5/transformer/transformer/body/decoder/layer_3/self_attention/multihead_attention/split_heads/transpose}} = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transformer/parallel_0_5/transformer/transformer/body/decoder/layer_3/self_attention/multihead_attention/split_heads/split_last_dimension/Reshape, transformer/parallel_0_5/transformer/transformer/body/decoder/layer_4/self_attention/multihead_attention/dot_product_attention/split_heads/transpose/perm) device: /job:localhost/replica:0/task:0/device:GPU:0
> 2019-10-20 00:01:41.596358: I tensorflow/core/framework/log_memory.cc:34] __LOG_MEMORY__ MemoryLogTensorAllocation { step_id: 290 kernel_name: "transformer/parallel_0_5/transformer/transformer/body/decoder/layer_3/self_attention/multihead_attention/split_heads/transpose" tensor { dtype: DT_FLOAT shape { dim { size: 210 } dim { size: 8 } dim { size: 16 } dim { size: 32 } } allocation_description { requested_bytes: 3440640 allocated_bytes: 3440640 allocator_name: "GPU_0_bfc" allocation_id: 98585 has_single_reference: true ptr: 139650873238528 } } }
> 2019-10-20 00:01:41.596391: I tensorflow/core/framework/log_memory.cc:34] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: 290 kernel_name: "transformer/parallel_0_5/transformer/transformer/body/decoder/layer_3/self_attention/multihead_attention/split_heads/transpose" tensor { dtype: DT_FLOAT shape { dim { size: 210 } dim { size: 8 } dim { size: 16 } dim { size: 32 } } allocation_description { requested_bytes: 3440640 allocated_bytes: 3440640 allocator_name: "GPU_0_bfc" allocation_id: 98585 has_single_reference: true ptr: 139650873238528 } } }
> 2019-10-20 00:01:41.596402: I tensorflow/core/common_runtime/executor.cc:1935] Synchronous kernel done: 2733 step 290
> 2019-10-20 00:01:41.596531: I tensorflow/core/common_runtime/executor.cc:1756] Process node: 2736 step 290 {{node transformer/parallel_0_5/transformer/transformer/body/decoder/layer_3/self_attention/multihead_attention/mul_3}} = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transformer/parallel_0_5/transformer/transformer/body/decoder/layer_3/self_attention/multihead_attention/split_heads/transpose, transformer/parallel_0_5/transformer/transformer/body/encoder/layer_0/self_attention/multihead_attention/mul_3/y) device: /job:localhost/replica:0/task:0/device:GPU:0
> 2019-10-20 00:01:41.596563: I tensorflow/core/framework/log_memory.cc:34] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: 290 kernel_name: "transformer/parallel_0_5/transformer/transformer/body/decoder/layer_3/self_attention/multihead_attention/mul_3" tensor { dtype: DT_FLOAT shape { dim { size: 210 } dim { size: 8 } dim { size: 16 } dim { size: 32 } } allocation_description { requested_bytes: 3440640 allocated_bytes: 3440640 allocator_name: "GPU_0_bfc" allocation_id: 98585 ptr: 139650873238528 } } }
> 2019-10-20 00:01:41.596574: I tensorflow/core/common_runtime/executor.cc:1935] Synchronous kernel done: 2736 step 290
> (...)
> 2019-10-20 00:01:41.596931: I tensorflow/core/framework/log_memory.cc:34] __LOG_MEMORY__ MemoryLogTensorDeallocation { allocation_id: 98585 allocator_name: "GPU_0_bfc" }

Shortly thereafter, the second group attempts to upload a tensor from the host to the GPU and, for that, allocates some GPU memory, which turns out to be exactly the same block (same ptr 139650873238528) that the first group just returned:


2019-10-20 00:01:41.597429: I tensorflow/core/common_runtime/executor.cc:1756] Process node: 2751 step 290 {{node transformer/parallel_0_5/transformer/transformer/body/decoder/layer_3/self_attention/multihead_attention/dot_product_attention/einsum/strided_slice_1/_1507}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5859_...ed_slice_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](^transformer/parallel_0_5/transformer/transformer/body/decoder/layer_3/self_attention/multihead_attention/dot_product_attention/einsum/strided_slice_1/_1506) device: /job:localhost/replica:0/task:0/device:GPU:0
2019-10-20 00:01:41.597454: I tensorflow/core/framework/log_memory.cc:34] __LOG_MEMORY__ MemoryLogTensorAllocation { step_id: -6 kernel_name: "Unknown (with attributes)" tensor { dtype: DT_INT32 shape { } allocation_description { requested_bytes: 4 allocated_bytes: 256 allocator_name: "GPU_0_bfc" allocation_id: 98589 has_single_reference: true ptr: 139650873238528 } } }
2019-10-20 00:01:41.597468: I tensorflow/stream_executor/stream.cc:1983] [stream=0x555ac6bd1db0,impl=0x555ac6734990] Called Stream::ThenWaitFor(other=0x555ac6bd1bf0)
2019-10-20 00:01:41.597489: I tensorflow/stream_executor/stream.cc:4933] [stream=0x555ac6bd1db0,impl=0x555ac6734990] Called Stream::ThenMemcpy(gpu_dst=0x7f0300b12400, host_src=0x7f03e0202f00, size=4)
2019-10-20 00:01:41.597510: I tensorflow/stream_executor/rocm/rocm_driver.cc:988] successfully enqueued async memcpy h2d of 4 bytes from 0x7f03e0202f00 to 0x7f0300b12400 on stream 0x555ac5ff93a0 starting with 16

Now, the first group does not actually wait for completion before returning - at least I don’t see any synchronization calls. (Even though the kernels are described as “synchronous”, that does not seem to be accurate. They are all just queued up into the stream.) Therefore it is entirely possible that, at the time 00:01:41.597468, the last kernel from group 1 is still running!

Which brings us to

Called Stream::ThenWaitFor(other=0x555ac6bd1bf0)

This is the call that should force the upload stream (the one in which memcpy h2d is executed) to wait for the completion of the execution stream (the one that was executing everything in the first group).

It evaluates to a call to hipEventRecord() on stream 1 followed by a call to hipStreamWaitEvent on stream 2.

If this call were to fail, it would produce the exact symptoms I’m seeing.

The specific reason why it fails eludes me for now, but it does look like it is the culprit (if I replace it with Stream::BlockHostUntilDone(), the crash seems to stop happening).

– UPDATE:

We get here https://github.com/ROCm-Developer-Tools/HIP/blob/master/src/hip_hcc.cpp#L321

void ihipStream_t::locked_streamWaitEvent(ihipEventData_t& ecd) {
    LockedAccessor_StreamCrit_t crit(_criticalData);

    crit->_av.create_blocking_marker(ecd.marker(), hc::accelerator_scope);
}

which means an “agent-scope fence”

https://github.com/RadeonOpenCompute/hcc/blob/clang_tot_upgrade/lib/hsa/mcwamp_hsa.cpp#L4988

which apparently means that it does not flush the L2 cache. That is probably wrong and could result in exactly these problems even though the synchronization is otherwise done correctly:

  • Group 1 deallocates the memory, but its data is still in the L2 cache
  • Group 2 tries to write to the same memory without invalidating the cache (?)
  • Subsequent attempts to access the memory result in cache hits and pull old data

@sebpuetz We are working on this issue and will provide an update here when we have a fix. Unfortunately, we cannot provide an ETA for the fix at the moment.

I understand you said there is no ETA for the fix, but have you made any progress? Soon this issue will be six months old.

To those experiencing the issue: try running your code after setting the environment variable HCC_FORCE_CROSS_QUEUE_FLUSH=3.
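
For example (assuming the variable only needs to be present in the process environment before TensorFlow initializes the ROCm runtime), it can be set from inside a Python script like this:

import os

# Workaround flag suggested above; set it before TensorFlow touches the GPU,
# e.g. at the very top of the training script (or export it in the shell instead).
os.environ["HCC_FORCE_CROSS_QUEUE_FLUSH"] = "3"

import tensorflow as tf  # imported only after the environment variable is set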

Hi all, I am seeing the same error on my gfx803 card, except that I am using Keras + TensorFlow 1.14.

This error did not seem to occur with tf 1.13 for me.

Please let me know if there is any information you need from me that could help.

Thanks

Some update on this issue.

After discussing with the compiler team and doing additional testing, it’s unlikely that this issue is caused by the one raised at https://github.com/RadeonOpenCompute/hcc/issues/1114; that particular issue fails deterministically, while this one does not.

I’ve also tried amending several kernels used in this ticket with the attribute amdgpu_flat_work_group_size(1,1024), but I couldn’t get the test to fail either with or without the attribute, so it’s hard to narrow down the issue at this point.

Thus far, based on the logs in this ticket, we can guess that:

  • some form of memory corruption caused the applications to fail
  • serializing kernel execution prevents the issue from happening

One theory is that the dynamic LDS (local data share) memory used by certain kernels in the applications in this ticket may be incorrect. Unlike statically sized __shared__ arrays, whose sizes are determined at compile time, allocations marked as extern __shared__ in CUDA and HIP_DYNAMIC_SHARED in ROCm are assigned by the lower-level runtime. I’m looking into this avenue.

Your suggestion didn’t break after running for roughly 40 minutes. I then tried a different version that doesn’t contain the reduce_sum, which crashed after 30 minutes with a shape error. I’m now running your workaround again to see if it breaks eventually. The version without the reduce_sum:

import tensorflow as tf
import numpy as np


def loop_cond_dist(i, _l, _x):
    return tf.less(i, 200)


def loop_body_dist(i, l, x):
    lookup = tf.nn.embedding_lookup(x, tf.zeros(200 - i, dtype=tf.int32))
    cur = tf.tile(tf.expand_dims(lookup, axis=0), [50, 1, 1]) #Error seems to happen in gradients for this op
    return i + 1, tf.concat([l, cur], axis=1), x

def build():
    x = tf.get_variable("x", dtype=tf.float32, shape=[200, 2])
    logits = tf.zeros([50, 0, 2])
    loop_vars = [tf.constant(1), logits, x]
    shape_invariants = [tf.TensorShape(None), tf.TensorShape([50, None, 2]), tf.TensorShape([200, 2])]
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, shape_invariants=shape_invariants, parallel_iterations=200)[1]
    targets = tf.placeholder(tf.int32)

    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits)
    train = tf.train.AdamOptimizer(0.005).minimize(loss)
    return train, targets


if __name__ == "__main__":
    with tf.Session() as sess:
        train, y = build()
        sess.run([tf.global_variables_initializer()])
        while True:
            targets = np.random.randint(low=0, high=2, size=[50, 19900])
            _ = sess.run([train], {y: targets})

Hi @sebpuetz, I’m trying to reproduce the issue and will update when I have more clues.