tensorflow-upstream: Seemingly random shape error during gradient calculation
Edit: an important point I forgot to mention: I did not encounter this issue with the CUDA backend.
Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Mint 19.1
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary): binary (pypi)
- TensorFlow version (use command below): v1.12.0-871-gf480b4a 1.12.0
- Python version: 3.6.7
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- ROCm/MIOpen version: ROCm 2.1.96, MIOpen 1.7.1 (both installed through apt)
- GPU model and memory: Radeon VII, 16GB (gfx906)
You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)".
Describe the current behavior
After training a model for a variable number of epochs, the program throws an exception because of incompatible shapes during gradient calculation for a tile op inside a tf.while_loop. The exception occurs inside the _TileGrad method, which interleaves the multiples and the input shape of the original tile op by stacking, transposing and reshaping. From the behaviour I could observe by printing the input tensors and intermediate steps in _TileGrad, it seems that something goes wrong during this interleaving. The interleaved shape at times ends up as nonsense like [949434578 -1198049073 1 16 1 25], while something like [50 1 1 21 1 25] would be expected.
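For reference, the interleaving step in _TileGrad amounts to roughly the following (a minimal NumPy sketch; the concrete values of multiples and input_shape are assumptions chosen to match the expected shape above):

import numpy as np

# Sketch of the stack/transpose/reshape done in _TileGrad (array_grad.py).
# Assumed example values matching the shapes in this report:
multiples = np.array([50, 1, 1])     # multiples passed to tf.tile in the loop body
input_shape = np.array([1, 21, 25])  # shape of the un-tiled dists tensor

stacked = np.stack([multiples, input_shape])     # shape (2, 3)
split_shape = np.transpose(stacked).reshape(-1)  # interleave the two vectors
print(split_shape)                               # [50  1  1 21  1 25]

# The gradient is then reshaped to split_shape and reduce-summed over the even
# axes; corrupted multiples/input_shape at this point produce the garbage
# shapes described above.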
The output of the transpose at one of these exceptions was:
[[1036548730 1061580315]
[-1110934980 -1085778476]
[-1085903306 1061705196]]
resulting in the following interleaved shape:
[1036548730 1061580315 -1110934980 -1085778476 -1085903306 1061705196]
I wasn’t able to find the related stack output or input shapes, so I can’t tell if the shape error is caused by something further upstream. My reply to this issue includes an example with parallel_iterations=1, including all the steps.
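For reference, running the loop sequentially only changes the tf.while_loop call in the repro code below (a one-line sketch, not the full example from my reply):

logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars,
                       name='clause_logits', parallel_iterations=1)[1]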
A full stacktrace can be found at the bottom of this issue.
The error is somewhat hard to reproduce and seems to happen at random. I don't believe it is directly related to tf.while_loop, as the exception never occurred in an RNN layer.
Describe the expected behavior
No InvalidArgumentError during gradient calculation.
Code to reproduce the issue
I ran this code for about 25 minutes before the exception happened. It might not be the minimal code required to reproduce the error, but since it's not reliably reproducible I can't narrow it down easily.
import tensorflow as tf
import numpy as np

def loop_cond_dist(i, _l, hs, __ow, _dist):
    return tf.less(i, tf.shape(hs)[1])

def loop_body_dist(i, l, hs, out_weights, dist_lookup):
    dists = tf.nn.embedding_lookup(dist_lookup, tf.clip_by_value(tf.range(1, limit=tf.shape(hs)[1] - i + 1), 0, 50))
    dists = tf.expand_dims(dists, axis=0)
    dists = tf.tile(dists, [tf.shape(hs)[0], 1, 1])  # Error seems to happen in gradients for this op
    cur = tf.einsum('ijk,kl -> ijl', dists, out_weights, name="out_mul")
    pre_pad = tf.zeros([tf.shape(l)[0], tf.shape(l)[1] - tf.reduce_sum(tf.range(tf.shape(hs)[1] - i + 1)), 2])
    post_pad = tf.zeros([tf.shape(l)[0], tf.reduce_sum(tf.range(tf.shape(hs)[1] - i)), 2])
    cur = tf.concat([pre_pad, cur, post_pad], axis=1)
    i += 1
    return i, tf.add(l, cur), hs, out_weights, dist_lookup

def build():
    dist_lookup = tf.get_variable('distance_embeds', dtype=tf.float32, shape=[51, 25])
    hs = tf.placeholder(dtype=tf.float32, shape=[None, None, 50])
    out_weights = tf.get_variable('out_weights', dtype=tf.float32, shape=[25, 2])
    logits = tf.zeros([50, tf.cast(((tf.shape(hs)[1] * tf.shape(hs)[1]) - tf.shape(hs)[1]) / 2, dtype=tf.float32), 2])
    loop_vars = [1, logits, hs, out_weights, dist_lookup]
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, name='clause_logits')[1]
    targets = tf.placeholder(tf.int32)
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits)
    train = tf.train.AdamOptimizer(0.005).minimize(loss)
    return train, targets, hs

if __name__ == "__main__":
    with tf.Session() as sess:
        train, y, hs = build()
        sess.run([tf.global_variables_initializer()])
        while True:
            timesteps = np.random.randint(low=1, high=150)
            targets = np.random.randint(low=0, high=2, size=[50, int((timesteps*timesteps-timesteps)/2)])
            rand_hs = np.random.rand(50, timesteps, 50)
            _ = sess.run([train], {y: targets, hs: rand_hs})
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
Other info / logs
--------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1333 try:
-> 1334 return fn(*args)
1335 except errors.OpError as e:
~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
1318 return self._call_tf_sessionrun(
-> 1319 options, feed_dict, fetch_list, target_list, run_metadata)
1320
~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
1406 self._session, options, feed_dict, fetch_list, target_list,
-> 1407 run_metadata)
1408
InvalidArgumentError: Size 2 must be non-negative, not -1110934980
[[{{node gradients/clause_logits/Tile_grad/Reshape_1}} = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
[[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]
During handling of the above exception, another exception occurred:
InvalidArgumentError Traceback (most recent call last)
~/.cargo/toponn/python/bug.py in <module>
45 targets = np.random.randint(low=0, high=2, size=[50, int((timesteps*timesteps-timesteps)/2)])
46 rand_hs = np.random.rand(50, timesteps, 50)
---> 47 _ = sess.run([train], {y: targets, hs: rand_hs})
48
~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
927 try:
928 result = self._run(None, fetches, feed_dict, options_ptr,
--> 929 run_metadata_ptr)
930 if run_metadata:
931 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1150 if final_fetches or final_targets or (handle and feed_dict_tensor):
1151 results = self._do_run(handle, final_targets, final_fetches,
-> 1152 feed_dict_tensor, options, run_metadata)
1153 else:
1154 results = []
~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1326 if handle is None:
1327 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1328 run_metadata)
1329 else:
1330 return self._do_call(_prun_fn, handle, feeds, fetches)
~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1346 pass
1347 message = error_interpolation.interpolate(message, self._graph)
-> 1348 raise type(e)(node_def, op, message)
1349
1350 def _extend_graph(self):
InvalidArgumentError: Size 2 must be non-negative, not -1110934980
[[node gradients/clause_logits/Tile_grad/Reshape_1 (defined at /home/seb/.cargo/toponn/python/bug.py:34) = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
[[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]
Caused by op 'gradients/clause_logits/Tile_grad/Reshape_1', defined at:
File "/home/seb/.pyenv/versions/3.6.7/bin/ipython", line 10, in <module>
sys.exit(start_ipython())
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/__init__.py", line 125, in start_ipython
return launch_new_instance(argv=argv, **kwargs)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
app.initialize(argv)
File "</home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/decorator.py:decorator-gen-112>", line 2, in initialize
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
return method(app, *args, **kwargs)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/terminal/ipapp.py", line 323, in initialize
self.init_code()
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 288, in init_code
self._run_cmd_line_code()
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 408, in _run_cmd_line_code
self._exec_file(fname, shell_futures=True)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 340, in _exec_file
raise_exceptions=True)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2683, in safe_execfile
self.compile if shell_futures else None)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/utils/py3compat.py", line 188, in execfile
exec(compiler(f.read(), fname, 'exec'), glob, loc)
File "/home/seb/.cargo/toponn/python/bug.py", line 39, in <module>
train, y, hs = build()
File "/home/seb/.cargo/toponn/python/bug.py", line 34, in build
train = tf.train.AdamOptimizer(0.005).minimize(loss)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 400, in minimize
grad_loss=grad_loss)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 519, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 674, in gradients
unconnected_gradients)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 864, in _GradientsHelper
lambda: grad_fn(op, *out_grads))
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 409, in _MaybeCompile
return grad_fn() # Exit early
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 864, in <lambda>
lambda: grad_fn(op, *out_grads))
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/array_grad.py", line 599, in _TileGrad
input_grad = math_ops.reduce_sum(array_ops.reshape(grad, split_shape), axes)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6482, in reshape
"Reshape", tensor=tensor, shape=shape, name=name)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
...which was originally created as op 'clause_logits/Tile', defined at:
File "/home/seb/.pyenv/versions/3.6.7/bin/ipython", line 10, in <module>
sys.exit(start_ipython())
[elided 10 identical lines from previous traceback]
File "/home/seb/.cargo/toponn/python/bug.py", line 39, in <module>
train, y, hs = build()
File "/home/seb/.cargo/toponn/python/bug.py", line 29, in build
logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, name='clause_logits', parallel_iterations=250)[1]
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3295, in while_loop
return_same_structure)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3007, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2942, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/home/seb/.cargo/toponn/python/bug.py", line 13, in loop_body_dist
dists = tf.tile(dists, [tf.shape(hs)[0], 1, 1])
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 8805, in tile
"Tile", input=input, multiples=multiples, name=name)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Size 2 must be non-negative, not -1110934980
[[node gradients/clause_logits/Tile_grad/Reshape_1 (defined at /home/seb/.cargo/toponn/python/bug.py:34) = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
[[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
About this issue
- State: closed
- Created 5 years ago
- Comments: 130
/cc @dagamayank for awareness.
@sebpuetz I'm very sorry that this issue has lingered much longer than anyone would like. We've studied issues ranging from Python API implementation to LLVM code generation to Linux OS signal pool limitations to HBM memory timing, but I have to admit this particular ticket is a beast beyond anything @sunway513 and I have ever encountered.
I'll work with @dagamayank to see if we can enlist a fresh set of eyes/minds to help investigate this issue.
I think I’m getting close.
I see two groups of operations.
The first group allocates a temporary tensor, does some processing and then deallocates it (or, more precisely, returns the memory into the pool):
Shortly thereafter, the second group attempts to upload a tensor from the host to the GPU, and, for that, allocates some GPU memory, which turns out to be exactly the same:
Now, the first group does not actually wait for completion before returning - at least I don’t see any synchronization calls. (Even though the kernels are described as “synchronous”, that does not seem to be accurate. They are all just queued up into the stream.) Therefore it is entirely possible that, at the time 00:01:41.597468, the last kernel from group 1 is still running!
Which brings us to
This is the call that should force the upload stream (the one in which memcpy h2d is executed) to wait for the completion of the execution stream (the one that was executing everything in the first group).
It evaluates to a call to hipEventRecord() on stream 1 followed by a call to hipStreamWaitEvent on stream 2.
If this call were to fail, it would produce the exact symptoms I’m seeing.
The specific reason why it fails eludes me for now, but it does look like it is the culprit (if I replace it with Stream::BlockHostUntilDone(), the crash seems to stop happening).
– UPDATE:
We get here https://github.com/ROCm-Developer-Tools/HIP/blob/master/src/hip_hcc.cpp#L321
which means an “agent-scope fence”
https://github.com/RadeonOpenCompute/hcc/blob/clang_tot_upgrade/lib/hsa/mcwamp_hsa.cpp#L4988
which apparently means that it does not flush the L2 cache. Which is probably wrong and could result in these problems even though the synchronization is otherwise done correctly:
I understand you said there is no ETA for the fix, but did you make any progress? Soon this issue will be 6 months old.
To those experiencing the issue, try to run your code after setting the env var HCC_FORCE_CROSS_QUEUE_FLUSH=3.
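For example, from Python (a minimal sketch; that the variable must be set before TensorFlow is imported is an assumption, exporting it in the shell before launching the script is the safer option):

import os

# Workaround suggested above; assumed to take effect only if set before the
# ROCm/HCC runtime is initialized, i.e. before importing TensorFlow.
os.environ["HCC_FORCE_CROSS_QUEUE_FLUSH"] = "3"

import tensorflow as tf  # imported only after the variable is set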
Hi all, I am seeing the same error on my gfx803 card, except I am using Keras + TensorFlow 1.14.
This error did not seem to occur with tf 1.13 for me.
Please let me know if there is any information you need from me that could help.
Thanks
Some updates on this issue.
After discussing with the compiler team and doing additional testing, it's unlikely that this issue is caused by the one raised at https://github.com/RadeonOpenCompute/hcc/issues/1114. That particular issue fails deterministically, but this one does not.
I've also tried to amend several kernels used in this ticket with the attribute amdgpu_flat_work_group_size(1,1024), but I couldn't get the test to fail with or without the attribute, so it's hard to narrow down the issue at this point.
Thus far, based on the logs in this ticket, we can guess the following:
One theory is that the dynamic LDS memory used by certain kernels in the applications in this ticket may be incorrect. Unlike __shared__ allocations, which are determined at compile time, those marked as extern __shared__ in CUDA and HIP_DYNAMIC_SHARED in ROCm are assigned by the lower-level runtime. I'm looking into this avenue.

Your suggestion didn't break after running for roughly 40 minutes. I then tried a different version that doesn't contain the reduce_sum, which crashed after 30 minutes with a shape error. I'm now running your workaround again to see if it will break eventually.

Hi @sebpuetz, I'm trying to reproduce the issue, will update when I have more clues.