tensorflow: element-wise multiplication overflow with large-dimension tensors
Issue Type
Bug
Have you reproduced the bug with TF nightly?
Yes
Source
source
Tensorflow Version
2.9.3
Custom Code
Yes
OS Platform and Distribution
Ubuntu 20.04
Mobile device
No response
Python version
3.8
Bazel version
No response
GCC/Compiler version
No response
CUDA/cuDNN version
11.6/8
GPU model and memory
single A100 80G
Current Behaviour?
Tested with tensorflow==2.9.3 and numpy==1.24.2 on a single A100 80GB GPU. With a smaller-memory GPU you may hit OOM before reproducing the issue.
With shape (524288, 16, 9, 32) the multiply triggers an illegal memory access; with shape (524288, 16, 8, 32) it produces wrong values (Mismatched elements: 1024 / 2147483648 (4.77e-05%)); with shape (524288, 16, 7, 32) the values are correct.
The behavior is the same in eager mode and graph mode. Note: a related issue has been reported at https://github.com/keras-team/tf-keras/issues/124
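For reference, a quick element-count check (a small sketch added here, not part of the original report) shows that the three shapes straddle the 2**31 boundary of a signed 32-bit index:

import numpy as np

# Element counts relative to 2**31 = 2147483648 (the int32 indexing limit).
for shape in [(524288, 16, 7, 32), (524288, 16, 8, 32), (524288, 16, 9, 32)]:
    n = int(np.prod(shape, dtype=np.int64))
    print(shape, n, n / 2**31)
# (524288, 16, 7, 32) 1879048192 0.875  -> correct values
# (524288, 16, 8, 32) 2147483648 1.0    -> mismatched values
# (524288, 16, 9, 32) 2415919104 1.125  -> illegal memory access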
Standalone code to reproduce the issue
import tensorflow as tf
import numpy as np


def test_mul_eager(input_shape):
    # Eager-mode reproduction: broadcasted element-wise multiply on the GPU,
    # then copy everything back to the CPU and compare against numpy.
    rng = np.random.RandomState(42)
    grad = rng.exponential(size=input_shape).astype(np.float32)
    grad_loss = rng.exponential(size=(input_shape[0], 1, 1, 1)).astype(np.float32)
    with tf.device('/GPU:0'):
        tf_grad = tf.convert_to_tensor(grad)
        tf_grad_loss = tf.convert_to_tensor(grad_loss)
        out = tf_grad * tf_grad_loss
        # tf.print("==== shape ", tf_grad.shape, tf_grad_loss.shape, out.shape)
    with tf.device('/CPU:0'):
        out_cpu = tf.identity(out)
        tf_grad_cpu = tf.identity(tf_grad)
        tf_grad_loss_cpu = tf.identity(tf_grad_loss)

    np.testing.assert_allclose(grad, tf_grad_cpu.numpy(), rtol=1e-5, atol=1e-4)
    np.testing.assert_allclose(grad_loss, tf_grad_loss_cpu.numpy(), rtol=1e-5, atol=1e-4)
    np.testing.assert_allclose(grad * grad_loss, out_cpu.numpy(), rtol=1e-5, atol=1e-4)


@tf.function
def compute_mul(b, t, u, v):
    # Graph-mode reproduction: build the inputs on the CPU, multiply on the GPU.
    with tf.device('/CPU:0'):
        x = tf.random.normal((1, t, u, v), dtype=tf.float32)
        y = tf.random.normal((1, 1, 1, 1), dtype=tf.float32)
        tf_grad = tf.tile(x, (b, 1, 1, 1))
        tf_grad_loss = tf.tile(y, (b, 1, 1, 1))
    with tf.device('/GPU:0'):
        out = tf_grad * tf_grad_loss
        # out = tf.raw_ops.Mul(x=tf_grad, y=tf_grad_loss)
        # out = tf.multiply(tf_grad, tf_grad_loss)
    with tf.device('/CPU:0'):
        out_cpu = tf.identity(out)
        tf_grad_cpu = tf.identity(tf_grad)
        tf_grad_loss_cpu = tf.identity(tf_grad_loss)
    return out_cpu, tf_grad_cpu, tf_grad_loss_cpu


def test_mul_graph(input_shape):
    b, t, u, v = input_shape
    out_cpu, tf_grad_cpu, tf_grad_loss_cpu = compute_mul(b, t, u, v)
    np.testing.assert_allclose(tf_grad_cpu.numpy() * tf_grad_loss_cpu.numpy(), out_cpu.numpy(), rtol=1e-5, atol=1e-4)


if __name__ == '__main__':
    # input_shape = (524288, 16, 7, 32)  # passes: < 2**31 elements
    input_shape = (524288, 16, 8, 32)    # value mismatch at exactly 2**31 elements
    # input_shape = (524288, 16, 9, 32)  # illegal memory access: > 2**31 elements
    # test_mul_eager(input_shape)
    test_mul_graph(input_shape)
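One possible workaround until a fix lands (a sketch added here, not from the original report; the name chunked_mul and the chunk size are illustrative) is to split the batch dimension so each GPU multiply stays well below 2**31 elements:

import tensorflow as tf

def chunked_mul(tf_grad, tf_grad_loss, chunk=65536):
    # Multiply in batch slices so each GPU kernel launch sees far fewer than
    # 2**31 elements; concatenate on the CPU to stay conservative about other
    # large-tensor GPU kernels.
    outs = []
    for start in range(0, int(tf_grad.shape[0]), chunk):
        with tf.device('/GPU:0'):
            outs.append(tf_grad[start:start + chunk] * tf_grad_loss[start:start + chunk])
    with tf.device('/CPU:0'):
        return tf.concat(outs, axis=0)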
Relevant log output
When using (524288, 16, 8, 32), the run fails with:
Traceback (most recent call last):
File "multiplication_mismatch.py", line 93, in <module>
test_mul_graph(input_shape)
File "multiplication_mismatch.py", line 85, in test_mul_graph
np.testing.assert_allclose(tf_grad_cpu.numpy()*tf_grad_loss_cpu.numpy(), out_cpu.numpy(), rtol=1e-5, atol=1e-4)
File "/fs/scratch/work/yu_fang/warp-transducer/venv_tf_29/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1592, in assert_allclose
assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
File "/usr/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/fs/scratch/work/yu_fang/warp-transducer/venv_tf_29/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 862, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-05, atol=0.0001
Mismatched elements: 1024 / 2147483648 (4.77e-05%)
Max absolute difference: 2.2970564
Max relative difference: 0.
x: array([[[[ 1.605678, -0.261173, -1.222985, ..., -1.186496, -0.111071,
0.792078],
[ 0.307934, 0.016565, -0.576156, ..., -0.17745 , -0.993849,...
y: array([[[[ 0. , 0. , 0. , ..., 0. , 0. ,
0. ],
[ 0. , 0. , 0. , ..., 0. , 0. ,...
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 24 (12 by maintainers)
https://github.com/tensorflow/tensorflow/commit/cb2b5456ee5f74d5bacd96672db9251d519e1f02
I did a little extra testing, and it turns out the threshold for “large tensors” is wrong. Thanks for finding this! I will land a fix soon. In the meantime, it will work with tensors > 2**32 elements.
I would think you do not need the flag at this point since the kernels are enabled by default. I will add a test for mul and see if I can reproduce this. What dtype is this?