tensorflow: element-wise multiplication overflow with large-dimension tensors
Issue Type
Bug
Have you reproduced the bug with TF nightly?
Yes
Source
source
Tensorflow Version
2.9.3
Custom Code
Yes
OS Platform and Distribution
Ubuntu 20.04
Mobile device
No response
Python version
3.8
Bazel version
No response
GCC/Compiler version
No response
CUDA/cuDNN version
11.6/8
GPU model and memory
single A100 80G
Current Behaviour?
Tested with tensorflow==2.9.3 and numpy==1.24.2 on a single A100 80GB GPU. With a smaller-memory GPU you may hit OOM before reproducing the issue.
With shape (524288, 16, 9, 32) the multiply triggers an illegal memory access; with shape (524288, 16, 8, 32) it produces wrong values (Mismatched elements: 1024 / 2147483648 (4.77e-05%)); with shape (524288, 16, 7, 32) the values are correct.
The behavior is the same in eager mode and graph mode. Note: a related issue has been reported at https://github.com/keras-team/tf-keras/issues/124
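For reference, a quick element-count check (a small sketch added here, not part of the original report) shows that the three shapes straddle the 2**31 boundary of a signed 32-bit index:

import numpy as np

# Element counts relative to 2**31 = 2147483648 (the int32 indexing limit).
for shape in [(524288, 16, 7, 32), (524288, 16, 8, 32), (524288, 16, 9, 32)]:
    n = int(np.prod(shape, dtype=np.int64))
    print(shape, n, n / 2**31)
# (524288, 16, 7, 32) 1879048192 0.875  -> correct values
# (524288, 16, 8, 32) 2147483648 1.0    -> mismatched values
# (524288, 16, 9, 32) 2415919104 1.125  -> illegal memory access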
Standalone code to reproduce the issue
import tensorflow as tf
import numpy as np


def test_mul_eager(input_shape):
    # Eager-mode reproduction: broadcasted element-wise multiply on the GPU,
    # then copy everything back to the CPU and compare against numpy.
    rng = np.random.RandomState(42)
    grad = rng.exponential(size=input_shape).astype(np.float32)
    grad_loss = rng.exponential(size=(input_shape[0], 1, 1, 1)).astype(np.float32)
    with tf.device('/GPU:0'):
        tf_grad = tf.convert_to_tensor(grad)
        tf_grad_loss = tf.convert_to_tensor(grad_loss)
        out = tf_grad * tf_grad_loss
        # tf.print("==== shape ", tf_grad.shape, tf_grad_loss.shape, out.shape)
    with tf.device('/CPU:0'):
        out_cpu = tf.identity(out)
        tf_grad_cpu = tf.identity(tf_grad)
        tf_grad_loss_cpu = tf.identity(tf_grad_loss)

    np.testing.assert_allclose(grad, tf_grad_cpu.numpy(), rtol=1e-5, atol=1e-4)
    np.testing.assert_allclose(grad_loss, tf_grad_loss_cpu.numpy(), rtol=1e-5, atol=1e-4)
    np.testing.assert_allclose(grad * grad_loss, out_cpu.numpy(), rtol=1e-5, atol=1e-4)


@tf.function
def compute_mul(b, t, u, v):
    # Graph-mode reproduction: build the inputs on the CPU, multiply on the GPU.
    with tf.device('/CPU:0'):
        x = tf.random.normal((1, t, u, v), dtype=tf.float32)
        y = tf.random.normal((1, 1, 1, 1), dtype=tf.float32)
        tf_grad = tf.tile(x, (b, 1, 1, 1))
        tf_grad_loss = tf.tile(y, (b, 1, 1, 1))
    with tf.device('/GPU:0'):
        out = tf_grad * tf_grad_loss
        # out = tf.raw_ops.Mul(x=tf_grad, y=tf_grad_loss)
        # out = tf.multiply(tf_grad, tf_grad_loss)
    with tf.device('/CPU:0'):
        out_cpu = tf.identity(out)
        tf_grad_cpu = tf.identity(tf_grad)
        tf_grad_loss_cpu = tf.identity(tf_grad_loss)
    return out_cpu, tf_grad_cpu, tf_grad_loss_cpu


def test_mul_graph(input_shape):
    b, t, u, v = input_shape
    out_cpu, tf_grad_cpu, tf_grad_loss_cpu = compute_mul(b, t, u, v)
    np.testing.assert_allclose(tf_grad_cpu.numpy() * tf_grad_loss_cpu.numpy(), out_cpu.numpy(), rtol=1e-5, atol=1e-4)


if __name__ == '__main__':
    # input_shape = (524288, 16, 7, 32)  # passes: < 2**31 elements
    input_shape = (524288, 16, 8, 32)    # value mismatch at exactly 2**31 elements
    # input_shape = (524288, 16, 9, 32)  # illegal memory access: > 2**31 elements
    # test_mul_eager(input_shape)
    test_mul_graph(input_shape)
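One possible workaround until a fix lands (a sketch added here, not from the original report; the name chunked_mul and the chunk size are illustrative) is to split the batch dimension so each GPU multiply stays well below 2**31 elements:

import tensorflow as tf

def chunked_mul(tf_grad, tf_grad_loss, chunk=65536):
    # Multiply in batch slices so each GPU kernel launch sees far fewer than
    # 2**31 elements; concatenate on the CPU to stay conservative about other
    # large-tensor GPU kernels.
    outs = []
    for start in range(0, int(tf_grad.shape[0]), chunk):
        with tf.device('/GPU:0'):
            outs.append(tf_grad[start:start + chunk] * tf_grad_loss[start:start + chunk])
    with tf.device('/CPU:0'):
        return tf.concat(outs, axis=0)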
Relevant log output
When using (524288, 16, 8, 32), the run fails with:
Traceback (most recent call last):
File "multiplication_mismatch.py", line 93, in <module>
test_mul_graph(input_shape)
File "multiplication_mismatch.py", line 85, in test_mul_graph
np.testing.assert_allclose(tf_grad_cpu.numpy()*tf_grad_loss_cpu.numpy(), out_cpu.numpy(), rtol=1e-5, atol=1e-4)
File "/fs/scratch/work/yu_fang/warp-transducer/venv_tf_29/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1592, in assert_allclose
assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
File "/usr/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/fs/scratch/work/yu_fang/warp-transducer/venv_tf_29/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 862, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-05, atol=0.0001
Mismatched elements: 1024 / 2147483648 (4.77e-05%)
Max absolute difference: 2.2970564
Max relative difference: 0.
x: array([[[[ 1.605678, -0.261173, -1.222985, ..., -1.186496, -0.111071,
0.792078],
[ 0.307934, 0.016565, -0.576156, ..., -0.17745 , -0.993849,...
y: array([[[[ 0. , 0. , 0. , ..., 0. , 0. ,
0. ],
[ 0. , 0. , 0. , ..., 0. , 0. ,...
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 24 (12 by maintainers)
https://github.com/tensorflow/tensorflow/commit/cb2b5456ee5f74d5bacd96672db9251d519e1f02
I did a little extra testing, and it turns out the threshold for “large tensors” is wrong. Thanks for finding this! I will land a fix soon. In the meantime, it will work with tensors > 2**32 elements.
I would think you do not need the flag at this point since the kernels are enabled by default. I will add a test for mul and see if I can reproduce this. What dtype is this?