tensorflow: tf-gpu==1.13.1: 35% smaller maximum batch size before OOM vs tf-gpu==1.11.0
System information
- Windows 7
- TensorFlow installed from (source or binary): pip
- TensorFlow version (use command below): 1.11.0 , 1.13.1
- Python version: 3.6.5
- CUDA/cuDNN version: 9/7.1.4 , 10/7.4.1
- GPU model and memory: GTX 1060 6GB
Describe the current behavior
I have a standard autoencoder (AE) network with a pixel shuffler layer.
On tf 1.11.0 with CUDA 9, the maximum batch size my GTX 1060 6GB can handle is 132.
After upgrading to tf 1.13.1 with CUDA 10, TensorFlow can no longer handle the same batch size: it produces an OOM error, and the maximum for my card is now 90.
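Below is a minimal sketch of how the maximum batch size could be probed automatically on a given install. build_model() is a hypothetical helper that wraps the model construction from the repro script further down; the graph is reset with clear_session() after each attempt, although running every attempt in a fresh process is more reliable because graph state after an OOM is not guaranteed to be clean.

import numpy as np
import tensorflow as tf

def fits(build_model, batch_size, input_shape=(128, 128, 3)):
    # Try a single training step at the given batch size; report whether it OOMs.
    tf.keras.backend.clear_session()
    model = build_model()
    data = np.zeros((batch_size,) + input_shape, dtype=np.float32)
    try:
        model.train_on_batch([data], [data])
        return True
    except tf.errors.ResourceExhaustedError:
        return False

def max_batch_size(build_model, lo=1, hi=256):
    # Binary search for the largest batch size that still fits in GPU memory.
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if fits(build_model, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo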
Describe the expected behavior
Memory capacity should not degrade when upgrading TensorFlow: a batch size that fits on 1.11.0 should also fit on 1.13.1.
Code to reproduce the issue
import numpy as np
import tensorflow as tf

keras = tf.keras
KL = keras.layers
K = keras.backend

bgr_shape = (128, 128, 3)
#batch_size = 132  # max - tf 1.11.0, CUDA 9
batch_size = 86    # max - tf 1.13.1, CUDA 10

class PixelShuffler(keras.layers.Layer):
    # Subpixel upscaling layer: rearranges channels into a 2x larger spatial grid
    # (equivalent to depth_to_space for NHWC inputs).
    def __init__(self, size=(2, 2), data_format=None, **kwargs):
        super(PixelShuffler, self).__init__(**kwargs)
        self.size = size

    def call(self, inputs):
        input_shape = K.int_shape(inputs)
        if len(input_shape) != 4:
            raise ValueError('Inputs should have rank 4; '
                             'Received input shape: ' + str(input_shape))

        batch_size, h, w, c = input_shape
        if batch_size is None:
            batch_size = -1
        rh, rw = self.size
        oh, ow = h * rh, w * rw
        oc = c // (rh * rw)

        out = K.reshape(inputs, (batch_size, h, w, rh, rw, oc))
        out = K.permute_dimensions(out, (0, 1, 3, 2, 4, 5))
        out = K.reshape(out, (batch_size, oh, ow, oc))
        return out

    def compute_output_shape(self, input_shape):
        if len(input_shape) != 4:
            raise ValueError('Inputs should have rank 4; '
                             'Received input shape: ' + str(input_shape))

        height = input_shape[1] * self.size[0] if input_shape[1] is not None else None
        width = input_shape[2] * self.size[1] if input_shape[2] is not None else None
        channels = input_shape[3] // self.size[0] // self.size[1]
        if channels * self.size[0] * self.size[1] != input_shape[3]:
            raise ValueError('channels of input and size are incompatible')

        return (input_shape[0], height, width, channels)

    def get_config(self):
        config = {'size': self.size}
        base_config = super(PixelShuffler, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

def upscale(dim):
    # Conv2D to dim * 4 channels, then pixel-shuffle down to dim channels at 2x resolution.
    def func(x):
        return PixelShuffler()(KL.Conv2D(dim * 4, kernel_size=3, strides=1, padding='same')(x))
    return func

inp = KL.Input(bgr_shape)
x = inp
x = KL.Conv2D(128, 5, strides=2, padding='same')(x)
x = KL.Conv2D(256, 5, strides=2, padding='same')(x)
x = KL.Conv2D(512, 5, strides=2, padding='same')(x)
x = KL.Conv2D(1024, 5, strides=2, padding='same')(x)
x = KL.Dense(1024)(KL.Flatten()(x))
x = KL.Dense(8 * 8 * 1024)(x)
x = KL.Reshape((8, 8, 1024))(x)
x = upscale(512)(x)
x = upscale(256)(x)
x = upscale(128)(x)
x = upscale(64)(x)
x = KL.Conv2D(3, 5, strides=1, padding='same')(x)

model = keras.models.Model([inp], [x])
model.compile(optimizer=keras.optimizers.Adam(lr=5e-5, beta_1=0.5, beta_2=0.999), loss='mae')

training_data = np.zeros((batch_size, 128, 128, 3))
loss = model.train_on_batch([training_data], [training_data])

print("FINE")
Other info / logs
1] 1 Chunks of size 12032 totalling 11.8KiB
2019-02-28 19:45:23.516100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 19200 totalling 75.0KiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 38400 totalling 150.0KiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 262144 totalling 1.00MiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 368640 totalling 360.0KiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 1179648 totalling 4.50MiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 5 Chunks of size 3276800 totalling 15.63MiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 4718592 totalling 18.00MiB
2019-02-28 19:45:23.520100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 13107200 totalling 37.50MiB
2019-02-28 19:45:23.520100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 17028352 totalling 16.24MiB
2019-02-28 19:45:23.521100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 17694720 totalling 16.88MiB
2019-02-28 19:45:23.521100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 17694976 totalling 16.88MiB
2019-02-28 19:45:23.521100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 18874368 totalling 54.00MiB
2019-02-28 19:45:23.521100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 23592960 totalling 22.50MiB
2019-02-28 19:45:23.521100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 5 Chunks of size 52428800 totalling 250.00MiB
2019-02-28 19:45:23.529100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 5 Chunks of size 75497472 totalling 360.00MiB
2019-02-28 19:45:23.529100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 94371840 totalling 90.00MiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 100362240 totalling 95.71MiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 2 Chunks of size 188743680 totalling 360.00MiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 194688000 totalling 185.67MiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 12 Chunks of size 268435456 totalling 3.00GiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 552317184 totalling 526.73MiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 5.02GiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit: 5838622720
InUse: 5393793792
MaxInUse: 5708028928
NumAllocs: 434
MaxAllocSize: 1363673088
2019-02-28 19:45:23.531100: W tensorflow/core/common_runtime/bfc_allocator.cc:271] *****************************************************__**********_*********************************x
2019-02-28 19:45:23.531100: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at conv_grad_input_ops.cc:1054 : Resource exhausted: OOM when allocating tensor with shape[90,128,64,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "D:\DeepFaceLab\_internal\bin\DeepFaceLab\test.py", line 87, in <module>
    loss = model.train_on_batch( [training_data], [training_data] )
  File "D:\DeepFaceLab\_internal\bin\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1188, in train_on_batch
    outputs = self.train_function(ins)  # pylint: disable=not-callable
  File "D:\DeepFaceLab\_internal\bin\lib\site-packages\tensorflow\python\keras\backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "D:\DeepFaceLab\_internal\bin\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
    run_metadata_ptr)
  File "D:\DeepFaceLab\_internal\bin\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[90,128,64,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[{{node training/Adam/gradients/conv2d_1/Conv2D_grad/Conv2DBackpropInput}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
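The hint in the last log line can be acted on from tf.keras as well. In the TF 1.x tf.keras backend, extra compile() kwargs such as options and run_metadata appear to be forwarded to the underlying session call (the backend.py frame in the traceback is where run_metadata is used), so the OOM error then lists the live tensors. A hedged sketch, reusing model and training_data from the repro script:

import tensorflow as tf

run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
run_metadata = tf.RunMetadata()

# Recompile so the train function is rebuilt with the extra session kwargs.
model.compile(optimizer=tf.keras.optimizers.Adam(lr=5e-5, beta_1=0.5, beta_2=0.999),
              loss='mae',
              options=run_options,
              run_metadata=run_metadata)
model.train_on_batch([training_data], [training_data])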
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 5
- Comments: 19 (7 by maintainers)
@tatianashp are you kidding? A significant performance regression has been reproduced, and there is no fix or official comment about it.
Sorry for the long delay in updating this issue.
This looks like a bug introduced between cuDNN v7.2 and v7.4. We will report this to NVIDIA and update this issue after that.
Thanks for providing a small example to reproduce the issue.
cuDNN 7.6.0: same problem.
Did cuDNN 7.6.0 solve this issue?
Compared to 1.12, I’m finding that the exact same code uses about 10% extra GPU memory as per tf.profiler. Specifically, I get about 6400MB usage total (i.e., for _TFProfRoot) on 1.12.0 but about 7100MB for 1.13.1. With a smaller version of the same model, the proportional difference is about the same—about 3450MB for 1.13.1 and about 3100MB for 1.12.0.
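For reproducibility, a sketch of how the _TFProfRoot numbers above can be collected in TF 1.x: capture RunMetadata for one training step with full tracing, then ask tf.profiler for the per-scope time and memory breakdown. model and training_data are assumed to be the ones from the repro script; the exact totals will depend on the build.

import tensorflow as tf
K = tf.keras.backend

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

model.compile(optimizer=tf.keras.optimizers.Adam(lr=5e-5), loss='mae',
              options=run_options, run_metadata=run_metadata)
model.train_on_batch([training_data], [training_data])

# Prints a tree rooted at _TFProfRoot with per-scope memory and timing.
opts = tf.profiler.ProfileOptionBuilder.time_and_memory()
tf.profiler.profile(K.get_session().graph, run_meta=run_metadata,
                    cmd='scope', options=opts)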