tensorflow: tf-gpu==1.13.1: 35% smaller maximum batch size before OOM vs tf-gpu==1.11.0
System information
- Windows 7
- TensorFlow installed from (source or binary): pip
- TensorFlow version (use command below): 1.11.0 , 1.13.1
- Python version: 3.6.5
- CUDA/cuDNN version: 9/7.1.4 , 10/7.4.1
- GPU model and memory: GTX 1060 6GB
Describe the current behavior
I have a standard autoencoder (AE) network with a pixel shuffler layer.
On tf 1.11.0 with CUDA 9, the maximum batch size my GTX 1060 6GB can handle is 132.
After upgrading to tf 1.13.1 with CUDA 10, TensorFlow can no longer handle the same batch size: it produces an OOM error, and the maximum for my card is now 90.
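Below is a minimal sketch of how the maximum batch size could be probed automatically on a given install. build_model() is a hypothetical helper that wraps the model construction from the repro script further down; the graph is reset with clear_session() after each attempt, although running every attempt in a fresh process is more reliable because graph state after an OOM is not guaranteed to be clean.

import numpy as np
import tensorflow as tf

def fits(build_model, batch_size, input_shape=(128, 128, 3)):
    # Try a single training step at the given batch size; report whether it OOMs.
    tf.keras.backend.clear_session()
    model = build_model()
    data = np.zeros((batch_size,) + input_shape, dtype=np.float32)
    try:
        model.train_on_batch([data], [data])
        return True
    except tf.errors.ResourceExhaustedError:
        return False

def max_batch_size(build_model, lo=1, hi=256):
    # Binary search for the largest batch size that still fits in GPU memory.
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if fits(build_model, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo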
Describe the expected behavior
Memory capacity should not degrade when upgrading TensorFlow: a batch size that fits on 1.11.0 should also fit on 1.13.1.
Code to reproduce the issue
import numpy as np
import tensorflow as tf

keras = tf.keras
KL = keras.layers
K = keras.backend

bgr_shape = (128, 128, 3)
#batch_size = 132  # max - tf 1.11.0, CUDA 9
batch_size = 86    # max - tf 1.13.1, CUDA 10

class PixelShuffler(keras.layers.Layer):
    # Subpixel upscaling layer: rearranges channels into a 2x larger spatial grid
    # (equivalent to depth_to_space for NHWC inputs).
    def __init__(self, size=(2, 2), data_format=None, **kwargs):
        super(PixelShuffler, self).__init__(**kwargs)
        self.size = size

    def call(self, inputs):
        input_shape = K.int_shape(inputs)
        if len(input_shape) != 4:
            raise ValueError('Inputs should have rank 4; '
                             'Received input shape: ' + str(input_shape))

        batch_size, h, w, c = input_shape
        if batch_size is None:
            batch_size = -1
        rh, rw = self.size
        oh, ow = h * rh, w * rw
        oc = c // (rh * rw)

        out = K.reshape(inputs, (batch_size, h, w, rh, rw, oc))
        out = K.permute_dimensions(out, (0, 1, 3, 2, 4, 5))
        out = K.reshape(out, (batch_size, oh, ow, oc))
        return out

    def compute_output_shape(self, input_shape):
        if len(input_shape) != 4:
            raise ValueError('Inputs should have rank 4; '
                             'Received input shape: ' + str(input_shape))

        height = input_shape[1] * self.size[0] if input_shape[1] is not None else None
        width = input_shape[2] * self.size[1] if input_shape[2] is not None else None
        channels = input_shape[3] // self.size[0] // self.size[1]
        if channels * self.size[0] * self.size[1] != input_shape[3]:
            raise ValueError('channels of input and size are incompatible')

        return (input_shape[0], height, width, channels)

    def get_config(self):
        config = {'size': self.size}
        base_config = super(PixelShuffler, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

def upscale(dim):
    # Conv2D to dim * 4 channels, then pixel-shuffle down to dim channels at 2x resolution.
    def func(x):
        return PixelShuffler()(KL.Conv2D(dim * 4, kernel_size=3, strides=1, padding='same')(x))
    return func

inp = KL.Input(bgr_shape)
x = inp
x = KL.Conv2D(128, 5, strides=2, padding='same')(x)
x = KL.Conv2D(256, 5, strides=2, padding='same')(x)
x = KL.Conv2D(512, 5, strides=2, padding='same')(x)
x = KL.Conv2D(1024, 5, strides=2, padding='same')(x)
x = KL.Dense(1024)(KL.Flatten()(x))
x = KL.Dense(8 * 8 * 1024)(x)
x = KL.Reshape((8, 8, 1024))(x)
x = upscale(512)(x)
x = upscale(256)(x)
x = upscale(128)(x)
x = upscale(64)(x)
x = KL.Conv2D(3, 5, strides=1, padding='same')(x)

model = keras.models.Model([inp], [x])
model.compile(optimizer=keras.optimizers.Adam(lr=5e-5, beta_1=0.5, beta_2=0.999), loss='mae')

training_data = np.zeros((batch_size, 128, 128, 3))
loss = model.train_on_batch([training_data], [training_data])

print("FINE")
Other info / logs
1] 1 Chunks of size 12032 totalling 11.8KiB
2019-02-28 19:45:23.516100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 19200 totalling 75.0KiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 38400 totalling 150.0KiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 262144 totalling 1.00MiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 368640 totalling 360.0KiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 1179648 totalling 4.50MiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 5 Chunks of size 3276800 totalling 15.63MiB
2019-02-28 19:45:23.517100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 4 Chunks of size 4718592 totalling 18.00MiB
2019-02-28 19:45:23.520100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 13107200 totalling 37.50MiB
2019-02-28 19:45:23.520100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 17028352 totalling 16.24MiB
2019-02-28 19:45:23.521100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 17694720 totalling 16.88MiB
2019-02-28 19:45:23.521100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 17694976 totalling 16.88MiB
2019-02-28 19:45:23.521100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 3 Chunks of size 18874368 totalling 54.00MiB
2019-02-28 19:45:23.521100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 23592960 totalling 22.50MiB
2019-02-28 19:45:23.521100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 5 Chunks of size 52428800 totalling 250.00MiB
2019-02-28 19:45:23.529100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 5 Chunks of size 75497472 totalling 360.00MiB
2019-02-28 19:45:23.529100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 94371840 totalling 90.00MiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 100362240 totalling 95.71MiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 2 Chunks of size 188743680 totalling 360.00MiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 194688000 totalling 185.67MiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 12 Chunks of size 268435456 totalling 3.00GiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 552317184 totalling 526.73MiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 5.02GiB
2019-02-28 19:45:23.530100: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit: 5838622720
InUse: 5393793792
MaxInUse: 5708028928
NumAllocs: 434
MaxAllocSize: 1363673088
2019-02-28 19:45:23.531100: W tensorflow/core/common_runtime/bfc_allocator.cc:271] *****************************************************__**********_*********************************x
2019-02-28 19:45:23.531100: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at conv_grad_input_ops.cc:1054 : Resource exhausted: OOM when allocating tensor with shape[90,128,64,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "D:\DeepFaceLab\_internal\bin\DeepFaceLab\test.py", line 87, in <module>
    loss = model.train_on_batch( [training_data], [training_data] )
  File "D:\DeepFaceLab\_internal\bin\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1188, in train_on_batch
    outputs = self.train_function(ins)  # pylint: disable=not-callable
  File "D:\DeepFaceLab\_internal\bin\lib\site-packages\tensorflow\python\keras\backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "D:\DeepFaceLab\_internal\bin\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
    run_metadata_ptr)
  File "D:\DeepFaceLab\_internal\bin\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[90,128,64,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[{{node training/Adam/gradients/conv2d_1/Conv2D_grad/Conv2DBackpropInput}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
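The hint in the last log line can be acted on from tf.keras as well. In the TF 1.x tf.keras backend, extra compile() kwargs such as options and run_metadata appear to be forwarded to the underlying session call (the backend.py frame in the traceback is where run_metadata is used), so the OOM error then lists the live tensors. A hedged sketch, reusing model and training_data from the repro script:

import tensorflow as tf

run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
run_metadata = tf.RunMetadata()

# Recompile so the train function is rebuilt with the extra session kwargs.
model.compile(optimizer=tf.keras.optimizers.Adam(lr=5e-5, beta_1=0.5, beta_2=0.999),
              loss='mae',
              options=run_options,
              run_metadata=run_metadata)
model.train_on_batch([training_data], [training_data])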
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 5
- Comments: 19 (7 by maintainers)
@tatianashp are you kidding? A significant performance regression has been reproduced, and there is no fix or official comment about it.
Sorry for the long delay in updating this issue.
This looks like a bug introduced between cuDNN v7.2 and v7.4. We will report this to NVIDIA and update this issue after that.
Thanks for providing a small example to reproduce the issue.
cuDNN 7.6.0: same problem.
Did cuDNN 7.6.0 solve this issue?
Compared to 1.12, I’m finding that the exact same code uses about 10% extra GPU memory as per tf.profiler. Specifically, I get about 6400MB usage total (i.e., for _TFProfRoot) on 1.12.0 but about 7100MB for 1.13.1. With a smaller version of the same model, the proportional difference is about the same—about 3450MB for 1.13.1 and about 3100MB for 1.12.0.
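For reproducibility, a sketch of how the _TFProfRoot numbers above can be collected in TF 1.x: capture RunMetadata for one training step with full tracing, then ask tf.profiler for the per-scope time and memory breakdown. model and training_data are assumed to be the ones from the repro script; the exact totals will depend on the build.

import tensorflow as tf
K = tf.keras.backend

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

model.compile(optimizer=tf.keras.optimizers.Adam(lr=5e-5), loss='mae',
              options=run_options, run_metadata=run_metadata)
model.train_on_batch([training_data], [training_data])

# Prints a tree rooted at _TFProfRoot with per-scope memory and timing.
opts = tf.profiler.ProfileOptionBuilder.time_and_memory()
tf.profiler.profile(K.get_session().graph, run_meta=run_metadata,
                    cmd='scope', options=opts)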