tensorflow: TF 2.3 training slowed down by 15% compared to 2.2
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 2.3.0, 2.4.0.dev20200728
- Python version: 3.7.8
- CUDA/cuDNN version: 10.1 / 7.6.5.32
- GPU model and memory: NVIDIA V100 on a GCP node with 12 vCPUs and 40 GB of memory
Describe the current behavior
When upgrading from TensorFlow 2.2.0 to 2.3.0 we observed a 15-18% slowdown in training speed for our workloads. Unfortunately I wasn't able to find an easy-to-reproduce example before the stable release was cut, but below is a code example that illustrates the performance degradation.
When running the training script on a single NVIDIA V100, there is a roughly 15% performance loss compared to 2.2 that is still noticeable in the latest nightly:
version | epoch time | step time | GPU idle time |
---|---|---|---|
2.2.0 | 34 s | 124.3 ms | 19.7 ms (15.6 %) |
2.3.0 | 39 s | 141.9 ms | 37.2 ms (26.1 %) |
2.4.0.dev20200728 | 38s | 136.2 ms | 31.6 ms (23.2 %) |
On Device: total self-time (grouped by type): profiler screenshots for 2.2.0, 2.3.0, and 2.4.0.dev20200728 (images not preserved).
The example uses automatic mixed precision, but the slowdown can also be observed when running in float32 or with multi-GPU training. Looking at the generated execution profile, the slowdown can be explained by increased GPU idle time. Since the training data is cached in memory there should be no I/O bottleneck, so I am not sure whether this performance regression is caused by tf.data or by the runtime itself.
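For reference, a minimal sketch of what the float32 variant looks like (this is an assumption about the exact setup; it simply compiles the model with a plain optimizer instead of the mixed-precision graph rewrite used in the reproduction script below):

```python
import tensorflow as tf

# Float32 variant of the compile step: no mixed-precision graph rewrite,
# so all ops run in float32. The rest of the reproduction script is unchanged.
model = tf.keras.applications.ResNet50(weights=None)
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="sparse_categorical_crossentropy",
)
```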
Describe the expected behavior
TensorFlow 2.3 should show equally fast training performance compared to 2.2.
Standalone code to reproduce the issue
```python
import tensorflow as tf
import tensorflow_datasets as tfds

batch_size = 64


def _decode_and_center_crop(image_bytes):
    """Crops to center of image with padding then scales image_size."""
    shape = tf.image.extract_jpeg_shape(image_bytes)
    image_height, image_width, image_size = shape[0], shape[1], 224
    padded_center_crop_size = tf.cast(
        (
            (image_size / (image_size + 32))
            * tf.cast(tf.minimum(image_height, image_width), tf.float32)
        ),
        tf.int32,
    )
    offset_height = ((image_height - padded_center_crop_size) + 1) // 2
    offset_width = ((image_width - padded_center_crop_size) + 1) // 2
    crop_window = tf.stack(
        [offset_height, offset_width, padded_center_crop_size, padded_center_crop_size]
    )
    image = tf.image.decode_and_crop_jpeg(image_bytes, crop_window, channels=3)
    return tf.image.resize(image, [image_size, image_size], method="bicubic")


def preprocessing(data):
    return (
        tf.cast(_decode_and_center_crop(data["image"]), tf.float32),
        data["label"],
    )


dataset = tfds.load(
    "imagenette", decoders={"image": tfds.decode.SkipDecoding()}, split="train",
)
dataset = (
    dataset.cache()
    .repeat(2)  # Artificially increase time per epoch to make it easier to measure
    .map(preprocessing, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .batch(batch_size)
    .prefetch(1)
)

with tf.distribute.MirroredStrategy().scope():
    model = tf.keras.applications.ResNet50(weights=None)
    model.compile(
        optimizer=tf.train.experimental.enable_mixed_precision_graph_rewrite(
            tf.keras.optimizers.Adam(), loss_scale="dynamic"
        ),
        loss="sparse_categorical_crossentropy",
    )

tb_cbk = tf.keras.callbacks.TensorBoard(f"logs/{tf.__version__}", profile_batch=300)
model.fit(dataset, verbose=2, epochs=3, callbacks=[tb_cbk])
```
Other info / logs
TensorBoard profiles for the runs mentioned above are available at tb-profile.zip
@mihaimaruseac @jsimsa @guptapriya do you mind taking a look at this?
About this issue
- State: open
- Created 4 years ago
- Comments: 33 (32 by maintainers)
Yes, we found some unintended host-to-device copies caused by a previous change, which I am trying to eliminate.
Thanks for the update. I will continue using `drop_remainder=True` for now as a workaround. I hope there will be a fix for it soon, since I think `MirroredStrategy` together with the default batching is a quite common use case for people running on GPUs.
It would be awesome if it is possible to add this example (or a similar one using `MirroredStrategy` and a large cached dataset) to your internal regression testing suite. In a lot of the TF version upgrades I have done in the past I discovered some sort of memory issue or performance regression that was reproducible with code very similar to the example mentioned above (see #36240, #38617, #38655). It would be excellent if issues like that were caught automatically so they don't make it into the stable releases.
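For anyone hitting the same regression, a minimal sketch of the workaround applied to the reproduction script above (only the `batch` call changes; `dataset`, `preprocessing`, and `batch_size` are the names from that script):

```python
# Workaround discussed in this thread: drop the final partial batch so that
# every batch has the same static shape.
dataset = (
    dataset.cache()
    .repeat(2)
    .map(preprocessing, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .batch(batch_size, drop_remainder=True)  # <-- the only change
    .prefetch(1)
)
```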
@lgeiger: Yes, this is currently planned for the 2.4 release. However, the fix is still being worked on.
Hey @lgeiger, I believe the fix did not make it into 2.4, unfortunately. @rohan100jain @zongweiz @goldiegadde please correct me if that is not the case.
I'm sorry, we looked into this issue and there isn't really any easy way of fixing it without rolling back a change (https://github.com/tensorflow/tensorflow/commit/f0d0485b0de521ff273c8d91acbb2fbabe57baa7) that enhances the dtype coverage of our GPU ops and improves the consistency of TensorFlow in general. This issue has exposed some problems with our device placement that we need to fix; we are planning to work on this and will have an RFC for it. I therefore recommend that you continue to use the `drop_remainder=True` workaround for now.
@jaingaurav is looking into a potential fix I believe.
Thanks @lgeiger for the update, that helps a lot. I can confirm that we have verified the regression (it looks like it happened sometime back in April). We will update here when we have more information on the root cause and fixes.
Thanks for the report @lgeiger. We will look more into this. The code sample uses MirroredStrategy, so I wanted to clarify: when you said this regression is noticed on a single GPU as well, is that with or without MirroredStrategy? If the latter, we would start investigating that case first (i.e. 1 GPU, no distribution, no mixed precision).
I was able to reproduce the issue. Here is the gist…