tensorflow: TF 2.3 training slowed down by 15% compared to 2.2
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 2.3.0, 2.4.0.dev20200728
- Python version: 3.7.8
- CUDA/cuDNN version: 10.1 / 7.6.5.32
- GPU model and memory: NVIDIA V100 on a GCP node with 12 vCPUs and 40 GB of memory
Describe the current behavior
When upgrading from TensorFlow 2.2.0 to 2.3.0 we observed a 15-18% slowdown in training speed for our workloads. Unfortunately I wasn't able to find an easy-to-reproduce example before the stable release was cut, but below is a code example that illustrates the performance degradation.
When running the training script on a single NVIDIA V100, there is a roughly 15% performance loss compared to 2.2 that is still noticeable in the latest nightly:
version | epoch time | step time | GPU idle time |
---|---|---|---|
2.2.0 | 34 s | 124.3 ms | 19.7 ms (15.6 %) |
2.3.0 | 39 s | 141.9 ms | 37.2 ms (26.1 %) |
2.4.0.dev20200728 | 38s | 136.2 ms | 31.6 ms (23.2 %) |
On Device: total self-time (grouped by type): profiler screenshots for 2.2.0, 2.3.0, and 2.4.0.dev20200728 (images not preserved).
The example uses automatic mixed precision, but the slowdown can also be observed when running in float32 or with multi-GPU training. Looking at the generated execution profile, the slowdown can be explained by increased GPU idle time. Since the training data is cached in memory there should be no I/O bottleneck, so I am not sure whether this performance regression is caused by tf.data or by the runtime itself.
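For reference, a minimal sketch of what the float32 variant looks like (this is an assumption about the exact setup; it simply compiles the model with a plain optimizer instead of the mixed-precision graph rewrite used in the reproduction script below):

```python
import tensorflow as tf

# Float32 variant of the compile step: no mixed-precision graph rewrite,
# so all ops run in float32. The rest of the reproduction script is unchanged.
model = tf.keras.applications.ResNet50(weights=None)
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="sparse_categorical_crossentropy",
)
```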
Describe the expected behavior
TensorFlow 2.3 should show equally fast training performance compared to 2.2.
Standalone code to reproduce the issue
```python
import tensorflow as tf
import tensorflow_datasets as tfds

batch_size = 64


def _decode_and_center_crop(image_bytes):
    """Crops to center of image with padding then scales image_size."""
    shape = tf.image.extract_jpeg_shape(image_bytes)
    image_height, image_width, image_size = shape[0], shape[1], 224
    padded_center_crop_size = tf.cast(
        (
            (image_size / (image_size + 32))
            * tf.cast(tf.minimum(image_height, image_width), tf.float32)
        ),
        tf.int32,
    )
    offset_height = ((image_height - padded_center_crop_size) + 1) // 2
    offset_width = ((image_width - padded_center_crop_size) + 1) // 2
    crop_window = tf.stack(
        [offset_height, offset_width, padded_center_crop_size, padded_center_crop_size]
    )
    image = tf.image.decode_and_crop_jpeg(image_bytes, crop_window, channels=3)
    return tf.image.resize(image, [image_size, image_size], method="bicubic")


def preprocessing(data):
    return (
        tf.cast(_decode_and_center_crop(data["image"]), tf.float32),
        data["label"],
    )


dataset = tfds.load(
    "imagenette", decoders={"image": tfds.decode.SkipDecoding()}, split="train",
)
dataset = (
    dataset.cache()
    .repeat(2)  # Artificially increase time per epoch to make it easier to measure
    .map(preprocessing, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .batch(batch_size)
    .prefetch(1)
)

with tf.distribute.MirroredStrategy().scope():
    model = tf.keras.applications.ResNet50(weights=None)
    model.compile(
        optimizer=tf.train.experimental.enable_mixed_precision_graph_rewrite(
            tf.keras.optimizers.Adam(), loss_scale="dynamic"
        ),
        loss="sparse_categorical_crossentropy",
    )

tb_cbk = tf.keras.callbacks.TensorBoard(f"logs/{tf.__version__}", profile_batch=300)
model.fit(dataset, verbose=2, epochs=3, callbacks=[tb_cbk])
```
Other info / logs
TensorBoard profiles for the runs mentioned above are available at tb-profile.zip
@mihaimaruseac @jsimsa @guptapriya do you mind taking a look at this?
About this issue
- State: open
- Created 4 years ago
- Comments: 33 (32 by maintainers)
Yes, we found some unintended host-to-device copies caused by a previous change, which I am trying to eliminate.
Thanks for the update. I will continue using `drop_remainder=True` for now as a workaround. I hope there will be a fix for it soon, since I think `MirroredStrategy` together with the default batching is a quite common use case for people running on GPUs.
It would be awesome if it is possible to add this example (or a similar one using `MirroredStrategy` and a large cached dataset) to your internal regression testing suite. In a lot of the TF version upgrades I have done in the past I discovered some sort of memory issue or performance regression that was reproducible with code very similar to the example mentioned above (see #36240, #38617, #38655). It would be excellent if issues like that were caught automatically so they don't make it into the stable releases.
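For anyone hitting the same regression, a minimal sketch of the workaround applied to the reproduction script above (only the `batch` call changes; `dataset`, `preprocessing`, and `batch_size` are the names from that script):

```python
# Workaround discussed in this thread: drop the final partial batch so that
# every batch has the same static shape.
dataset = (
    dataset.cache()
    .repeat(2)
    .map(preprocessing, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .batch(batch_size, drop_remainder=True)  # <-- the only change
    .prefetch(1)
)
```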
@lgeiger: Yes, this is currently planned for the 2.4 release. However, the fix is still being worked on.
Hey @lgeiger, I believe the fix did not make it into 2.4, unfortunately. @rohan100jain @zongweiz @goldiegadde please correct me if that is not the case.
I'm sorry, we looked into this issue and there isn't really any easy way of fixing it without rolling back a change (https://github.com/tensorflow/tensorflow/commit/f0d0485b0de521ff273c8d91acbb2fbabe57baa7) that enhances the dtype coverage of our GPU ops and improves the consistency of TensorFlow in general. This issue has exposed some problems with our device placement that we need to fix; we are planning to work on this and will have an RFC for it. I therefore recommend that you continue to use the `drop_remainder=True` workaround for now.
@jaingaurav is looking into a potential fix I believe.
Thanks @lgeiger for the update, that helps a lot. I can confirm that we have verified the regression (it looks like it happened sometime back in April). We will update here when we have more information on the root cause and fixes.
Thanks for the report @lgeiger. We will look more into this. The code sample uses MirroredStrategy, so I wanted to clarify: when you said this regression is noticed on a single GPU as well, is that with or without MirroredStrategy? If the latter, we would start investigating that case first (i.e. 1 GPU, no distribution, no mixed precision).
I was able to reproduce the issue. Here is the gist…