tensorflow: TF2.9 perf is slower

Issue Type

Performance

Source

source

Tensorflow Version

tf2.9

Custom Code

No

OS Platform and Distribution

No response

Mobile device

No response

Python version

No response

Bazel version

No response

GCC/Compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

A100

Current Behaviour?

We migrated from TF 2.4 to TF 2.9 and observed that the training speed of some models decreased by ~20%.

On TF 2.4, it takes ~30 minutes after the job starts before the first batch is processed. Training speed then increases and eventually becomes stable.
On TF 2.9, it takes ~5 minutes after the job starts before the first batch is processed. Training speed does not increase afterwards.

Q1: Can we train the model with TF 2.4 and use TF 2.9 for inference? Are there any potential issues?
Q2: How can we find the root cause of the training slowdown on TF 2.9?
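
As a possible first step for Q2, the warm-up behaviour described above (speed rising and then stabilizing on TF 2.4, staying flat on TF 2.9) could be recorded as throughput per window of steps, so that the two versions are compared at steady state rather than by a single average. The following is only a minimal sketch: `windowed_throughput` and `train_step` are hypothetical names, not part of TensorFlow, and `train_step` is assumed to block until the step has actually finished (for example by reading the loss value back to the host).

```python
import time
from typing import Callable, List


def windowed_throughput(
    train_step: Callable[[], None],
    num_steps: int,
    window: int,
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Run train_step num_steps times and return batches/sec per full window.

    train_step is a hypothetical zero-argument callable wrapping one
    training step; it must not return before the step has completed.
    """
    rates = []
    start = clock()
    for step in range(1, num_steps + 1):
        train_step()
        if step % window == 0:
            now = clock()
            elapsed = now - start
            # Guard against a zero interval on very coarse clocks.
            rates.append(window / elapsed if elapsed > 0 else float("inf"))
            start = now
    return rates


if __name__ == "__main__":
    # Stand-in workload; replace with the real training step.
    rates = windowed_throughput(lambda: time.sleep(0.001), num_steps=200, window=50)
    for i, rate in enumerate(rates):
        print(f"window {i}: {rate:.1f} batches/sec")
```

Running the same script under both versions and comparing the later windows would show whether TF 2.9 is slower only because it skips the long warm-up, or whether its steady-state step time is also worse.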

Standalone code to reproduce the issue

Can't share the source code.

Relevant log output

No response

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 15 (3 by maintainers)

Most upvoted comments

@SuryanarayanaY OS: Ubuntu 20.04, Bazel: 5.2.0, CUDA: 11.2

Epoch time: 70 hours on TF 2.4 vs. 90 hours on TF 2.9.

Can we schedule an online debug session, if that helps?

Fast R-CNN is a similar model, but our model is more complicated. Throughput: 3.5 batches/sec on TF 2.4 vs. 2.5 batches/sec on TF 2.9.
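
For reference, the figures quoted in this thread can be turned into relative changes with a few lines of arithmetic. This is just a sketch using the numbers reported above; `pct_change` is a helper name introduced here for illustration, and the conversion from epoch time to throughput assumes the same number of batches per epoch on both versions.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Reported throughput in batches/sec: TF 2.4 -> TF 2.9.
throughput_change = pct_change(3.5, 2.5)

# Reported epoch time in hours: TF 2.4 -> TF 2.9.
epoch_time_change = pct_change(70.0, 90.0)

# Throughput change implied by the epoch times (rate is proportional to 1/time).
implied_throughput_change = pct_change(1.0 / 70.0, 1.0 / 90.0)

print(round(throughput_change, 1))          # -28.6
print(round(epoch_time_change, 1))          # 28.6
print(round(implied_throughput_change, 1))  # -22.2
```

So the batches/sec figures correspond to a drop of about 29%, while the epoch times imply a throughput drop of about 22%, which is closer to the ~20% stated in the issue description.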