tensorflow: TF2.9 perf is slower
Issue Type
Performance
Source
source
Tensorflow Version
tf2.9
Custom Code
No
OS Platform and Distribution
No response
Mobile device
No response
Python version
No response
Bazel version
No response
GCC/Compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
A100
Current Behaviour?
We migrated from tf2.4 to tf2.9 and observed that the training speed of some models decreased by ~20%.
On tf2.4, it takes ~30 mins after starting the job before the 1st batch is processed. Training speed then increases and becomes stable.
On tf2.9, it takes ~5 mins after starting the job before the 1st batch is processed. Training speed does not increase.
Q1: Can we use tf2.4 to train the model and tf2.9 for inference? Are there any potential issues?
Q2: How can we find the root cause of tf2.9 training slowness?
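One possible first step toward Q2 is to separate startup cost from steady-state throughput, so that the same numbers can be collected on both versions. The sketch below is only an illustration: `measure_throughput` and `step_fn` are hypothetical names, not part of TensorFlow, and `step_fn` is assumed to run exactly one training step and to block until that step has finished (for example by fetching the loss value).

```python
import time
from statistics import median


def measure_throughput(step_fn, num_steps=50, warmup_steps=10):
    """Time repeated calls to step_fn (hypothetical one-step callable).

    Returns (first_step_seconds, steady_state_steps_per_sec).
    The first `warmup_steps` calls are excluded from the steady-state rate.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    durations = []
    for _ in range(warmup_steps + num_steps):
        start = time.perf_counter()
        step_fn()  # assumed to block until the step is complete
        durations.append(time.perf_counter() - start)
    steady = durations[warmup_steps:]
    # The median is less sensitive to occasional slow steps than the mean.
    return durations[0], 1.0 / median(steady)


if __name__ == "__main__":
    # Stand-in for a real training step; replace with the model's own step.
    def fake_step():
        time.sleep(0.01)

    first, rate = measure_throughput(fake_step, num_steps=20, warmup_steps=5)
    print(f"first step: {first:.3f} s, steady state: {rate:.1f} steps/sec")
```

Running the same harness under tf2.4 and tf2.9 with an identical model and input pipeline would show whether the gap is in the first step, in the steady state, or both.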
Standalone code to reproduce the issue
Can't share the source code.
Relevant log output
No response
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (3 by maintainers)
@SuryanarayanaY OS: Ubuntu 20.04, Bazel: 5.2.0, CUDA: 11.2
Epoch time: tf2.4: 70 hours, tf2.9: 90 hours
Can we schedule an online debug session if that helps?
Fast R-CNN is a similar model, but our model is more complicated. tf2.4: 3.5 batches/sec. tf2.9: 2.5 batches/sec.
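For reference, the two measurements reported in this thread can be turned into a relative throughput drop. This is plain arithmetic on the numbers quoted above; `throughput_drop` is a hypothetical helper, and the thread does not say whether both measurements come from the same run.

```python
def throughput_drop(old_rate, new_rate):
    """Fractional decrease in throughput when going from old_rate to new_rate."""
    return 1.0 - new_rate / old_rate


# Batches per second: 3.5 on tf2.4 vs 2.5 on tf2.9.
drop_from_rate = throughput_drop(3.5, 2.5)

# Epoch time: 70 hours on tf2.4 vs 90 hours on tf2.9.
# Throughput is proportional to 1 / epoch_time.
drop_from_epoch = throughput_drop(1 / 70, 1 / 90)

print(f"from batches/sec: {drop_from_rate:.1%}")  # 28.6%
print(f"from epoch time:  {drop_from_epoch:.1%}")  # 22.2%
```

The epoch-time figure is close to the ~20% stated in the report, while the batches/sec figure implies a somewhat larger drop.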