tensorflow: Upgrading from TF 2.1 to 2.2 gives 12% slowdown and 23% memory increase

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution: Ubuntu 18.04
  • TensorFlow version (use command below): 2.1.0, 2.2.0, tf-nightly
  • Python version: 3.6.9
  • CUDA/cuDNN version: 10.1
  • GPU model and memory: 8xTesla V100, 32GB each

Describe the current behavior

I’m running language modeling experiments with ALBERT, and GPU memory is at a premium due to the large batch sizes necessary. Upgrading from TF 2.1.0 to 2.2.0, I experienced OOM errors, so I ran a few benchmarks:

                     TF nightly (May 8)   TF 2.2   TF 2.1
    Iterations/Sec   1.63                 1.64     1.86
    GPU Memory       26 GB                25 GB    21 GB

That’s a combination of a 12% slowdown in speed and a 23% memory increase. I cannot upgrade until performance is matched. Are there any new experimental options or changes I should be aware of that caused this massive performance hit? I’m using tf.function, XLA, and AMP.

It seems that fewer and fewer ops are converted to mixed-precision as we progress from TF 2.1->2.2->nightly. Is that related, and how can I restore the original behavior?
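For reference, the three knobs in question are typically wired up roughly like this (a minimal sketch; the Adam optimizer is a placeholder, and the graph-rewrite call is only one of the two ways AMP can be enabled in this version range):

    import tensorflow as tf

    # Global XLA JIT for tf.function graphs; tf.function(..., experimental_compile=True)
    # is the per-function alternative in TF 2.1/2.2.
    tf.config.optimizer.set_jit(True)

    # One of the two AMP routes: the grappler auto-mixed-precision graph rewrite.
    # Its log line is what reports how many ops were converted to float16.
    opt = tf.keras.optimizers.Adam()  # placeholder optimizer
    opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(
        opt, loss_scale="dynamic")

    # The other route is the Keras policy API:
    # tf.keras.mixed_precision.experimental.set_policy("mixed_float16")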

Poring over the release notes, the only thing that sticks out is: tf.constant always creates CPU tensors irrespective of the current device context.
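A quick way to observe that change in isolation (assuming at least one GPU is visible; the commented device strings are what the release note implies, not something I have verified on every version):

    import tensorflow as tf

    tf.debugging.set_log_device_placement(True)

    with tf.device("/GPU:0"):
        c = tf.constant(1.0)
        print(c.device)        # per the release note: .../device:CPU:0 even inside the GPU scope
        print((c + c).device)  # the add itself should still be placed on the GPU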

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 1
  • Comments: 24 (14 by maintainers)

Most upvoted comments

TF 2.3 still has the same speed and memory regressions as 2.2. Any update @sanjoy @reedwm @mihaimaruseac?

So, I will describe my experiments briefly, but the gist is that with tensorflow==2.3.0rc2 (as well as with the nightly of July 25th, 2020) inference works as it did/does with TF 2.1, and the problem is NOT due to the scipy version.

Detailed Experiments

Now on to the detailed experiments, which all start with a fresh miniconda install and the creation of virtual environments for the setups. In all cases, I first do:

    conda install jupyterlab pandas numpy scipy scikit-learn matplotlib seaborn

First setup (TF-2.2):

Here I install TF as it comes with conda:

    conda install tensorflow-gpu

When I train and test my model, I experience both the slow start of training (only when using the GPU) and the inability to carry out the prediction step on the test data (with the GPU as well as with the CPU only).

Second setup (TF-2.1):

Here I install a downgraded TF version via conda:

    conda install tensorflow-gpu==2.1.0

When I train and test my model, I experience a moderately slow start of training (only when using the GPU), but everything else works just fine.

Third setup (TF-2.3rc2 / TF-nightly):

Here I first install TF as usual (to obtain cudnn, cupti, and cudatoolkit), only to remove it and replace it with the pip version. Also, I downgrade scipy from 1.5.0 to 1.4.1 in order not to get a warning from the pip install:

    conda install tensorflow-gpu
    conda uninstall --force tensorflow tensorflow-base tensorflow-gpu scipy
    pip install scipy==1.4.1 tensorflow-gpu==2.3.0rc2

When I train and test my model, I experience a very slow start of training (only when using GPU), but everything else works just fine. The same is true if I replace tensorflow-gpu==2.3.0rc2 with tf-nightly-gpu.

Additional Information on the slow start of training

So, the slow start of training does NOT happen if I use the CPU only, but it does happen when using the GPU (it seems to have been reported before, with the error message reported here). It should be noted that with TF 2.1, the error message just reads

    Not found: ./bin/ptxas not found
    Relying on driver to perform ptx compilation. This message will be only logged once.

and then TF more or less just gets on with its business (the first epoch takes ~7s, so there is a moderate holdup, but this is normal and also happens in further reduced form when using the CPU only), while for all later versions there is a long holdup before training starts (the first epoch takes between 35s and 71s).
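As an aside, XLA can also be pointed explicitly at a CUDA installation that contains bin/ptxas via XLA_FLAGS; I have not checked whether this avoids the long holdup, and /usr/local/cuda is only an example path:

    import os

    # Must be set before TensorFlow is imported; the path is just an example of
    # where a system-wide CUDA toolkit is often installed.
    os.environ["XLA_FLAGS"] = "--xla_gpu_cuda_data_dir=/usr/local/cuda"

    import tensorflow as tf  # noqa: E402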

Warning: You might want to refrain from trying the next part …

Now, it seems as if one can get around this problem by reading all of this Stack Overflow discussion and, immediately after installing TF via conda, applying a combination of two answers:

    conda install -c conda-forge cudatoolkit-dev

This will update the standard cudatoolkit and provide both nvcc as well as ptxas.
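A quick sanity check that both tools are actually visible from the active environment afterwards:

    import shutil

    # Both should resolve to binaries inside the conda environment after the
    # cudatoolkit-dev install.
    for tool in ("nvcc", "ptxas"):
        print(f"{tool}: {shutil.which(tool)}")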

However, there are at least two caveats (and one Ooops - see below)

  • Please note that apparently this package can be installed in one virtual environment only. In order to install it in the environment with TF-2.3rc2, I first had to remove the environment with the nightly, where I had originally installed it (maybe uninstalling the toolkit would have been enough, but I did not want to keep the nightly anyway).

  • Also, the virtual environment will have its own cudatoolkit in its pkgs directory. However, a lot of the shared library files in the lib64 directory have length zero, so, unsurprisingly, the GPU cannot be used. This can be remedied by copying over everything from the global cudatoolkit lib64 directory, but this still leaves quite a few zero-length files (a quick way to list them is sketched below), which might play some part somewhere. For running the training, however, everything seems to be OK.
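Listing the leftover zero-length libraries takes only a few lines (the lib64 path is a placeholder; substitute the environment's own pkgs/.../lib64 directory):

    from pathlib import Path

    # Placeholder path; point this at the environment's cudatoolkit lib64 directory.
    lib64 = Path("/path/to/env/pkgs/cudatoolkit/lib64")
    for so in sorted(lib64.glob("*.so*")):
        if so.is_file() and so.stat().st_size == 0:
            print(so)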

And now to the really bad part - the Ooops: it seems that the installation of cudatoolkit-dev messes with your system (somehow, even though it was not done with root privileges), and not just with miniconda! After installing it, nvidia-detector crashes. I am not sure how serious this actually is (the system is still running and also works as expected after a reboot, at least for now, afaict). So, if you do this, you do it at your own risk!

So, sorry for the long post, but maybe this actually answers @jarednielsen's original question and also offers a way to sidestep the issue, albeit at a certain cost …