tensorflow: why does TensorFlow 2 only use one GPU when multiple GPUs are set, while the others stay unused?

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary):
  • TensorFlow version (use command below):
  • Python version:
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:

  1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
  2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior

Describe the expected behavior

Standalone code to reproduce the issue

Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

When I run my training code, I found that only one GPU is used while the others are not fully used. For example, when the first GPU is almost out of memory, the code cannot use the memory of the other GPUs. I don’t know if this is a bug?

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 15 (2 by maintainers)

Most upvoted comments

Okay. So I get your problem. One of your GPUs is being used too much and you want your data to be shifted to other GPUs to avoid the OOM error.

Q. But when TensorFlow needs to allocate more memory, why can it not allocate the extra memory to the other GPU cards I specified before?

A. TensorFlow does not allocate memory to other GPUs properly or synchronously unless the computation is running under tf.distribute.

Q. Can that be solved by tf.distribute?

A. tf.distribute provides several ways of dividing your data among multiple GPUs, and it supports synchronous distributed training on GPUs, which is what your case requires. It is not activated by default; you have to specify a strategy explicitly.

The OOM error means that you currently cannot fit more of the data / model into your GPU.

Q. Will tf.distribute solve the OOM?

A. Maybe; it depends on how large your model is and how many operations can be parallelized.

From the docs, this is what MirroredStrategy does:

MirroredStrategy

tf.distribute.MirroredStrategy supports synchronous distributed training on multiple GPUs on one machine.

It creates one replica per GPU device. Each variable in the model is mirrored across all the replicas.

Together, these variables form a single conceptual variable called MirroredVariable. 

These variables are kept in sync with each other by applying identical updates.

Please see the distributed training docs. It might solve your OOM problem, but again, that depends on your model size and training setup.

But I am sure this training strategy will be more efficient and significantly faster.
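For reference, here is a minimal sketch of that pattern with Keras model.fit(); the model architecture, batch size, and x_train / y_train below are placeholders, not your code:

```python
import tensorflow as tf

# MirroredStrategy uses all GPUs visible to TensorFlow by default.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Variables must be created inside strategy.scope() so they are mirrored
# across every GPU replica.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# model.fit() splits each global batch across the replicas automatically.
# x_train / y_train are assumed to exist (NumPy arrays or a tf.data.Dataset):
# model.fit(x_train, y_train,
#           batch_size=64 * strategy.num_replicas_in_sync, epochs=5)
```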

That is so nice of you! Thanks for your detailed answers! I will check and try it myself!

By default, TensorFlow uses GPU:0 as the default GPU. I would suggest first checking the available GPUs using tf.config.list_physical_devices to see how many GPUs are visible. If you have more than one GPU available, then you need to do distributed GPU training. This can be done easily by defining a strategy and building your model inside its strategy.scope(). These scopes are given in this documentation link. Please use a suitable strategy as per your needs. In the simple case of one machine with multiple GPUs, I guess tf.distribute.MirroredStrategy will be useful. Please check the above docs for more info.
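For illustration, a small sketch of that check; the one-layer model inside the scope is only a placeholder for your own model code:

```python
import tensorflow as tf

# List the GPUs that TensorFlow can actually see.
gpus = tf.config.list_physical_devices("GPU")
print("GPUs visible to TensorFlow:", gpus)

if len(gpus) > 1:
    # Several GPUs on one machine: mirror the model across all of them.
    strategy = tf.distribute.MirroredStrategy()
else:
    # Single GPU (or CPU only): fall back to the default strategy.
    strategy = tf.distribute.get_strategy()

# Build and compile the model inside the strategy's scope.
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # placeholder model
    model.compile(optimizer="adam", loss="mse")
```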

Thanks, sir. I have tried this method, but it didn’t work. By the way, do you know whether all versions of TensorFlow (including tf1.x) also use one GPU by default even if we set multiple GPUs to use? Because I find that this also happens when I use TensorFlow 1.15.

TensorFlow does not do distributed computing by default, whether in v1 or v2. Setting multiple GPU cards does not mean all of them will be used. I see that in your example you are writing a custom training loop instead of using model.fit().

As per your system settings, there is no issue with that part. Have a look at this example to use distributed training with custom loops.
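For reference, a rough sketch of how a custom loop runs under a strategy; the model, loss, and the commented-out train_dataset below are placeholders, not taken from your code:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
GLOBAL_BATCH_SIZE = 64 * strategy.num_replicas_in_sync

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # placeholder model
    optimizer = tf.keras.optimizers.Adam()
    # Reduction.NONE: each replica returns per-example losses, which we
    # average manually over the global batch below.
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True,
        reduction=tf.keras.losses.Reduction.NONE,
    )

# The dataset must be distributed so each replica gets its own shard of
# every global batch. train_dataset is assumed to be a tf.data.Dataset:
# dist_dataset = strategy.experimental_distribute_dataset(train_dataset)

def train_step(inputs):
    features, labels = inputs
    with tf.GradientTape() as tape:
        logits = model(features, training=True)
        per_example_loss = loss_object(labels, logits)
        loss = tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_train_step(inputs):
    # strategy.run() executes train_step once per replica (one per GPU)
    # and returns per-replica losses, which are then reduced to a scalar.
    per_replica_losses = strategy.run(train_step, args=(inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM,
                           per_replica_losses, axis=None)

# for batch in dist_dataset:
#     loss = distributed_train_step(batch)
```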