tensorflow: why does TensorFlow 2 only use one GPU when multiple GPUs are available, while the others stay idle
Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary):
- TensorFlow version (use command below):
- Python version:
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version:
- GPU model and memory:
You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:
- TF 1.0:
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
- TF 2.0:
python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"
Describe the current behavior
Describe the expected behavior
Standalone code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.
Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
When I run my training code, I found that only one GPU is used while the others are never fully used. For example, when the first GPU is almost out of memory, the code cannot use the memory of the other GPUs. I don't know whether this is a bug?
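(Not part of the original report, but a quick sanity check before digging into utilisation: confirm that TensorFlow actually sees every card. This assumes TF 2.1+; on TF 2.0 the same call lives under tf.config.experimental.)

```python
import tensorflow as tf

# List the physical GPUs TensorFlow can see; if only one shows up here,
# the problem is driver/visibility, not the training code.
gpus = tf.config.list_physical_devices('GPU')
print("Num GPUs visible to TensorFlow:", len(gpus))
for gpu in gpus:
    print(gpu)
```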
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 15 (2 by maintainers)
Okay. So I get your problem. One of your GPUs is being used too much and you want your data to be shifted to other GPUs to avoid the OOM error.
Q. But when TensorFlow needs to allocate more memory, why can't it allocate the extra memory on the other GPU cards I specified before?
A. TensorFlow does not allocate memory to the other GPUs properly or synchronously if you are not using tf.distribute.
Q. Can that be solved by tf.distribute? A. Yes, tf.distribute provides several ways of dividing your data among multiple GPUs. It also supports synchronous distributed training on GPUs.
tf.distribute
will be able to handle the synchronous distributed training on GPUs that your case requires. It is not activated unless you specify it explicitly. The OOM error occurs because you currently cannot fit more data / a larger model into a single GPU.
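For concreteness, here is a minimal sketch of what opting into tf.distribute.MirroredStrategy with model.fit looks like (the tiny model and random data are placeholders, not taken from the original issue):

```python
import numpy as np
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and keeps the
# replicas in sync; nothing is distributed unless a strategy is used.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Everything that creates variables (model, optimizer) must be built
# inside the strategy's scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')

# Placeholder data -- replace with the real dataset.
x = np.random.random((1024, 32)).astype('float32')
y = np.random.random((1024, 1)).astype('float32')

# model.fit splits each global batch across the replicas automatically.
model.fit(x, y, batch_size=64, epochs=2)
```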
Q. Will tf.distribute solve the OOM? A. Maybe; it depends on how large your model is and how many operations can be parallelized.
From the docs, this is what MirroredStrategy will do:
Please see the distributed training docs. It might solve your OOM problem, but again, that depends on your model size and training.
But I am sure this training strategy will be more efficient and significantly faster.
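One point worth spelling out (a general property of synchronous data parallelism, not something stated above): MirroredStrategy replicates the model weights on every GPU, so it does not reduce the memory the weights themselves need; what it splits is the batch, so each GPU only holds the activations for its per-replica slice of the global batch.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Hypothetical numbers for illustration: with 4 GPUs and a per-replica
# batch of 64, each step processes a global batch of 256, but each GPU
# only materialises activations for its 64 examples.
per_replica_batch_size = 64
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync
print("Global batch size:", global_batch_size)
```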
That's so kind of you! Thanks for your detailed answers! I will check it out and try it myself!
TensorFlow does not do distributed computing by default, neither in v1 nor in v2. Simply having multiple GPU cards does not mean all of them will be used. I see that in your example you are writing a custom training loop instead of model.fit().
As far as your system settings go, there is no issue with that part. Have a look at this example of using distributed training with custom loops.
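For reference, a minimal sketch of a custom training loop under MirroredStrategy, following the pattern from the distributed-training docs (the model, loss, and dataset are placeholders; strategy.run assumes TF 2.2+, older 2.x releases use strategy.experimental_run_v2 instead):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
GLOBAL_BATCH_SIZE = 64

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(32,))])
    optimizer = tf.keras.optimizers.SGD()
    # Reduction is NONE so per-example losses can be averaged over the
    # *global* batch size, not just the per-replica slice.
    loss_obj = tf.keras.losses.MeanSquaredError(
        reduction=tf.keras.losses.Reduction.NONE)

def compute_loss(labels, predictions):
    per_example_loss = loss_obj(labels, predictions)
    return tf.nn.compute_average_loss(
        per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)

def train_step(inputs):
    features, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(features, training=True)
        loss = compute_loss(labels, predictions)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_train_step(inputs):
    # Run one step on every replica, then sum the per-replica losses.
    per_replica_losses = strategy.run(train_step, args=(inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM,
                           per_replica_losses, axis=None)

# Placeholder dataset; distributing it gives each replica its own slice
# of every global batch.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 32]), tf.random.normal([1024, 1]))
).batch(GLOBAL_BATCH_SIZE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

for epoch in range(2):
    for batch in dist_dataset:
        loss = distributed_train_step(batch)
    print("epoch", epoch, "loss", float(loss))
```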