tensorflow: distribute.MirroredStrategy fails with Resource exhausted: OOM when allocating tensor with shape
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary): source r1.10
- TensorFlow version (use command below): 1.10
- Python version:
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version:
- GPU model and memory:
- Exact command to reproduce:
You can collect some of this information using our environment capture script:
https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh
You can obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
Describe the problem
We have training code based on tf.Estimator that works well on a single GPU with `tf.contrib.distribute.OneDeviceStrategy("device:GPU:0")`. But when we add another GPU and change the distribution strategy to `tf.contrib.distribute.MirroredStrategy(num_gpus=num_gpus)`, the training code no longer runs and raises a memory allocation error, even if we drastically reduce the batch size (from 128 to 64). Below you'll find a sample of the error output.
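For context, a minimal sketch of how the strategy is wired into the Estimator (the toy `model_fn`/`input_fn` below are placeholders, not our actual model and input pipeline, which are much larger):

```python
import tensorflow as tf

def input_fn():
    # Placeholder input pipeline (the real one reads from files).
    features = tf.random_uniform([1024, 10])
    labels = tf.random_uniform([1024, 1])
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    return dataset.repeat().batch(128)

def model_fn(features, labels, mode):
    # Placeholder model (the real network is far bigger, hence the OOM).
    predictions = tf.layers.dense(features, 1)
    loss = tf.losses.mean_squared_error(labels, predictions)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

# Works on a single GPU:
# strategy = tf.contrib.distribute.OneDeviceStrategy("device:GPU:0")

# Fails with the OOM error below once a second GPU is used:
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=2)

config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn=input_fn, max_steps=100)
```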
Source code / logs
2018-09-05 12:06:36.826713: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x7fa445486000 of size 2013265920
2018-09-05 12:06:36.826719: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x7fa4bd486000 of size 2013265920
2018-09-05 12:06:36.826725: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Free at 0x7fa535486000 of size 2013265920
2018-09-05 12:06:36.826730: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x7fa5ad486000 of size 2013501696
2018-09-05 12:06:36.826735: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Free at 0x7fa6254bf900 of size 1855833856
2018-09-05 12:06:36.826741: I tensorflow/core/common_runtime/bfc_allocator.cc:671] Summary of in-use Chunks by size:
2018-09-05 12:06:36.826748: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 17 Chunks of size 256 totalling 4.2KiB
2018-09-05 12:06:36.826754: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 1280 totalling 2.5KiB
2018-09-05 12:06:36.826760: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 3 Chunks of size 2048 totalling 6.0KiB
2018-09-05 12:06:36.826766: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 2304 totalling 2.2KiB
2018-09-05 12:06:36.826772: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 9728 totalling 19.0KiB
2018-09-05 12:06:36.826778: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 73728 totalling 144.0KiB
2018-09-05 12:06:36.826784: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 147456 totalling 288.0KiB
2018-09-05 12:06:36.826791: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 13516800 totalling 12.89MiB
2018-09-05 12:06:36.826797: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 2013265920 totalling 3.75GiB
2018-09-05 12:06:36.826803: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 2013501696 totalling 1.88GiB
2018-09-05 12:06:36.826809: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 5.64GiB
2018-09-05 12:06:36.826818: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit: 11922948096
InUse: 6054027520
MaxInUse: 9980828416
NumAllocs: 153
MaxAllocSize: 3507027968
2018-09-05 12:06:36.826832: W tensorflow/core/common_runtime/bfc_allocator.cc:279] *_______________***********************************________________******************_______________
2018-09-05 12:06:36.826879: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at nccl_ops.cc:96 : Resource exhausted: OOM when allocating tensor with shape[503375361] and type float on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
...
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[503375361] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: NcclAllReduce = NcclAllReduce[T=DT_FLOAT, _class=["loc:@Reshape_28"], num_devices=2, reduction="sum", shared_name="c0", _device="/job:localhost/replica:0/task:0/device:GPU:0"](concat)]]
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 2
- Comments: 15 (10 by maintainers)
The `concat` op has to create a memory block to store the concatenated result. In our nccl packing algorithm, we concatenate all gradients into one large tensor. We can switch to a different tensor aggregation method (specify non-zero `agg_small_grads_max_bytes` and `agg_small_grads_max_group`, and set `num_packs` to 0) or even disable tensor aggregation. `CollectiveAllReduceStrategy` is another option, which I think avoids creating large concatenated tensors as well.
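A minimal sketch of that alternative aggregation setup, assuming the r1.10 contrib API (the `agg_small_grads_*` values are illustrative, and the constructor argument names should be checked against your build):

```python
import tensorflow as tf

# num_packs=0 avoids concatenating all gradients into a few huge tensors;
# the agg_small_grads_* options instead group only the small gradients.
cross_tower_ops = tf.contrib.distribute.AllReduceCrossTowerOps(
    all_reduce_alg="nccl",
    num_packs=0,
    agg_small_grads_max_bytes=1 << 20,  # aggregate gradients smaller than ~1 MiB
    agg_small_grads_max_group=16)       # at most 16 gradients per group

strategy = tf.contrib.distribute.MirroredStrategy(
    num_gpus=2, cross_tower_ops=cross_tower_ops)
```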
From what I can understand, packing and splitting shouldn't affect overall memory usage. This logic concats many (small) tensors into `num_packs` (larger) tensors, but overall memory usage should remain the same. I'm not aware of a size limitation when using nccl. Looking at the logs above, the `op_kernel.cc` failure reports OOM at `GPU:1`, but the `ResourceExhaustedError` reports OOM at `GPU:0`. @yuefengz could there be an issue with placing the concat and split ops on the correct devices? I noticed that we use `ops.colocate_with`, which was recently deprecated.
Collectives have this logic built into the C++ backend via the `ScopedAllocator`. Conceptually it does a similar thing. We haven't seen any OOMs due to `ScopedAllocator`.

@jrabary thanks for trying it out. `num_packs=0` means that we will reduce each gradient separately, instead of trying to combine all of them into a small number of tensors first. The performance impact will depend on the use case. In cases where the gradient tensors are large (as they seem to be in your use case), it is not feasible to combine them given the limited memory.
@yuefengz @dubey it seems like we should not try to combine gradients if there isn’t enough memory. Can we do this in MirroredStrategy? Does CollectiveAllReduceStrategy check this?
Hi @guptapriya, I did, and setting `num_packs=0` seems to work. So what does it really mean if we set `num_packs` to zero? And what are the side effects w.r.t. optimisation results?
@jrabary thank you for sharing the code. Could you try setting `num_packs=0` when you define the `cross_tower_ops` using `AllReduceCrossTowerOps`? My hypothesis is that it is packing all the gradients into 2 tensors (with `num_packs=2`), and this is too big for nccl to handle.
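A minimal sketch of that suggestion, with `num_gpus=2` and the Estimator wiring assumed to match the setup described earlier in this issue:

```python
import tensorflow as tf

# num_packs=0: all-reduce each gradient tensor separately instead of packing
# them into a small number of very large concatenated tensors.
cross_tower_ops = tf.contrib.distribute.AllReduceCrossTowerOps(
    all_reduce_alg="nccl", num_packs=0)

strategy = tf.contrib.distribute.MirroredStrategy(
    num_gpus=2, cross_tower_ops=cross_tower_ops)

config = tf.estimator.RunConfig(train_distribute=strategy)
# estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
# estimator.train(input_fn=input_fn)
```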