tensorflow: distribute.MirroredStrategy fails with Resource exhausted: OOM when allocating tensor with shape

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): source r1.10
  • TensorFlow version (use command below): 1.10
  • Python version:
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:
  • Exact command to reproduce:

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with

python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the problem

We have training code based on tf.Estimator that works well on a single GPU with tf.contrib.distribute.OneDeviceStrategy("device:GPU:0"). But when we add another GPU and change the distribution strategy to tf.contrib.distribute.MirroredStrategy(num_gpus=num_gpus), the training code no longer runs and raises a memory allocation error, even if we drastically reduce the batch size (from 128 to 64). A rough sketch of the setup is shown below; a sample of the error output follows in the logs.
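For reference, a minimal sketch of how the strategy is wired into the Estimator. The model_fn and input_fn here are illustrative placeholders, not the actual training code (the real model is much larger, with gradients in the GB range):

import tensorflow as tf

def model_fn(features, labels, mode):
    # Toy model standing in for the real one.
    logits = tf.layers.dense(features["x"], 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    # Synthetic data standing in for the real input pipeline.
    features = {"x": tf.random_normal([1024, 32])}
    labels = tf.zeros([1024], dtype=tf.int64)
    return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(64)

# Works: single GPU.
# strategy = tf.contrib.distribute.OneDeviceStrategy("device:GPU:0")

# Fails with the OOM below: two GPUs.
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=2)

config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn=input_fn, steps=100)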

Source code / logs

2018-09-05 12:06:36.826713: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x7fa445486000 of size 2013265920
2018-09-05 12:06:36.826719: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x7fa4bd486000 of size 2013265920
2018-09-05 12:06:36.826725: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Free  at 0x7fa535486000 of size 2013265920
2018-09-05 12:06:36.826730: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Chunk at 0x7fa5ad486000 of size 2013501696
2018-09-05 12:06:36.826735: I tensorflow/core/common_runtime/bfc_allocator.cc:665] Free  at 0x7fa6254bf900 of size 1855833856
2018-09-05 12:06:36.826741: I tensorflow/core/common_runtime/bfc_allocator.cc:671]      Summary of in-use Chunks by size:
2018-09-05 12:06:36.826748: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 17 Chunks of size 256 totalling 4.2KiB
2018-09-05 12:06:36.826754: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 1280 totalling 2.5KiB
2018-09-05 12:06:36.826760: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 3 Chunks of size 2048 totalling 6.0KiB
2018-09-05 12:06:36.826766: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 2304 totalling 2.2KiB
2018-09-05 12:06:36.826772: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 9728 totalling 19.0KiB
2018-09-05 12:06:36.826778: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 73728 totalling 144.0KiB
2018-09-05 12:06:36.826784: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 147456 totalling 288.0KiB
2018-09-05 12:06:36.826791: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 13516800 totalling 12.89MiB
2018-09-05 12:06:36.826797: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 2013265920 totalling 3.75GiB
2018-09-05 12:06:36.826803: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 2013501696 totalling 1.88GiB
2018-09-05 12:06:36.826809: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 5.64GiB
2018-09-05 12:06:36.826818: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats:
Limit:                 11922948096
InUse:                  6054027520
MaxInUse:               9980828416
NumAllocs:                     153
MaxAllocSize:           3507027968

2018-09-05 12:06:36.826832: W tensorflow/core/common_runtime/bfc_allocator.cc:279] *_______________***********************************________________******************_______________
2018-09-05 12:06:36.826879: W tensorflow/core/framework/op_kernel.cc:1275] OP_REQUIRES failed at nccl_ops.cc:96 : Resource exhausted: OOM when allocating tensor with shape[503375361] and type float on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc

...

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[503375361] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[Node: NcclAllReduce = NcclAllReduce[T=DT_FLOAT, _class=["loc:@Reshape_28"], num_devices=2, reduction="sum", shared_name="c0", _device="/job:localhost/replica:0/task:0/device:GPU:0"](concat)]]

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 15 (10 by maintainers)

Most upvoted comments

The concat op has to create a memory block to store the concatenated result. In our nccl packing algorithm, we concat all gradients into one large tensor. We can switch to a different tensor aggregation method (by specifying non-zero agg_small_grads_max_bytes and agg_small_grads_max_group and setting num_packs to 0), or even disable tensor aggregation entirely; a sketch follows below. CollectiveAllReduceStrategy is another option, which I think avoids creating large concatenated tensors as well.
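A sketch of that alternative aggregation configuration. The argument names follow the 1.10 tf.contrib.distribute.AllReduceCrossTowerOps constructor as far as I recall; the byte and group limits are illustrative values, not recommendations:

import tensorflow as tf

# Group small gradients into bounded packs instead of concatenating everything
# into num_packs large tensors (num_packs=0 disables that concat/packing path).
cross_tower_ops = tf.contrib.distribute.AllReduceCrossTowerOps(
    all_reduce_alg="nccl",
    num_packs=0,
    agg_small_grads_max_bytes=32 * 1024 * 1024,  # illustrative: ~32 MB per group
    agg_small_grads_max_group=16)                # illustrative: at most 16 tensors per group

strategy = tf.contrib.distribute.MirroredStrategy(
    num_gpus=2, cross_tower_ops=cross_tower_ops)

# CollectiveAllReduceStrategy (also in tf.contrib.distribute) is the other
# option mentioned above; it avoids the large concatenated tensors as well.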

From what I can understand, packing and splitting shouldn't affect overall memory usage: this logic concats many (small) tensors into num_packs (larger) tensors, but the total amount of memory in use should remain the same.

I’m not aware of a size limitation when using nccl.

Looking at the logs above, the op_kernel.cc failure reports OOM on GPU:1, but the ResourceExhaustedError reports OOM on GPU:0. @yuefengz could there be an issue with placing the concat and split ops on the correct devices? I noticed that we use ops.colocate_with, which was recently deprecated.

The collectives implementation has this logic built into the C++ backend via the ScopedAllocator. Conceptually it does a similar thing, and we haven't seen any OOMs caused by the ScopedAllocator.

@jrabary thanks for trying it out. num_packs=0 means that we will reduce each gradient separately, instead of trying to combine all of them into a small number of tensors first. The performance impact will depend on the use case; in cases where the gradient tensors are large, though (as they seem to be in your use case), it is not feasible to combine them given the limited memory.

@yuefengz @dubey it seems like we should not try to combine gradients if there isn’t enough memory. Can we do this in MirroredStrategy? Does CollectiveAllReduceStrategy check this?

Hi @guptapriya, I did, and setting num_packs=0 seems to work. So what does it really mean to set num_packs to zero? And what are the side effects w.r.t. the optimisation results?

@jrabary thank you for sharing the code. Could you try setting num_packs=0 when you define the cross_tower_ops using AllReduceCrossTowerOps (see the sketch below)? My hypothesis is that it is packing all the gradients into 2 tensors (with num_packs=2), and these are too big for nccl to handle.
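A minimal sketch of that suggestion, assuming the 1.10 contrib API; the 2-GPU count and Estimator wiring are taken from the setup above:

import tensorflow as tf

# num_packs=0: reduce each gradient tensor separately rather than
# concatenating them all into a couple of huge tensors first.
cross_tower_ops = tf.contrib.distribute.AllReduceCrossTowerOps(
    all_reduce_alg="nccl", num_packs=0)

strategy = tf.contrib.distribute.MirroredStrategy(
    num_gpus=2, cross_tower_ops=cross_tower_ops)

config = tf.estimator.RunConfig(train_distribute=strategy)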