tensorflow: Memory leak in zeros_like/Tile

So I’m trying to figure out why my ResNets are running out of memory, and it seems there’s a memory leak in the Tile and zeros_like operations.

Those ops have memory allocated during each session run, but there are no corresponding __LOG_MEMORY__ deallocation messages. The sum of the missing deallocations matches the amount of memory leaked as reported by the allocator’s max_bytes_in_use (accessed through the tf.contrib.memory_stats.MaxBytesInUse op).

Here’s a simplified repro: at each sess.run, memory grows by 1.15 GB until the process crashes with an OOM error: https://github.com/yaroslavvb/stuff/blob/master/resnet_leak_report2.py
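For reference, a minimal sketch of the kind of loop the linked repro runs (a hypothetical stand-in for TF 1.x graph mode, not the actual repro script; whether this exact graph leaks at the same rate depends on the version and graph details). The backward pass inserts the two ops named below: leaky_relu’s gradient creates a zeros_like node, and reduce_sum’s gradient creates a Tile node.

```python
import tensorflow as tf

# Build a graph whose gradients contain zeros_like (from leaky_relu)
# and Tile (from the Sum gradient).
x = tf.random_normal([8192, 8192])
loss = tf.reduce_sum(tf.nn.leaky_relu(x))
grads = tf.gradients(loss, [x])

# Peak-allocation counter from the same op the report uses.
max_bytes = tf.contrib.memory_stats.MaxBytesInUse()

with tf.Session() as sess:
    for i in range(10):
        sess.run(grads)
        print("Run %d, GBs in use %.2f" % (i, sess.run(max_bytes) / 1e9))
```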

When I run it, I see

Run 0, GBs in use 2.30
Run 1, GBs in use 3.60
Run 2, GBs in use 4.75
Run 3, GBs in use 5.90
2017-09-21 14:56:31.994302: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 137.33MiB.  Current allocation summary follows....

Offending ops:

gradients/leaky_relu_grad/zeros_like 576MB
gradients/Sum_grad/Tile  576MB

Version: Ubuntu 16.04, official TensorFlow Linux GPU Python 3.5 nightly wheel from today

__version__: 1.4.0-dev20170921
__git_version__: v1.3.0-rc1-2408-ge9d5ee1
Commit https://github.com/tensorflow/tensorflow/commit/e9d5ee1

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 32 (31 by maintainers)

Most upvoted comments

tf.zeros has two implementations: one which uses constants and one which uses fill. I do not understand why we prefer the one which uses constants, since it leads to huge graphs (and we should be able to constant-fold fill anyway). For eager we already never use the constant one, as it leads to huge CPU-to-GPU copies of the constants every time you run zeros, which is face-palmingly bad behavior.
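Roughly, the distinction between the two paths looks like this (a sketch of the two graph-construction strategies, not TensorFlow’s internal code):

```python
import numpy as np
import tensorflow as tf

shape = [1000, 1000]

# Constant path: the zeros live in a Const node. The buffer becomes part
# of the graph, bloating it, and (per the discussion here) is never
# deallocated by the allocator.
zeros_as_constant = tf.constant(np.zeros(shape, dtype=np.float32))

# Fill path: the graph stores only the shape and the scalar 0. The output
# is an ordinary per-run allocation that can be freed afterwards, and the
# Fill node can still be constant-folded where that helps.
zeros_as_fill = tf.fill(shape, 0.0)
```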

Now that we have real performance benchmarks, does anyone oppose me investigating whether we can make tf.zeros always use fill?

To summarize this bug: there’s a memory leak because tf.ones/tf.zeros use tf.constant, and tf.constant doesn’t deallocate its memory by design. The recommended fix is to change tf.ones/tf.zeros to always use tf.fill instead.
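Until such a fix lands, one user-level workaround is to build zeros through tf.fill yourself (zeros_via_fill is a hypothetical helper written for this issue, not a TensorFlow API):

```python
import tensorflow as tf

def zeros_via_fill(shape, dtype=tf.float32):
    # Hypothetical helper: produces zeros through a Fill node, whose output
    # is a regular allocation the allocator can free, instead of a Const
    # node whose buffer persists for the lifetime of the graph.
    return tf.fill(shape, tf.constant(0, dtype=dtype))

z = zeros_via_fill([1024, 1024])  # drop-in replacement for tf.zeros([1024, 1024])
```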