tensorflow: Memory leak in zeros_like/Tile

So I’m trying to figure out why my ResNets are running out of memory, and it seems there’s a memory leak in the Tile and zeros_like operations.

Those ops have memory allocated during each session run, but there are no corresponding __LOG_MEMORY__ deallocation messages. The sum of the missing deallocations matches the amount of memory leaked as reported by the allocator’s max_bytes_in_use (accessed through the tf.contrib.memory_stats.MaxBytesInUse op).

Here’s a simplified repro: at each sess.run, memory grows by 1.15 GB until the process crashes with an OOM error: https://github.com/yaroslavvb/stuff/blob/master/resnet_leak_report2.py
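For reference, a minimal sketch of the kind of loop the linked repro runs (a hypothetical stand-in for TF 1.x graph mode, not the actual repro script; whether this exact graph leaks at the same rate depends on the version and graph details). The backward pass inserts the two ops named below: leaky_relu’s gradient creates a zeros_like node, and reduce_sum’s gradient creates a Tile node.

```python
import tensorflow as tf

# Build a graph whose gradients contain zeros_like (from leaky_relu)
# and Tile (from the Sum gradient).
x = tf.random_normal([8192, 8192])
loss = tf.reduce_sum(tf.nn.leaky_relu(x))
grads = tf.gradients(loss, [x])

# Peak-allocation counter from the same op the report uses.
max_bytes = tf.contrib.memory_stats.MaxBytesInUse()

with tf.Session() as sess:
    for i in range(10):
        sess.run(grads)
        print("Run %d, GBs in use %.2f" % (i, sess.run(max_bytes) / 1e9))
```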

When I run it, I see

Run 0, GBs in use 2.30
Run 1, GBs in use 3.60
Run 2, GBs in use 4.75
Run 3, GBs in use 5.90
2017-09-21 14:56:31.994302: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 137.33MiB.  Current allocation summary follows....

Offending ops:

gradients/leaky_relu_grad/zeros_like 576MB
gradients/Sum_grad/Tile  576MB

Version: Ubuntu 16.04, official TensorFlow Linux GPU Python 3.5 nightly wheel from today

__version__: 1.4.0-dev20170921
__git_version__: v1.3.0-rc1-2408-ge9d5ee1
Commit https://github.com/tensorflow/tensorflow/commit/e9d5ee1

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 32 (31 by maintainers)

Most upvoted comments

tf.zeros has two implementations: one which uses constants and one which uses fill. I do not understand why we prefer the one which uses constants, since it leads to huge graphs (and we should be able to constant-fold fill anyway). For eager we already never use the constant one, as it leads to huge CPU-to-GPU copies of the constants every time you run zeros, which is face-palmingly bad behavior.
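Roughly, the distinction between the two paths looks like this (a sketch of the two graph-construction strategies, not TensorFlow’s internal code):

```python
import numpy as np
import tensorflow as tf

shape = [1000, 1000]

# Constant path: the zeros live in a Const node. The buffer becomes part
# of the graph, bloating it, and (per the discussion here) is never
# deallocated by the allocator.
zeros_as_constant = tf.constant(np.zeros(shape, dtype=np.float32))

# Fill path: the graph stores only the shape and the scalar 0. The output
# is an ordinary per-run allocation that can be freed afterwards, and the
# Fill node can still be constant-folded where that helps.
zeros_as_fill = tf.fill(shape, 0.0)
```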

Now that we have real performance benchmarks, does anyone oppose me investigating whether we can make tf.zeros always use fill?

To summarize this bug: there’s a memory leak because tf.ones/tf.zeros use tf.constant, and tf.constant doesn’t deallocate its memory by design. The recommended fix is to change tf.ones/tf.zeros to always use tf.fill instead.
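Until such a fix lands, one user-level workaround is to build zeros through tf.fill yourself (zeros_via_fill is a hypothetical helper written for this issue, not a TensorFlow API):

```python
import tensorflow as tf

def zeros_via_fill(shape, dtype=tf.float32):
    # Hypothetical helper: produces zeros through a Fill node, whose output
    # is a regular allocation the allocator can free, instead of a Const
    # node whose buffer persists for the lifetime of the graph.
    return tf.fill(shape, tf.constant(0, dtype=dtype))

z = zeros_via_fill([1024, 1024])  # drop-in replacement for tf.zeros([1024, 1024])
```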