tensorflow: Memory leak

System information
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes, see below
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04 (also demonstrated on Windows 7)
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): v2.0.0-rc2-26-g64c3d38 2.0.0 (problem disappears on v1.14.0-rc1-22-gaf24dc91b5 1.14.0)
  • Python version: 3.6.8
  • CUDA/cuDNN version: CUDA v10.0, cuDNN v7.6.2.24
  • GPU model and memory: Nvidia GeForce 840M, but the problem persists in the non-GPU version

Describe the current behavior
When creating a trivially simple model and then entering a loop that calls predict() with dummy input, memory consumption increases indefinitely over time. On my system, a model with a single hidden layer of only 32 nodes consumes all available system RAM (>10 GB) after only 10 minutes. The problem occurs on TensorFlow v2.0 (GPU or CPU) but disappears when running identical code on v1.14.

Describe the expected behavior
Memory consumption is expected to stabilize quickly, but it never does.

Code to reproduce the issue

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense
import numpy as np

# Build model
In = Input(shape=(10,))
x = Dense(32)(In)
Out = Dense(2)(x)

# Compile
model = Model(inputs=In, outputs=Out)
model.compile(optimizer='adam', loss='mse')        

# Create dummy input data
fake_data = np.random.uniform(low=0, high=1.0, size=(1, 10, ))

while True:
    # Repeatedly predict:
    model.predict(fake_data) # No memory leak if this line is replaced with "pass"
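
For reference, here is a minimal, self-contained sketch of how the growth can be observed from within the script itself (this assumes the third-party psutil package is installed; the iteration count and reporting interval are arbitrary):

import os

import numpy as np
import psutil
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense

# Same toy model as above
inp = Input(shape=(10,))
model = Model(inputs=inp, outputs=Dense(2)(Dense(32)(inp)))
model.compile(optimizer='adam', loss='mse')

fake_data = np.random.uniform(low=0, high=1.0, size=(1, 10))
process = psutil.Process(os.getpid())

for i in range(100000):
    model.predict(fake_data)
    if i % 1000 == 0:
        # Resident set size of this process, in MB
        print(f"iter {i}: RSS = {process.memory_info().rss / 1e6:.1f} MB")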

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 11
  • Comments: 54 (2 by maintainers)

Most upvoted comments

I have managed to get around this error by using model.predict_on_batch() instead of model.predict(). This returns an object of type <class 'tensorflow.python.framework.ops.EagerTensor'> - not a numpy array as claimed in the docs - but it can be cast by calling np.array(model.predict_on_batch(input_data)) to get the output I want.

Side note: I also noticed a similar memory leak problem with calling model.fit() in a loop, albeit with a slower memory accumulation, but this can be fixed in a similar way using model.train_on_batch().
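
A sketch of this workaround, using the toy model from the original report (the bounded loop is just to keep the example finite):

import numpy as np
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense

inp = Input(shape=(10,))
model = Model(inputs=inp, outputs=Dense(2)(Dense(32)(inp)))
model.compile(optimizer='adam', loss='mse')

fake_data = np.random.uniform(low=0, high=1.0, size=(1, 10))

for _ in range(10000):
    # predict_on_batch returns an EagerTensor here; wrap it in np.array() to get a plain array
    preds = np.array(model.predict_on_batch(fake_data))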

I have also experienced this critical issue.

The issue persists with TensorFlow 2.2.0. It would be better if someone from the TensorFlow team made a statement about it; it severely hinders the ability to train for a larger number of epochs.

This is an issue with how certain object lifetimes are being managed in the tf function cache. Behind the scenes Model.predict is creating some functions which aren’t being spun down properly, which is why there is a small but consistent leak each iteration. We are currently working on a fix.

Still getting this error with tf-nightly (2.2.0-dev20200415)

Same issue with TF 2.0 stable. Solved with tf.compat.v1.disable_eager_execution().
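
For completeness, a sketch of that workaround applied to the reproduction code; note that disable_eager_execution() must be called before the model is built, and it switches TF 2.x back to graph-mode execution:

import numpy as np
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense

# Must run before any model or graph construction
tf.compat.v1.disable_eager_execution()

inp = Input(shape=(10,))
model = Model(inputs=inp, outputs=Dense(2)(Dense(32)(inp)))
model.compile(optimizer='adam', loss='mse')

fake_data = np.random.uniform(low=0, high=1.0, size=(1, 10))
for _ in range(10000):
    model.predict(fake_data)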

@MProx Yes, the original code as shown below causes a memory leak.

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense
import numpy as np

# Build model
In = Input(shape=(10,))
x = Dense(32)(In)
Out = Dense(2)(x)

# Compile
model = Model(inputs=In, outputs=Out)
model.compile(optimizer='adam', loss='mse')        

# Create dummy input data
fake_data = np.random.uniform(low=0, high=1.0, size=(1, 10, ))

while True:
    # Repeatedly predict:
    model.predict(fake_data) # No memory leak if this line is replaced with "pass"

Below is the memory consumption over time (it increases initially, which is expected, but after a certain point, when it should be constant, it keeps creeping up by a very small amount; I only ran it for 30 minutes):

CMDLINE python test.py
MEM 0.609375 1579247671.2321
MEM 22.660156 1579247671.3397
MEM 68.218750 1579247671.4467
MEM 104.500000 1579247671.5552
MEM 114.675781 1579247671.6605
MEM 124.125000 1579247671.7657
MEM 131.191406 1579247671.8710
MEM 146.972656 1579247671.9783
MEM 156.164062 1579247672.0838
MEM 168.886719 1579247672.1893
MEM 182.667969 1579247672.2945
MEM 192.363281 1579247672.4000
MEM 202.539062 1579247672.5052
MEM 211.316406 1579247672.6102
MEM 211.968750 1579247672.7151
MEM 215.250000 1579247672.8199
MEM 215.464844 1579247672.9256
MEM 215.738281 1579247673.0316
MEM 215.890625 1579247673.1366
MEM 216.050781 1579247673.2414
MEM 216.468750 1579247673.3461
MEM 221.535156 1579247673.4508
MEM 226.820312 1579247673.5557
MEM 227.011719 1579247673.7651
MEM 227.261719 1579248604.1910
MEM 227.511719 1579249397.2364
MEM 228.515625 1579250955.6305
MEM 228.304688 1579251348.9152

Avoiding model.predict and using model.predict_on_batch solves the Memory Error for me. Here’s an example to create batched predictions on your test set.

# custom batched prediction loop to avoid memory leak issues for now in the model.predict call
y_pred_probs = np.empty([len(X_test), VOCAB_SIZE], dtype=np.float32)  # pre-allocate required memory for array for efficiency

BATCH_INDICES = np.arange(start=0, stop=len(X_test), step=BATCH_SIZE)  # row indices of batches
BATCH_INDICES = np.append(BATCH_INDICES, len(X_test))  # add final batch_end row

for index in np.arange(len(BATCH_INDICES) - 1):
    batch_start = BATCH_INDICES[index]  # first row of the batch
    batch_end = BATCH_INDICES[index + 1]  # first row of the next batch (exclusive end of this one)
    y_pred_probs[batch_start:batch_end] = model.predict_on_batch(X_test[batch_start:batch_end])

(Note that if only pre-allocating the results array already results in a MemoryError then simply the array does not fit in your available memory regardless of the memory leak issue.)

Hi,

I am trying to call model.predict() on CPU multiple times and I observe a RAM memory leak. clear_session() with a model reload and gc.collect() don't solve the issue. I ran the code on TensorFlow 2.1 and 2.3 as well, but the issue still persists. Is there a workaround for this? I am using TF 1.14 and Python 3.6, and I have been struggling with this problem for a long time.

Hi,

I am trying to call model.predict() on CPU (the model was trained on GPU) multiple times and I observe a RAM memory leak. clear_session() with a model reload doesn't solve the issue, and model.predict_on_batch() fails to solve it as well. Is there a workaround for this? I am using TF 1.13 and Python 3.6, and I have been struggling with this problem for a long time. I could really use some help.

In that case I’m going to re-open this issue so that maybe the TF team can help you.

This issue is NOT resolved. I use predict_on_batch and still get an OOM error after some time of processing data.

Witnessing the same issue while running inference with TF.

Having a similar issue on the latest TF 2.4.1. Memory grows from about 60 GB (I'm using a large shuffle buffer) to 128 GB over the course of a few hours. Not sure if it's the same issue as originally mentioned here, since it seems to be more subtle and any number of causes could be behind the memory leak. I would file a new issue, but it's challenging to create a minimal reproducible example.

Hi. I have done what you asked and tested tf-nightly with my trivial code example from the original issue above, and it does indeed seem to fix the problem. I also tried it in a different application in which I've been using the predict_on_batch workaround, and there seem to be no memory leaks there either. So yes, the leak seems to have been dealt with. Thank you!

@MProx, about a month ago we added a fix for memory leak and there’s a possibility it has fixed this. Can you try !pip install tf-nightly and see if it resolves your issue?

model.predict_on_batch solved this problem for me too

@ialdencoots It sounds like you might be having a different OOM issue than the author of this issue and I were having. Incidentally, in my case, I am also loading my model from an .h5 file, and predict_on_batch fixed it. I would suggest trying to create a simpler version of your code that still has the problem to narrow down the cause and post it as a separate issue.

@MProx, I would recommend leaving this issue open. Although we have a workaround, there is a defect in the TensorFlow version of Keras’s predict function that should be fixed.

OK, so I messed around a bit with my model and I've narrowed down the problem a little more. When I load my trained model from an .h5 file and run predict_on_batch repeatedly, I eventually get the OOM error. However, if I create a new model with the loaded model's inputs and outputs, I can run predict_on_batch on my whole dataset without a problem. If I then compile the model with an optimizer and loss, I get the OOM error again. So it seems to be a problem only for compiled models.
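
In code, the distinction described above is roughly this (the file name, optimizer, and loss are placeholders, not the commenter's actual setup):

import tensorflow as tf
from tensorflow.keras import Model

# Loading a compiled model from disk: repeated predict_on_batch eventually hits OOM
loaded = tf.keras.models.load_model('trained_model.h5')

# Re-wrapping the same layers in a fresh, uncompiled Model avoids the problem
inference_model = Model(inputs=loaded.inputs, outputs=loaded.outputs)

# ...but compiling the new model brings the OOM error back
# inference_model.compile(optimizer='adam', loss='mse')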