tensorflow: Memory leak
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes, see below
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04 (also demonstrated on Windows 7)
- TensorFlow installed from (source or binary): Binary
- TensorFlow version (use command below): v2.0.0-rc2-26-g64c3d38 2.0.0 (problem disappears on v1.14.0-rc1-22-gaf24dc91b5 1.14.0)
- Python version: 3.6.8
- CUDA/cuDNN version: CUDA v10.0, cuDNN v7.6.2.24
- GPU model and memory: Nvidia GeForce 840M, but the problem persists in the non-GPU version
Describe the current behavior
When creating a trivially simple model and then entering a loop that calls predict() with dummy input, memory consumption increases indefinitely over time. On my system, a model with a single hidden layer of only 32 nodes consumes all available system RAM (>10 GB) after only 10 minutes. The problem occurs on TensorFlow v2.0 (GPU or CPU) but disappears when running identical code on v1.14.
Describe the expected behavior
Memory consumption should quickly stabilize, but it never does.
Code to reproduce the issue
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense
import numpy as np
# Build model
In = Input(shape=(10,))
x = Dense(32)(In)
Out = Dense(2)(x)
# Compile
model = Model(inputs=In, outputs=Out)
model.compile(optimizer='adam', loss='mse')
# Create dummy input data
fake_data = np.random.uniform(low=0, high=1.0, size=(1, 10))
while True:
    # Repeatedly predict:
    model.predict(fake_data)  # No memory leak if this line is replaced with "pass"
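(Not part of the original report, but one way to observe the growth is to log the process RSS inside the loop. A minimal sketch, assuming the third-party `psutil` package is installed and `model`/`fake_data` come from the code above.)

```python
import os
import psutil  # third-party package, used here only to monitor memory

process = psutil.Process(os.getpid())
while True:
    model.predict(fake_data)
    # On TF 2.0 this figure grows steadily instead of stabilizing
    print(f"RSS: {process.memory_info().rss / 1e6:.1f} MB")
```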
About this issue
- State: closed
- Created 5 years ago
- Reactions: 11
- Comments: 54 (2 by maintainers)
Commits related to this issue
- training: Fix memory leak in evaluation loop. See tensorflow/tensorflow#33009 for more details. — committed to JakobGM/project-thesis by JakobGM 4 years ago
I have managed to get around this error by using `model.predict_on_batch()` instead of `model.predict()`. This returns an object of type `<class 'tensorflow.python.framework.ops.EagerTensor'>` - not a numpy array as claimed in the docs - but it can be cast by calling `np.array(model.predict_on_batch(input_data))` to get the output I want.

Side note: I also noticed a similar memory leak problem with calling `model.fit()` in a loop, albeit with a slower memory accumulation, but this can be fixed in a similar way using `model.train_on_batch()`.
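(A minimal sketch of that workaround applied to the reproduction code above, with `model` and `fake_data` as defined there; the explicit `np.array(...)` cast is needed because `predict_on_batch` returns an `EagerTensor` here.)

```python
import numpy as np

while True:
    # predict_on_batch avoids the per-call leak seen with predict();
    # cast the returned EagerTensor to a plain NumPy array.
    output = np.array(model.predict_on_batch(fake_data))
```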
I have also experienced this critical issue.
The issue persists with TensorFlow 2.2.0. It would be better if someone from the TensorFlow team made a statement about it. It severely hinders the ability to train for a larger number of epochs.
This is an issue with how certain object lifetimes are being managed in the tf function cache. Behind the scenes, `Model.predict` is creating some functions which aren't being spun down properly, which is why there is a small but consistent leak each iteration. We are currently working on a fix.

Still getting this error with tf-nightly (2.2.0-dev20200415)
Same issue with TF 2.0 stable. Solved with `tf.compat.v1.disable_eager_execution()`.
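(A minimal sketch of how that workaround might be applied to the original reproduction code; note that `disable_eager_execution()` must be called before the model is built.)

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense

# Fall back to TF1-style graph execution before any model or tensor is created.
tf.compat.v1.disable_eager_execution()

inputs = Input(shape=(10,))
outputs = Dense(2)(Dense(32)(inputs))
model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='mse')

fake_data = np.random.uniform(low=0, high=1.0, size=(1, 10))
while True:
    model.predict(fake_data)  # runs in graph mode with eager execution disabled
```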
@MProx Yes, the original code as shown below causes a memory leak. Below is the memory consumption over time (it increases initially, which is expected, but after a certain point, when it is supposed to be constant, it keeps increasing, although by a very small amount; I ran it for only 30 minutes).
Avoiding `model.predict` and using `model.predict_on_batch` solves the Memory Error for me. Here's an example to create batched predictions on your test set.
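(The commenter's original snippet is not reproduced here; the following is a rough sketch of that approach, with `X_test`, `batch_size`, and `n_outputs` as placeholder names for your own test data, batch size, and model output width.)

```python
import numpy as np

n_samples = X_test.shape[0]
# Pre-allocate the results array up front (this is the allocation the note
# below refers to).
results = np.zeros((n_samples, n_outputs), dtype=np.float32)

for start in range(0, n_samples, batch_size):
    end = min(start + batch_size, n_samples)
    # predict_on_batch instead of predict; cast the EagerTensor to NumPy.
    results[start:end] = np.array(model.predict_on_batch(X_test[start:end]))
```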
(Note that if merely pre-allocating the results array already results in a MemoryError, then the array simply does not fit in your available memory, regardless of the memory leak issue.)
Hi,
I am trying to call `model.predict()` on CPU multiple times and I observe a RAM memory leak. `clear_session()` with model reload and `gc.collect()` doesn't solve the issue. I ran the code on TensorFlow 2.1 and 2.3 as well, but the issue still persists. Is there a workaround for this issue? I am using TF 1.14 and Python 3.6. I have been struggling to solve this problem for so long.
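(For reference, a rough sketch of the kind of loop described above: clearing the Keras session, reloading the model, and forcing garbage collection between predictions. `model.h5` and `data_chunks` are placeholder names. As the commenter notes, this did not stop the leak for them.)

```python
import gc
from tensorflow.keras import backend as K
from tensorflow.keras.models import load_model

for chunk in data_chunks:
    model = load_model("model.h5")  # reload the model each iteration
    preds = model.predict(chunk)
    K.clear_session()               # drop the global Keras/TF graph state
    gc.collect()                    # force Python garbage collection
```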
Hi,
I am trying to call `model.predict()` on CPU (model trained on GPU) multiple times and I observe a RAM memory leak. `clear_session` with model reload doesn't solve the issue. `model.predict_on_batch()` fails to solve the issue as well. Is there a workaround for this issue? I am using TF 1.13 and Python 3.6. I have been struggling to solve this problem for so long. Kinda need help.

In that case I'm going to re-open this issue so that maybe the TF team can help you.
This issue is NOT resolved. I use predict_on_batch and still get an OOM error after some time of processing data.
Witnessing the same issue while running inference with TF.
Having a similar issue in the latest TF 2.4.1. Growing from about 60 GB (I'm using a large shuffle buffer) to 128 GB over the course of a few hours. Not sure if it's the same issue as originally mentioned here, since it seems to be more subtle and any number of causes could be behind the memory leak. I would file a new issue, but it's challenging to create a minimal reproducible example.
Hi. I have done what you asked and tested tf-nightly with my trivial code example from the original issue above, and it does indeed seem to fix the problem. I also tried it in a different application that I've been using the `predict_on_batch` workaround with, and there seem to be no memory leaks there either. So yes, the leak seems to have been dealt with. Thank you!

@MProx, about a month ago we added a fix for a memory leak and there's a possibility it has fixed this. Can you try `!pip install tf-nightly` and see if it resolves your issue?

`model.predict_on_batch` solved this problem for me too

@ialdencoots It sounds like you might be having a different OOM issue than the author of this issue and I were having. Incidentally, in my case, I am also loading my model from an .h5 file, and `predict_on_batch` fixed it. I would suggest trying to create a simpler version of your code that still has the problem to narrow down the cause, and post it as a separate issue.

@MProx, I would recommend leaving this issue open. Although we have a workaround, there is a defect in the TensorFlow version of Keras's `predict` function that should be fixed.

Ok, so I messed around a bit with my model and I've narrowed down the problem a little bit more. When I load my trained model from an .h5 file and run `predict_on_batch` repeatedly, I eventually get the OOM error. However, if I create a new model with the loaded model's inputs and outputs, I can `predict_on_batch` for my whole dataset without a problem. If I then run `compile` on the model with an optimizer and loss, I get the OOM error again. So it seems to be a problem only for compiled models.
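(A rough sketch of that observation, with `trained_model.h5` and `batches` as placeholder names: wrap the loaded model's inputs and outputs in a fresh, uncompiled `Model` and predict with that.)

```python
from tensorflow.keras.models import Model, load_model

loaded = load_model("trained_model.h5")
# Fresh, uncompiled wrapper around the same graph; reportedly avoids the OOM.
inference_model = Model(inputs=loaded.inputs, outputs=loaded.outputs)

for batch in batches:
    preds = inference_model.predict_on_batch(batch)

# Per the comment above, compiling the wrapper (e.g. with an optimizer and
# loss) reportedly reintroduces the leak.
```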