keras: Repeatedly calling model.predict(...) results in memory leak
System information
- Have I written custom code (as opposed to using example directory): Yes, minimal example attached
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 10.14.5
- TensorFlow backend (yes / no): yes
- TensorFlow version: 1.12.0
- Keras version: 2.2.4
- Python version: 3.6.8
- CUDA/cuDNN version: N/A
- GPU model and memory: N/A
Describe the current behavior
Loading a model once and then repeatedly calling model.predict(...) results in continually increasing memory usage.
Describe the expected behavior
Calling model.predict(...) should not result in any permanent increase in memory usage.
Code to reproduce the issue
import keras
import numpy as np

model = keras.applications.mobilenet_v2.MobileNetV2()
X = np.random.rand(1, 224, 224, 3)

while True:
    # Leaks:
    y = model.predict(X)[0]
    # Does not leak:
    # y = [0]
Other info / logs
I am running on a Mac Mini and using CPU only. When I run the code above, memory usage climbs steadily and will eventually consume 20+ GB of memory!
The issue does not appear to exist on my Ubuntu 18.04 laptop using a GTX 1050.
I’m open to the idea that I may be doing something wrong, but with such a small example it seems hard to believe.
Let me know what other info I can give to help.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 37
- Comments: 92 (6 by maintainers)
Commits related to this issue
- submission_df() | BUGFIX: Repeatedly calling model.predict(...) results in memory leak - https://github.com/keras-team/keras/issues/13118 — committed to JamesMcGuigan/kaggle-bengali-ai by JamesMcGuigan 4 years ago
- Reduce memory leaked by using model() instead of model.predict() In a long-running process invoking `nsfw_detector.predict.classify_nd()` often, we observed memory leakage consistent with what [keras... — committed to colindean/nsfw_model_macos by colindean a year ago
One way to avoid this: try calling the model directly, e.g. model(feature_list), instead of using model.predict(feature_list). And do not be happy too quickly, there is also a slow leakage in the model.fit() function. Geez, did the Keras team really do any testing before releasing it?
Same issue. My memory usage balloons while calling model.predict() in a loop, even with tf.keras.backend.clear_session() and gc.collect() at the end of the loop.
I really would like a fix to this.
@JivanRoquet Hello! You are right, the neural network in TensorFlow feeds on (batched) tensors rather than ndarrays. I use PyTorch now; it worked perfectly for my reinforcement learning project, no memory leakage at all. I suggest anyone who can select their platform go for PyTorch. It is stable, fast and easy to use.
I have the same issue. model.predict() causes a big memory leak. I am not sure how posting here helps; there is absolutely no response from the TF team, going by the numerous threads on this topic.
Isn’t model.predict() the core of this whole model-building business? Not sure why this issue has persisted since TF 2.0 and still goes on today without any real response or post from the TF team.
Sucks that so many people are wasting their time on this.
model.predict_on_batch fixed it for me.
I tested predict_on_batch and it made a big difference compared to the predict method, but it still has a memory leak.
In a related TensorFlow issue (https://github.com/tensorflow/tensorflow/issues/33009), someone noted that predict_on_batch doesn’t display the same issue. That workaround was effective for me, but not for another person in that thread.
This workaround does not work for me. I have to train a rolling window in an online fashion, so clear_session is not available to me and I cannot separate the memory for different iterations. I have been increasing memory on my compute node just to get some results, but this leak is just too big for large datasets. I’m using TensorFlow v2.5 on a CPU-only machine. @fchollet this is a bug that is not solved and has no catch-all workaround, please consider reopening.
I’m facing the same issue on Windows 10, and even though K.clear_session() does solve, or at least alleviate, the issue, it would be very nice to have a real fix instead of having to rely on workarounds.
@hotplot I mean:
For me, the following worked (for TensorFlow 2.3.1):

input_tensor = tf.convert_to_tensor(input_ndarray)
output_tensor = model(input_tensor)
output_array = output_tensor.numpy()

After I used the above lines instead of model.predict(), the memory leakage was resolved.
In tf 2.2.0 (conda-based) the problem still persists. K.clear_session(), gc.collect(), and even using model.predict_on_batch together did not solve the problem. However, I was able to use model(tf.convert_to_tensor(np_input)), which decreased memory leakage drastically. Maybe this will help.
For me, changing to a GPU with a bigger memory size just solved the issue (GTX 1650 4 GB -> P100 16 GB).
This still happens in tf 2.9.1. It stops leaking for me if I call tf.keras.backend.clear_session() followed by gc.collect().
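A minimal sketch of that pattern inside a prediction loop (the model, input, and loop here are placeholders, not from the original comment; note that other comments in this thread report that calling clear_session while reusing a model can be an error on older TF versions):

import gc
import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNetV2()
X = np.random.rand(1, 224, 224, 3)

for _ in range(1000):
    y = model.predict(X)
    # Clear Keras' global backend state and force Python garbage collection
    # so memory does not keep accumulating across iterations.
    tf.keras.backend.clear_session()
    gc.collect()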
This is the second bug within a couple weeks I have found in Keras that has been around for years and was closed by @fchollet with no explanation. Not sure what that’s all about.
Though tf.convert_to_tensor can avoid the continuous memory growth, an extremely large ndarray can still fail to convert_to_tensor. My solution is wrapping the array in a generator and then converting it to a Dataset object using tf.data.Dataset.from_generator. For example, see the sketch below.
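A minimal sketch of that generator-based approach (the array size, dtype, batch size, and model are illustrative assumptions; the original comment's example code was not preserved):

import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNetV2()
# Stand-in for an ndarray too large to pass to tf.convert_to_tensor in one piece.
big_array = np.random.rand(1000, 224, 224, 3).astype(np.float32)

def sample_generator():
    # Yield one sample at a time; the full array is never turned into a single tensor.
    # The array is captured by closure rather than passed via `args` (see the next comment).
    for row in big_array:
        yield row

dataset = tf.data.Dataset.from_generator(
    sample_generator,
    output_types=tf.float32,
    output_shapes=(224, 224, 3),
).batch(32)

predictions = model.predict(dataset)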
Using the ‘args’ argument of Dataset.from_generator should be avoided; it would also convert the array to a tensor and would cause GPU memory leaking when handling an extremely large array.
@moha23 exactly, the leak is still there, just smaller
I’m also still facing the same issue with TensorFlow 2.1.0.
@wangyexiang Yes, I have run the following:
And get the following output:
The exception is unsurprising. I am making repeated predictions using the existing model instance, so it is an error to call clear_session.
@Huii @Shane-Neeley @AverinLV
Hey all, so I had upgraded to 2.1.0, and also switched from the Anaconda version to the Pip version. This fixed my issue. My best guess then is that Anaconda’s version is where the issue is. Now, I have no idea why these versions are different…or where we would go to report this.
Thanks @cclaan , your solution works for me. Specifically, my environment ties to TF 2.0.0 temporarily so I can’t just upgrade to 2.1.0 for the fix. I hope there’s a patch for TF 2.0 to include this bugfix.
I have implemented a workaround based on predict_on_batch (thanks @novog for pointing this out). Note that predict_on_batch expects batches of the same size as those used for training the model. Here’s an example of how to loop through your batches of test data to create predictions; see the sketch below. (Note that if merely pre-allocating the results array already results in a MemoryError, then the array simply does not fit in your available memory, regardless of the memory leak issue.)
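Since the commenter's example code did not survive the copy, here is a minimal sketch of such a loop with a pre-allocated results array (the model, test data, batch size, and output width are stand-ins for illustration):

import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNetV2()                      # stand-in model
X_test = np.random.rand(100, 224, 224, 3).astype(np.float32)    # stand-in test data

batch_size = 32      # assumed to match the batch size used when training the model
num_classes = 1000   # MobileNetV2's default output width
num_samples = X_test.shape[0]

# Pre-allocate the results array up front; if this line alone raises MemoryError,
# the predictions simply do not fit in memory, regardless of the leak.
results = np.zeros((num_samples, num_classes), dtype=np.float32)

for start in range(0, num_samples, batch_size):
    end = min(start + batch_size, num_samples)
    results[start:end] = model.predict_on_batch(X_test[start:end])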
If you have a GPU: just use predict_on_batch() and batch your predictions (i.e. put all the images you need predictions for into one set). NO LOOP NEEDED.
On other systems: using predict_on_batch() could be dangerous in terms of available memory, so do the same procedure as above with predict while specifying the batch_size.
N.B.: predict() is not intended to be used in a loop; the only exception to this rule could be that the image you will predict on depends on earlier results.
For me, calling gc.collect() after each model.predict() worked.
Converting the NumPy object with tf.convert_to_tensor works for me.
More info: https://stackoverflow.com/questions/64199384/tf-keras-model-predict-results-in-memory-leak
I created a workaround that should work in all cases of the leak.
It’s a decorator that allows you to run any function as a separate script, seamlessly. When the script ends, the memory that was allocated in the function is freed entirely. It automatically generates the script and takes care of passing the arguments and returns, as long as they are pickleable or Keras models or lists of Keras models… (for documentation see: github)
You can install it with pip install scriptifier.
It should look like this:
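The scriptifier usage example itself did not survive the copy. As a rough stand-in for the same idea (run the prediction in a throwaway process so its memory is released when the process exits), here is a sketch using only the standard library's multiprocessing module; this is not the scriptifier API, and the model path and function names are made up:

import multiprocessing as mp
import numpy as np

def _predict_worker(model_path, X, queue):
    # Import TensorFlow inside the worker so all of its state lives and dies
    # with the child process.
    import tensorflow as tf
    model = tf.keras.models.load_model(model_path)
    queue.put(model.predict(X))

def predict_in_subprocess(model_path, X):
    queue = mp.Queue()
    proc = mp.Process(target=_predict_worker, args=(model_path, X, queue))
    proc.start()
    result = queue.get()   # fetch before join() to avoid blocking on a full queue
    proc.join()
    return result

if __name__ == "__main__":
    X = np.random.rand(1, 224, 224, 3).astype(np.float32)
    y = predict_in_subprocess("my_model.h5", X)   # "my_model.h5" is a made-up path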
I also have the same issue using LSTM on the CPU. It keeps leaking memory continuously…
I would like to use Keras_tuner to tune hyperparameters, but it’s impossible. I even have a minimal dataset, as it is about high-fidelity simulations.
I am going to try using the Docker image as suggested by @UntotaufUrlaub and will let him know. My last chance!
Uninstalling TF 2.3, installing TF-nightly 2.4 and reinstalling TF-2.3 seems to have fixed the issue 🤔.
None of the workarounds here seem to work for my network except K.clear_session() (this is how I used it). While using model.predict caused a sharp jump in RAM usage within a couple of minutes, model(inputs, training=False) has a much more gradual increase, but it increases nevertheless. TF-GPU 1.14. I think it could depend on the network architecture; I also sometimes get the topological sort error despite having no loops, which seems to happen with a larger number of filters or for some other unclear reason (issue #24816). So all these errors might be at play somehow.
@JivanRoquet it should work, I just tested. Each model itself is functional, so you can call it by just passing your input. But remember, if you want to do inference, use model(inputs, training=False) to disable all the dropout etc. I tried all the solutions above and this is the only way that doesn’t leak memory. I don’t know why, but just sharing my two cents of experience.
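A minimal sketch of the direct-call pattern described above (the model and input are placeholders, not from the original comment):

import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNetV2()
x = tf.convert_to_tensor(np.random.rand(1, 224, 224, 3).astype(np.float32))

# Call the model directly for inference; training=False disables dropout,
# batch-norm updates, etc.
y = model(x, training=False)
y = y.numpy()   # convert back to a NumPy array if needed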
@xiahualiu What do you mean exactly? How are you supposed to do that?
I tried:
The second option triggers an error.
In my case, data is a NumPy array resembling this (truncated):
Should the NumPy array be directly converted to a tensor?
Leak is still here despite doing exactly that. And it’s as massive as before.
Same with 2.1.0, 2.0.1, 2.0
For me on TF 2.0.0:
- tf.keras.backend.clear_session() after predict helped the memory leak but didn’t fix it completely.
- predict_on_batch instead of predict fixed the memory leak, but really slowed down my predictions. I can’t use this.
- I haven’t tried include_optimizer=False in model.save. What does this accomplish?
Apparently, a similar issue has been solved in TensorFlow 2.1.0 (dev) according to this issue: https://github.com/tensorflow/tensorflow/issues/34579
I will wait for the final version of TensorFlow 2.1.0 and see whether the bug still persists in Keras.