tensorflow: Memory leak on TF 2.0 with model.predict and/or model.fit with Keras
System information
- OS Platform: macOS 10.14.6 (18G103), Kernel Version: Darwin 18.7.0
- TensorFlow installed from: binary, using pip install tensorflow
- Python version: 3.7.3 (python -V)
- GPU model and memory: No GPU
- TensorFlow version: 2.0.0 (python -c "import tensorflow as tf; print(tf.version.VERSION)")
Describe the current behavior
With the TensorFlow 1.14 or Theano backends this code works fine. After upgrading to TensorFlow 2.0.0 it stops working: memory usage keeps increasing and the program never finishes.
Describe the expected behavior
Using Theano I get 28 seconds per iteration. Using TensorFlow 2.0.0 I expect the same behavior (or better).
Code to reproduce the issue
import gym
import numpy as np
import matplotlib.pylab as plt
import tensorflow as tf
from tensorflow.keras import layers

env = gym.make('NChain-v0')

def q_learning_keras(env, num_episodes=1000):
    # create the keras model
    model = tf.keras.Sequential()
    model.add(layers.InputLayer(batch_input_shape=(1, 5)))
    model.add(layers.Dense(10, activation='sigmoid'))
    model.add(layers.Dense(2, activation='linear'))
    model.compile(loss='mse', optimizer='adam', metrics=['mae'])
    # now execute the q learning
    y = 0.95
    eps = 0.5
    decay_factor = 0.999
    r_avg_list = []
    for i in range(num_episodes):
        s = env.reset()
        eps *= decay_factor
        if i % 100 == 0:
            print("Episode {} of {}".format(i + 1, num_episodes))
        done = False
        r_sum = 0
        while not done:
            if np.random.random() < eps:
                a = np.random.randint(0, 2)
            else:
                a = np.argmax(model.predict(np.identity(5)[s:s + 1]))
            new_s, r, done, _ = env.step(a)
            target = r + y * np.max(model.predict(np.identity(5)[new_s:new_s + 1]))
            target_vec = model.predict(np.identity(5)[s:s + 1])[0]
            target_vec[a] = target
            model.fit(np.identity(5)[s:s + 1], target_vec.reshape(-1, 2), epochs=1, verbose=0)
            s = new_s
            r_sum += r
        r_avg_list.append(r_sum / 1000)
    plt.plot(r_avg_list)
    plt.ylabel('Average reward per game')
    plt.xlabel('Number of games')
    plt.show()
    for i in range(5):
        print("State {} - action {}".format(i, model.predict(np.identity(5)[i:i + 1])))

if __name__ == "__main__":
    q_learning_keras(env)
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 29
- Comments: 92 (11 by maintainers)
Commits related to this issue
- v0.9.1: tensorflow model.predict() memleak fix/workaround https://github.com/tensorflow/tensorflow/issues/33030 — committed to gereoffy/deepspam1 by arpitest 2 years ago
Hi, I am still facing the same issue. I have tested on TensorFlow 2.0, 2.2 and 2.3, with Windows (python 3.7) and Ubuntu (python 3.8) and there is a memory leak using model.predict().
I have built an API and my model is on the server side. Every time I call my prediction function with a simple request, the instance's memory usage grows.
Using tf.keras.backend.clear_session() didn’t solve the issue.
I have that same problem
Same problem here as well. No issues with 1.14 but suddenly appeared when I installed 2.0.
I am using tensorflow-gpu==2.0.0. I spent a full day checking my code for memory leaks and eventually spotted the leak in model.predict(...). By the way, I used memory_profiler for profiling my code.
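For context, here is a minimal sketch of how memory_profiler can expose this kind of growth; the model, function name, and loop are illustrative placeholders, not the commenter's actual code.

from memory_profiler import profile
import numpy as np
import tensorflow as tf

# Small stand-in model, roughly matching the reproduction code above.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='sigmoid', input_shape=(5,)),
    tf.keras.layers.Dense(2, activation='linear'),
])
model.compile(loss='mse', optimizer='adam')

@profile  # memory_profiler prints per-line memory usage for this function
def run_predictions(n=1000):
    x = np.identity(5)[0:1]
    for _ in range(n):
        model.predict(x)  # the reported leak: memory grows on every call

if __name__ == "__main__":
    run_predictions()

Running the script normally prints a line-by-line memory report, in which steadily growing usage around the model.predict() line stands out.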
Thank God! I am not alone with this issue.
Unfortunately, it is not fully fixed for me in 2.1.0-rc0. The memory usage for the code posted by @ipsec has greatly improved, but it runs almost 10x slower with 2.1.0-rc0 than with 1.14.0. @rchao, should I post more details here or open a new issue?
For me the memory leak is not fixed in TensorFlow 2.1.0 on Windows 10 with Python 3.7.6 (64-bit). Using tf.keras.backend.clear_session() fixes the memory leak as a workaround.
This issue is NOT closed!
It didn’t happen in TensorFlow 2.0 alpha, but it does in 2.0:
!pip install tensorflow-gpu==2.0.0: memory leak
!pip install tensorflow-gpu==2.0.0-alpha: everything’s fine
I can confirm that this issue persists with model.train_on_batch. The situation is more severe without a GPU. I am using the latest version 2.3.1 with git version 2.3.0-54-gfcc4b966f1, and tf.keras.backend.clear_session() doesn’t work for me. One workaround is to write everything as a custom training loop (see the sketch below), but this is not desired.
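For reference, a minimal sketch of the custom-training-loop workaround mentioned above; the model, optimizer, loss, and dummy data are illustrative assumptions, not the commenter's code.

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='sigmoid', input_shape=(5,)),
    tf.keras.layers.Dense(2),
])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()

@tf.function  # traced once, so repeated calls should not keep growing the graph
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = loss_fn(y, pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

x = np.identity(5, dtype=np.float32)[0:1]
y = np.zeros((1, 2), dtype=np.float32)
for _ in range(1000):
    train_step(x, y)

This replaces model.fit / train_on_batch with a single traced tf.function, which is the approach several commenters in this thread fall back to.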
Thanks for checking in! We’re still verifying that the fix solves the issue and should have updates soon.
train_on_batch still has a memory leak… del and gc.collect() do not work.
Hi dogacbasaran,
I can confirm that even calling clear_session() on the Keras backend did not solve the memory leak for me either. Only restarting the Python kernel worked.
Currently the memory leak in TF 2.0 and TF 2.1 (I run it under Ubuntu 18.04) limits how many epochs I can train with the Keras fit() method.
I hope the memory leak will be fixed in the next TF version.
With TF 1.x + Keras as a separate library I never had such memory leak problems.
I’m using tensorflow-gpu 2.1.0 and the problem is still not solved. Has anybody faced the same issue?
Hello! You said that TF 2.2.0 with tf.keras.backend.clear_session() solved your problem. How exactly did you solve it? Do you call tf.keras.backend.clear_session() once after importing TensorFlow, or at every single step? Would you mind providing a code example or more information?
I can’t get a workaround to work because I’m using combined models for GANs. This is very annoying.
@arvoelke I think this issue hasn’t been solved yet; the problem with tf.data.Dataset.from_generator was mentioned in issue #37653.
Unfortunately, yes. There’s definitely a memory leak (at least for combined models) in these functions. I’m assuming it isn’t being addressed because everyone using complicated/combined models is using custom training loops.
@emuccino could you elaborate on your solution or provide example code? Conversion to a dataset with tf.data.Dataset.from_generator() led to poor performance (although it does fix the memory leak as well).
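Not @emuccino's actual code, but a rough sketch of the kind of from_generator conversion being discussed; the generator, shapes, and batch size are placeholders.

import numpy as np
import tensorflow as tf

def gen():
    # Placeholder generator yielding (features, targets) pairs.
    while True:
        x = np.random.rand(5).astype(np.float32)
        y = np.random.rand(2).astype(np.float32)
        yield x, y

dataset = tf.data.Dataset.from_generator(
    gen,
    output_types=(tf.float32, tf.float32),
    output_shapes=((5,), (2,)),
).batch(32)

# model.fit(dataset, steps_per_epoch=100, epochs=10)

As noted in the comment above, this avoids the leak but trades it for the per-element overhead of from_generator.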
@gdudziuk I still use fit_generator because I need to use a generator for validation as well but Model.fit does not allow it. It expects a validation set. It is also not possible to input a generator for training data and use a part of it as validation.
The memory leak occurs for me on TF 2.1.0 on Ubuntu 18.04 with Python 3.6.10, in fit_generator. I’m using generators for both training and validation, and it seems that the validation generator produces many more batches than needed. It starts filling up the cache memory until it crashes. So far none of the workarounds work for me, including clear_session(). I didn’t have any problems using generators in TF 1.13.1 (no leak), but I need TensorFlow Addons, which works only for TF > 2.0. I implemented a Sequence but the leak still continues. Does anyone still have this issue or any more workarounds? I’m thinking about using train_on_batch instead of generators.
I have to unsubscribe from this thread. Final suggestion: maybe it’s time to switch.
I’m getting heavy memory leaking after upgrading to Ubuntu 22 + Python 3.10 + TF 2.11, using only tf.keras model.predict() calls (many times per second).
Calling tf.keras.backend.clear_session() after every model.predict() fixes it for me (sketched below), but I don’t understand why it is required; older versions work fine without this…
I also tried sess = get_session() and then with sess.as_default(): res = model.predict(input), but that didn’t fix the leaking now. (This was the memleak fix/workaround for early TF 2.x versions with Keras.)
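A minimal sketch of the clear_session() workaround described in this comment; the model and input are placeholders, and the per-call cost of clearing cached graph state can be significant.

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(2, input_shape=(5,))])
x = np.identity(5, dtype=np.float32)[0:1]

for _ in range(1000):
    res = model.predict(x)
    # Reported by several commenters to stop the memory growth,
    # at the cost of resetting cached Keras/TF graph state every call.
    tf.keras.backend.clear_session()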
I hit the same issue, but I found that using model(input) rather than model.predict(input) is a workaround and produces correct results (sketched below). I verified this by converting the model to TensorRT and comparing results.
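A quick sketch of that __call__ workaround; the model and input below are placeholders.

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(2, input_shape=(5,))])
x = np.identity(5, dtype=np.float32)[0:1]

# model.predict() goes through Keras' batched prediction machinery, which
# is where the leak is reported; calling the model directly runs a plain
# forward pass and returns a tf.Tensor.
out = model(x, training=False).numpy()

For single-sample calls like the ones discussed in this thread, model(x) is also typically faster than model.predict(x).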
My workaround for the memory leak in dataset.map is not to use the tf.data API at all and to stick with Keras Sequence data generators. And a workaround for the leak in model.predict() is to call model.predict_on_batch(), which does not leak (see the sketch below). Will the fixes be added to the current release any time soon?
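As an illustration of those two workarounds, a hedged sketch; the Sequence class, arrays, and model are placeholders rather than the commenter's code.

import numpy as np
import tensorflow as tf

class ArraySequence(tf.keras.utils.Sequence):
    # Simple Keras Sequence over in-memory arrays, used instead of tf.data.
    def __init__(self, x, y, batch_size=32):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[sl], self.y[sl]

x = np.random.rand(256, 5).astype(np.float32)
y = np.random.rand(256, 2).astype(np.float32)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='sigmoid', input_shape=(5,)),
    tf.keras.layers.Dense(2),
])
model.compile(loss='mse', optimizer='adam')
model.fit(ArraySequence(x, y), epochs=1, verbose=0)

# predict_on_batch runs a single in-memory batch and is reported above
# not to leak, unlike model.predict in the affected versions.
out = model.predict_on_batch(x[:32])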
Hi, there were a couple of recent fixes related to this:
https://github.com/tensorflow/tensorflow/commit/082415b2ff49bfb8890f7d5361585bac04749add
https://github.com/tensorflow/tensorflow/commit/c2fc448fe253bc59d3f0417d7d08e16d53f2a856
I reported the same issue, with an even simpler test case on 2.0.0, here:
https://github.com/tensorflow/tensorflow/issues/32500