tensorflow: Memory leak on TF 2.0 with model.predict and/or model.fit with Keras

System information

  • OS Platform: macOS 10.14.6 (18G103), Kernel Version: Darwin 18.7.0

  • TensorFlow installed from binary using pip install tensorflow

  • Python version:

python -V
Python 3.7.3
  • GPU model and memory: No GPU

  • TensorFlow version

python -c "import tensorflow as tf; print(tf.version.VERSION)"                                                                                                                                               
2.0.0

Describe the current behavior With the TensorFlow 1.14 or Theano backends this code works fine. After upgrading to TensorFlow 2.0.0 it stops working: memory usage keeps growing and the program never finishes.

Describe the expected behavior Using Theano I get 28 seconds per iteration. Using TensorFlow 2.0.0 I expect the same behavior (or better).

Code to reproduce the issue

import gym
import numpy as np
import matplotlib.pylab as plt

import tensorflow as tf
from tensorflow.keras import layers

env = gym.make('NChain-v0')


def q_learning_keras(env, num_episodes=1000):
    # create the keras model
    model = tf.keras.Sequential()
    model.add(layers.InputLayer(batch_input_shape=(1, 5)))
    model.add(layers.Dense(10, activation='sigmoid'))
    model.add(layers.Dense(2, activation='linear'))
    model.compile(loss='mse', optimizer='adam', metrics=['mae'])
    # now execute the q learning
    y = 0.95  # discount factor
    eps = 0.5  # initial exploration rate
    decay_factor = 0.999  # per-episode decay of the exploration rate
    r_avg_list = []
    for i in range(num_episodes):
        s = env.reset()
        eps *= decay_factor
        if i % 100 == 0:
            print("Episode {} of {}".format(i + 1, num_episodes))
        done = False
        r_sum = 0
        while not done:
            # epsilon-greedy action selection over the one-hot encoded state
            if np.random.random() < eps:
                a = np.random.randint(0, 2)
            else:
                a = np.argmax(model.predict(np.identity(5)[s:s + 1]))
            new_s, r, done, _ = env.step(a)
            # Q-learning target: immediate reward plus discounted max future Q
            target = r + y * np.max(model.predict(np.identity(5)[new_s:new_s + 1]))
            target_vec = model.predict(np.identity(5)[s:s + 1])[0]
            target_vec[a] = target
            # single gradient step towards the updated target for the taken action
            model.fit(np.identity(5)[s:s + 1], target_vec.reshape(-1, 2), epochs=1, verbose=0)
            s = new_s
            r_sum += r
        r_avg_list.append(r_sum / 1000)
    plt.plot(r_avg_list)
    plt.ylabel('Average reward per game')
    plt.xlabel('Number of games')
    plt.show()
    for i in range(5):
        print("State {} - action {}".format(i, model.predict(np.identity(5)[i:i + 1])))


if __name__ == "__main__":
    q_learning_keras(env)
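A minimal way to watch the growth while the repro runs (not part of the original report; psutil is an assumed dependency and log_rss is a hypothetical helper):

import os

import psutil

_process = psutil.Process(os.getpid())

def log_rss(tag):
    # Resident set size in MiB; under TF 2.0 this climbs steadily when
    # model.predict()/model.fit() are called in a tight loop.
    print("{}: {:.1f} MiB".format(tag, _process.memory_info().rss / 2**20))

Calling log_rss(str(i)) once per episode inside q_learning_keras makes the leak visible within a few hundred episodes.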

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 29
  • Comments: 92 (11 by maintainers)

Most upvoted comments

Hi, I am still facing the same issue. I have tested on TensorFlow 2.0, 2.2 and 2.3, with Windows (python 3.7) and Ubuntu (python 3.8) and there is a memory leak using model.predict().

I have built an API and my model is on the server side. Every time I call my prediction function with a simple request, the instance's memory usage grows.

Using tf.keras.backend.clear_session() didn’t solve the issue.

I have that same problem

Same problem here as well. No issues with 1.14, but it suddenly appeared when I installed 2.0.

I am using tensorflow-gpu==2.0.0. I spent a full day checking my code for memory leaks and eventually spotted the leak in model.predict(...).

By the way, I used memory_profiler for profiling my code.
Thank God! I am not alone with this issue.
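For reference, a sketch of the memory_profiler approach mentioned above (predict_step is a hypothetical wrapper, not from the comment):

from memory_profiler import profile

@profile
def predict_step(model, x):
    # Run with `python -m memory_profiler script.py`; the per-line
    # "Increment" column shows which call accumulates memory.
    return model.predict(x)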

Unfortunately, it is not fully fixed for me in 2.1.0-rc0. The memory usage for the code posted by @ipsec has greatly improved but it works almost 10x slower with 2.1.0-rc0 than with 1.14.0. @rchao, should I post more details here or open a new issue?

For me the memory leak is not fixed in tensorflow 2.1.0 (Windows 10, Python 3.7.6, 64-bit). Calling tf.keras.backend.clear_session() works as a workaround, though.

This issue is NOT closed!

It didn’t happen in tensorflow 2.0 alpha, but it does in the 2.0 release.

!pip install tensorflow-gpu==2.0.0 : got memory leak
!pip install tensorflow-gpu==2.0.0-alpha : everything’s fine

I can confirm that this issue persists with model.train_on_batch. The situation is more severe without a GPU. I am using the latest version 2.3.1 with git version 2.3.0-54-gfcc4b966f1, and tf.keras.backend.clear_session() doesn't work for me. One workaround is to write everything using a custom training loop, but this is not desired.
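For readers unfamiliar with that workaround, a minimal sketch of a custom training loop (assuming the same MSE loss and Adam optimizer as the repro script; the names here are illustrative, not a confirmed fix):

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()

@tf.function  # traced once; repeated calls reuse the same graph
def train_step(model, x, y):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = loss_fn(y, pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss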

Thanks for checking in! We’re still verifying the fix solves the issue and should have updates soon.

train_on_batch still has a memory leak… del and gc.collect() do not work

Hi dogacbasaran,

So far, none of the workarounds, including clear_session(), work for me.

I can confirm that even calling clear_session() on the Keras backend did not solve the memory leak for me either. Only restarting the Python kernel worked for me.

Currently the memory leak in TF 2.0 and TF 2.1 (I run it under Ubuntu 18.04) limits how many epochs I can train with the Keras backend fit() method.

I hope the memory leak will be fixed in the next TF version.

With TF 1.x + Keras as a separate library I never had such memory leak problems.

I’m using tensorflow-gpu 2.1.0 and the problem is still not solved. Has anybody faced the same issue?

For me the memory leak is not fixed in tensorflow 2.1.0 (Windows 10, Python 3.7.6, 64-bit). Calling tf.keras.backend.clear_session() works as a workaround, though.

Thank you so much! That totally solved my problem in tf 2.2.0!

Hello! You said that using tf.keras.backend.clear_session() solved your problem on tf 2.2.0. How did you solve it? Do you call tf.keras.backend.clear_session() once after importing TensorFlow, or at every single step? Would you mind providing some code example or information?
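Judging from the replies in this thread, the reported pattern is to call clear_session() after each predict/fit call, not once after import. A hedged sketch (whether the model object stays usable afterwards depends on the TF version; commenters here report it does in eager mode):

import tensorflow as tf

def predict_without_leak(model, x):
    result = model.predict(x)
    tf.keras.backend.clear_session()  # drop graph state accumulated by predict
    return result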

I can’t get a workaround to work because I’m using combined models for GANs. This is very annoying.

@arvoelke I think this issue wasn’t solved yet; the problem with tf.data.Dataset.from_generator was mentioned in issue #37653

Unfortunately, yes. There’s definitely a memory leak (at least for combined models) in these functions. I’m assuming it isn’t being addressed because everyone using complicated/combined models is using custom training loops.

@emuccino could you elaborate on your solution/provide example code? Conversion to a dataset with tf.data.Dataset.from_generator() led to poor performance (although the memory leak is fixed then as well).
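For reference, a minimal sketch of the from_generator conversion being discussed; sample_gen and the shapes are hypothetical placeholders matching the repro model's 5-input/2-output layout:

import numpy as np
import tensorflow as tf

def sample_gen():
    # hypothetical stand-in for the user's data generator
    for _ in range(1000):
        yield np.random.rand(5).astype("float32"), np.random.rand(2).astype("float32")

dataset = tf.data.Dataset.from_generator(
    sample_gen,
    output_types=(tf.float32, tf.float32),
    output_shapes=((5,), (2,)),
).batch(32)
# model.fit(dataset, epochs=10)  # replaces fit_generator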

@gdudziuk I still use fit_generator because I need to use a generator for validation as well but Model.fit does not allow it. It expects a validation set. It is also not possible to input a generator for training data and use a part of it as validation.

The memory leak occurs for me on TF 2.1.0 on Ubuntu 18.04 with Python 3.6.10. The memory leak occurs in fit_generator. I’m using generators for both training and validation; it seems that the validation generator generates many more batches than needed. It starts filling up the cache memory until it crashes. So far, none of the workarounds, including clear_session(), work for me. I didn’t have any problems using generators in TF 1.13.1 (no leak), but I need to use TensorFlow Addons, which works only for TF > 2.0. I implemented a Sequence but the leak still continues. Does anyone still have this issue or any more workarounds? I am thinking about using train_on_batch instead of generators.
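For context, a minimal sketch of such a Sequence (the arrays are hypothetical placeholders; this shows the structure tf.keras expects, not a verified fix for the leak):

import math

import tensorflow as tf

class BatchSequence(tf.keras.utils.Sequence):
    def __init__(self, x, y, batch_size=32):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # number of batches per epoch
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        return self.x[lo:lo + self.batch_size], self.y[lo:lo + self.batch_size]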

I have to unsubscribe from this thread. Final suggestion: maybe it’s time to switch.

I’m getting heavy memory leaking after upgrading to Ubuntu 22 + Python 3.10 + TF 2.11, using only tf.keras model.predict() calls (many times per second).

Calling tf.keras.backend.clear_session() after every model.predict() fixes it for me, but I don’t understand why it is required; older versions work fine without this…

I also tried sess = get_session() and then ‘with sess.as_default(): res = model.predict(input)’, but it didn’t fix the leaking now. (This was the memleak fix/workaround for early TF 2.x versions with Keras.)

I hit the same issue, but I found that using model(input) rather than model.predict(input) is a workaround and produces correct results. I verified this by converting the model to TensorRT and comparing results.
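A sketch of that direct-call workaround, assuming model is the compiled Sequential model from the repro script above:

import numpy as np
import tensorflow as tf

x = np.identity(5)[0:1].astype("float32")

# q = model.predict(x)                      # leaks per this thread
q = model(tf.convert_to_tensor(x)).numpy()  # direct call returns a tf.Tensor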

My workaround for the memory leak in dataset.map is not to use the tf.data API; stick with Keras Sequence data generators. And a workaround for the leak in model.predict() is to call model.predict_on_batch(), which does not have a leak.
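And the predict_on_batch() variant, under the same assumption about model:

q = model.predict_on_batch(np.identity(5)[0:1])  # one batch, one call, no per-call setup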

Will the fixes be added to the current release any time soon?

I reported the same issue here with an even simpler test case on 2.0.0

https://github.com/tensorflow/tensorflow/issues/32500