tensorflow: Memory leak on TF 2.0 with model.predict and/or model.fit with Keras
System information
- OS Platform: macOS 10.14.6 (18G103), Kernel Version: Darwin 18.7.0
- TensorFlow installed from: binary, using pip install tensorflow
- Python version: 3.7.3 (python -V)
- GPU model and memory: No GPU
- TensorFlow version: 2.0.0 (python -c "import tensorflow as tf; print(tf.version.VERSION)")
Describe the current behavior
With the TensorFlow 1.14 or Theano backends this code works fine. After upgrading to TensorFlow 2.0.0 it stops working: memory usage keeps increasing and the program never finishes.
Describe the expected behavior
Using Theano I get 28 seconds per iteration. Using TensorFlow 2.0.0 I expect the same behavior (or better).
Code to reproduce the issue
import gym
import numpy as np
import matplotlib.pylab as plt
import tensorflow as tf
from tensorflow.keras import layers

env = gym.make('NChain-v0')

def q_learning_keras(env, num_episodes=1000):
    # create the keras model
    model = tf.keras.Sequential()
    model.add(layers.InputLayer(batch_input_shape=(1, 5)))
    model.add(layers.Dense(10, activation='sigmoid'))
    model.add(layers.Dense(2, activation='linear'))
    model.compile(loss='mse', optimizer='adam', metrics=['mae'])
    # now execute the q learning
    y = 0.95
    eps = 0.5
    decay_factor = 0.999
    r_avg_list = []
    for i in range(num_episodes):
        s = env.reset()
        eps *= decay_factor
        if i % 100 == 0:
            print("Episode {} of {}".format(i + 1, num_episodes))
        done = False
        r_sum = 0
        while not done:
            if np.random.random() < eps:
                a = np.random.randint(0, 2)
            else:
                a = np.argmax(model.predict(np.identity(5)[s:s + 1]))
            new_s, r, done, _ = env.step(a)
            target = r + y * np.max(model.predict(np.identity(5)[new_s:new_s + 1]))
            target_vec = model.predict(np.identity(5)[s:s + 1])[0]
            target_vec[a] = target
            model.fit(np.identity(5)[s:s + 1], target_vec.reshape(-1, 2), epochs=1, verbose=0)
            s = new_s
            r_sum += r
        r_avg_list.append(r_sum / 1000)
    plt.plot(r_avg_list)
    plt.ylabel('Average reward per game')
    plt.xlabel('Number of games')
    plt.show()
    for i in range(5):
        print("State {} - action {}".format(i, model.predict(np.identity(5)[i:i + 1])))

if __name__ == "__main__":
    q_learning_keras(env)
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 29
- Comments: 92 (11 by maintainers)
Commits related to this issue
- v0.9.1: tensorflow model.predict() memleak fix/workaround https://github.com/tensorflow/tensorflow/issues/33030 — committed to gereoffy/deepspam1 by arpitest 2 years ago
Hi, I am still facing the same issue. I have tested on TensorFlow 2.0, 2.2 and 2.3, with Windows (python 3.7) and Ubuntu (python 3.8) and there is a memory leak using model.predict().
I have built an API and my model is on the server side. Every time I call my prediction function with a simple request, the instance's memory usage grows.
Using tf.keras.backend.clear_session() didn’t solve the issue.
I have that same problem
Same problem here as well. No issues with 1.14 but suddenly appeared when I installed 2.0.
I am using tensorflow-gpu==2.0.0. I spent a full day checking my code for memory leaks and eventually spotted the leak in model.predict(...). By the way, I used memory_profiler for profiling my code.
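For context, here is a minimal sketch of how memory_profiler can expose this kind of growth; the model, function name, and loop are illustrative placeholders, not the commenter's actual code.

from memory_profiler import profile
import numpy as np
import tensorflow as tf

# Small stand-in model, roughly matching the reproduction code above.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='sigmoid', input_shape=(5,)),
    tf.keras.layers.Dense(2, activation='linear'),
])
model.compile(loss='mse', optimizer='adam')

@profile  # memory_profiler prints per-line memory usage for this function
def run_predictions(n=1000):
    x = np.identity(5)[0:1]
    for _ in range(n):
        model.predict(x)  # the reported leak: memory grows on every call

if __name__ == "__main__":
    run_predictions()

Running the script normally prints a line-by-line memory report, in which steadily growing usage around the model.predict() line stands out.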
Thank God! I am not alone with this issue.
Unfortunately, it is not fully fixed for me in 2.1.0-rc0. The memory usage for the code posted by @ipsec has greatly improved, but it runs almost 10x slower with 2.1.0-rc0 than with 1.14.0. @rchao, should I post more details here or open a new issue?
For me the memory leak is not fixed in TensorFlow 2.1.0 on Windows 10 with Python 3.7.6 (64-bit). Using tf.keras.backend.clear_session() fixes the memory leak as a workaround.
This issue is NOT closed!
It didn’t happen in TensorFlow 2.0 alpha, but it does in 2.0:
!pip install tensorflow-gpu==2.0.0: memory leak
!pip install tensorflow-gpu==2.0.0-alpha: everything’s fine
I can confirm that this issue persists with model.train_on_batch. The situation is more severe without a GPU. I am using the latest version 2.3.1 with git version 2.3.0-54-gfcc4b966f1, and tf.keras.backend.clear_session() doesn’t work for me. One workaround is to write everything as a custom training loop (see the sketch below), but this is not desired.
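For reference, a minimal sketch of the custom-training-loop workaround mentioned above; the model, optimizer, loss, and dummy data are illustrative assumptions, not the commenter's code.

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='sigmoid', input_shape=(5,)),
    tf.keras.layers.Dense(2),
])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()

@tf.function  # traced once, so repeated calls should not keep growing the graph
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = loss_fn(y, pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

x = np.identity(5, dtype=np.float32)[0:1]
y = np.zeros((1, 2), dtype=np.float32)
for _ in range(1000):
    train_step(x, y)

This replaces model.fit / train_on_batch with a single traced tf.function, which is the approach several commenters in this thread fall back to.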
Thanks for checking in! We’re still verifying that the fix solves the issue and should have updates soon.
train_on_batch still has a memory leak… del and gc.collect() do not work.
Hi dogacbasaran,
I can confirm that even calling clear_session() on the Keras backend did not solve the memory leak for me either. Only restarting the Python kernel worked.
Currently the memory leak in TF 2.0 and TF 2.1 (I run it under Ubuntu 18.04) limits how many epochs I can train with the Keras fit() method.
I hope the memory leak will be fixed in the next TF version.
With TF 1.x + Keras as a separate library I never had such memory leak problems.
I’m using tensorflow-gpu 2.1.0 and the problem is still not solved. Has anybody faced the same issue?
Hello! You said that TF 2.2.0 with tf.keras.backend.clear_session() solved your problem. How exactly did you solve it? Do you call tf.keras.backend.clear_session() once after importing TensorFlow, or at every single step? Would you mind providing a code example or more information?
I can’t get a workaround to work because I’m using combined models for GANs. This is very annoying.
@arvoelke I think this issue hasn’t been solved yet; the problem with tf.data.Dataset.from_generator was mentioned in issue #37653.
Unfortunately, yes. There’s definitely a memory leak (at least for combined models) in these functions. I’m assuming it isn’t being addressed because everyone using complicated/combined models is using custom training loops.
@emuccino could you elaborate on your solution or provide example code? Conversion to a dataset with tf.data.Dataset.from_generator() led to poor performance (although it does fix the memory leak as well).
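Not @emuccino's actual code, but a rough sketch of the kind of from_generator conversion being discussed; the generator, shapes, and batch size are placeholders.

import numpy as np
import tensorflow as tf

def gen():
    # Placeholder generator yielding (features, targets) pairs.
    while True:
        x = np.random.rand(5).astype(np.float32)
        y = np.random.rand(2).astype(np.float32)
        yield x, y

dataset = tf.data.Dataset.from_generator(
    gen,
    output_types=(tf.float32, tf.float32),
    output_shapes=((5,), (2,)),
).batch(32)

# model.fit(dataset, steps_per_epoch=100, epochs=10)

As noted in the comment above, this avoids the leak but trades it for the per-element overhead of from_generator.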
@gdudziuk I still use fit_generator because I need to use a generator for validation as well but Model.fit does not allow it. It expects a validation set. It is also not possible to input a generator for training data and use a part of it as validation.
The memory leak occurs for me on TF 2.1.0 on Ubuntu 18.04 with Python 3.6.10, in fit_generator. I’m using generators for both training and validation, and it seems that the validation generator produces many more batches than needed. It starts filling up the cache memory until it crashes. So far none of the workarounds work for me, including clear_session(). I didn’t have any problems using generators in TF 1.13.1 (no leak), but I need TensorFlow Addons, which works only for TF > 2.0. I implemented a Sequence but the leak still continues. Does anyone still have this issue or any more workarounds? I’m thinking about using train_on_batch instead of generators.
I have to unsubscribe from this thread. Final suggestion: maybe it’s time to switch.
I’m getting heavy memory leaking after upgrading to Ubuntu 22 + Python 3.10 + TF 2.11, using only tf.keras model.predict() calls (many times per second).
Calling tf.keras.backend.clear_session() after every model.predict() fixes it for me (sketched below), but I don’t understand why it is required; older versions work fine without this…
I also tried sess = get_session() and then with sess.as_default(): res = model.predict(input), but that didn’t fix the leaking now. (This was the memleak fix/workaround for early TF 2.x versions with Keras.)
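A minimal sketch of the clear_session() workaround described in this comment; the model and input are placeholders, and the per-call cost of clearing cached graph state can be significant.

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(2, input_shape=(5,))])
x = np.identity(5, dtype=np.float32)[0:1]

for _ in range(1000):
    res = model.predict(x)
    # Reported by several commenters to stop the memory growth,
    # at the cost of resetting cached Keras/TF graph state every call.
    tf.keras.backend.clear_session()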
I hit the same issue, but I found that using model(input) rather than model.predict(input) is a workaround and produces correct results (sketched below). I verified this by converting the model to TensorRT and comparing results.
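A quick sketch of that __call__ workaround; the model and input below are placeholders.

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(2, input_shape=(5,))])
x = np.identity(5, dtype=np.float32)[0:1]

# model.predict() goes through Keras' batched prediction machinery, which
# is where the leak is reported; calling the model directly runs a plain
# forward pass and returns a tf.Tensor.
out = model(x, training=False).numpy()

For single-sample calls like the ones discussed in this thread, model(x) is also typically faster than model.predict(x).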
My workaround for the memory leak in dataset.map is not to use the tf.data API at all and to stick with Keras Sequence data generators. And a workaround for the leak in model.predict() is to call model.predict_on_batch(), which does not leak (see the sketch below). Will the fixes be added to the current release any time soon?
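As an illustration of those two workarounds, a hedged sketch; the Sequence class, arrays, and model are placeholders rather than the commenter's code.

import numpy as np
import tensorflow as tf

class ArraySequence(tf.keras.utils.Sequence):
    # Simple Keras Sequence over in-memory arrays, used instead of tf.data.
    def __init__(self, x, y, batch_size=32):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[sl], self.y[sl]

x = np.random.rand(256, 5).astype(np.float32)
y = np.random.rand(256, 2).astype(np.float32)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='sigmoid', input_shape=(5,)),
    tf.keras.layers.Dense(2),
])
model.compile(loss='mse', optimizer='adam')
model.fit(ArraySequence(x, y), epochs=1, verbose=0)

# predict_on_batch runs a single in-memory batch and is reported above
# not to leak, unlike model.predict in the affected versions.
out = model.predict_on_batch(x[:32])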
Hi, there were a couple of recent fixes related to this:
https://github.com/tensorflow/tensorflow/commit/082415b2ff49bfb8890f7d5361585bac04749add
https://github.com/tensorflow/tensorflow/commit/c2fc448fe253bc59d3f0417d7d08e16d53f2a856
I reported the same issue, with an even simpler test case on 2.0.0, here:
https://github.com/tensorflow/tensorflow/issues/32500