keras: memory leak when using tensorflow

Hello.

When using the TensorFlow backend, all ops are entered into the global tf graph. This results in memory leaks and increasingly long compilation times when building several models, one after the other, in the same Python process (think IPython, cross-validation, etc.)

For now, I solve this on my end by doing the following:

import keras.backend.tensorflow_backend
if keras.backend.tensorflow_backend._SESSION:
    import tensorflow as tf
    tf.reset_default_graph()
    keras.backend.tensorflow_backend._SESSION.close()
    keras.backend.tensorflow_backend._SESSION = None

Maybe we should incorporate this into a keras.reset() function?
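The leak comes from module-level state: every model registers ops in one shared default graph that nothing ever clears. The shape of the problem, and of the hypothetical keras.reset() proposed above, can be sketched with plain Python global state. GRAPH, build_model, and reset below are illustrative stand-ins, not Keras or TensorFlow APIs:

```python
# Illustrative stand-in for TensorFlow's global default graph:
# a module-level container that every model build appends to.
GRAPH = []

def build_model(n_ops=1000):
    """Pretend model build: registers ops in the global graph."""
    ops = [object() for _ in range(n_ops)]
    GRAPH.extend(ops)  # the ops outlive the model object itself
    return ops

def reset():
    """What a keras.reset() would have to do: drop the global state."""
    GRAPH.clear()

for _ in range(3):
    build_model()
print(len(GRAPH))  # 3000 -- grows with every build, even with no reference kept
reset()
print(len(GRAPH))  # 0
```

Deleting the model object alone does nothing here, because the graph, not the model, owns the ops; only an explicit reset of the global state frees them.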

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Reactions: 18
  • Comments: 52 (11 by maintainers)

Most upvoted comments

You can now use K.clear_session() when using TensorFlow, which will clean up everything. This is recommended if you ever create models inside a loop.

Here is sample code, and the results:

from keras.models import Sequential
from keras.layers.core import Dense, Activation
import os
import psutil
import timeit
import gc

def get_mem_usage():
    process = psutil.Process(os.getpid())
    return process.memory_info()


def build():
    model = Sequential()
    model.add(Dense(output_dim=4096, input_dim=4096, init="glorot_uniform"))
    model.add(Activation("relu"))
    model.compile(loss='categorical_crossentropy', optimizer='sgd')
    return model


if __name__ == '__main__':
    for i in range(10):
        gc.collect()
        t = timeit.timeit('build()', number=1, setup="from __main__ import build")
        mem = get_mem_usage()
        print('build time: {}, mem: {}'.format(t, mem))

results:

Using TensorFlow backend.
build time: 1.02965593338, mem: pmem(rss=599789568, vms=1527300096)
build time: 1.0096321106, mem: pmem(rss=1141383168, vms=2068729856)
build time: 1.03104996681, mem: pmem(rss=1682370560, vms=2610061312)
build time: 1.0659198761, mem: pmem(rss=2223833088, vms=3151384576)
build time: 1.08011817932, mem: pmem(rss=2765127680, vms=3692707840)
build time: 1.10519003868, mem: pmem(rss=3306053632, vms=4233703424)
build time: 1.13465809822, mem: pmem(rss=3847581696, vms=4775194624)
build time: 1.14798998833, mem: pmem(rss=4387577856, vms=5314605056)
build time: 1.17501521111, mem: pmem(rss=4929052672, vms=5856210944)
build time: 1.25362706184, mem: pmem(rss=5469794304, vms=6396817408)

Notice the compilation time and memory usage going up. After clearing the default graph between iterations with K.clear_session(), these are the results:

Using TensorFlow backend.
build time: 0.988173961639, mem: pmem(rss=598212608, vms=1527754752)
build time: 0.976176023483, mem: pmem(rss=598134784, vms=1527767040)
build time: 0.973516941071, mem: pmem(rss=598507520, vms=1528115200)
build time: 0.975924968719, mem: pmem(rss=598638592, vms=1528377344)
build time: 0.975230932236, mem: pmem(rss=599068672, vms=1528639488)
build time: 0.976888895035, mem: pmem(rss=599187456, vms=1528623104)
build time: 0.978793144226, mem: pmem(rss=599056384, vms=1528639488)
build time: 0.975780010223, mem: pmem(rss=598925312, vms=1528647680)
build time: 0.977483987808, mem: pmem(rss=598794240, vms=1528639488)
build time: 0.974485874176, mem: pmem(rss=599236608, vms=1528623104)

from keras import backend as K
import gc

model = ...  # build / train the model
del model
K.clear_session()
gc.collect()

This may work.

We are now using Keras 2.1.5 and the problem still exists; it is not resolved by K.clear_session()

I’m still seeing this issue with:

  • TensorFlow version: 1.13.1
  • tf.keras version: 2.2.4-tf
  • OS: Windows 10
  • GPU: NVIDIA GTX 1080 Ti (TensorFlow-GPU)

I’ve tried tf.keras.backend.clear_session() with no luck, still hitting RAM OOM errors eventually. I’ve also tried manually invoking garbage collection with no luck.

I should note that tf.keras.backend.clear_session() does result in a visible drop in RAM, but the next call to Model.fit(...) during looping, consumes more memory than was freed during the initial call to tf.keras.backend.clear_session(). I should also note that I am using TensorFlow datasets with one-shot iterators during training.

I haven’t been able to pinpoint why this happens. But I know the problem occurs when I call Model.fit(...) on my Keras model with the two one-shot iterators in a repeated loop. If I just initialize the one-shot iterators and don’t fit the Keras model (only compile it), then the memory usage is uniform. As soon as Model.fit(...) is called with train_ds.make_one_shot_iterator() and val_ds.make_one_shot_iterator(), I slowly leak RAM despite calling tf.keras.backend.clear_session() at the beginning of the loop.

Has anyone encountered this issue while directly fitting the Keras model to TensorFlow data generators? I’m trying not to downgrade too far due to the TensorFlow generator support in the more recent releases.

I’m working on a minimal, complete, and verifiable example (MCVE), but my code is still a bit lengthy to post.

I can confirm this problem with Keras 2.2.2 and Tensorflow 1.8.

I downgraded Keras to version 2.1.6, and the problem is gone.

Here is a pattern I adopted when fighting OOM that in retrospect may have caused OOM on its own:

model = load_model(...)
# predictions
del model
K.clear_session()
model = load_model(...)
# predictions

I suspect that is why I was hitting OOM after my first del/clear_session(): deleting the model may deprive TF of info it needs to clear the session properly.

Now I am not reloading the model at all, and the original OOM seems to be gone, perhaps due to newer versions of everything. I haven’t verified that calling ‘del model’ before clear_session() caused the latest memory leak, because testing takes a while, but I recommend that anyone using this sort of pattern try deleting things after clear_session():

K.clear_session()
del model
model = load_model(...)

Beware of adoption becoming maladaptation. 😃
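Whether this ordering actually matters inside TensorFlow is speculation in the comment above, but the general hazard is real for any resource manager: release through the manager while you still hold the handle, then drop your reference. A toy sketch of that ordering rule, with purely illustrative names (not Keras/TF APIs):

```python
# Toy resource manager illustrating the cleanup-ordering hazard
# described above. Session, load_model, and release are illustrative
# stand-ins, not Keras or TensorFlow APIs.
class Session:
    def __init__(self):
        self.resources = {}  # handle -> allocated buffer

    def load_model(self):
        handle = object()
        self.resources[handle] = bytearray(1024)  # pretend device memory
        return handle

    def release(self, handle):
        # Freeing requires the handle: a caller that discards its only
        # reference before calling release() can never free the buffer.
        self.resources.pop(handle, None)

session = Session()
model = session.load_model()
session.release(model)  # release while we still hold the handle...
del model               # ...and only then drop our reference
print(len(session.resources))  # 0 -- nothing leaked
```

The safe order is the one recommended at the end of the comment: clean up through the manager first, delete your references second.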

I think Keras uses the default session (probably), so I set the session manually and then called K.clear_session(), which works fine, as below:

from keras import backend as K
cfg = K.tf.ConfigProto()
cfg.gpu_options.allow_growth = True
K.set_session(K.tf.Session(config=cfg))

# training / validation part ....

K.clear_session()

# loading another model ....

Hi,

Try

from keras import backend as be
# (…)
be.clear_session()

I run into OOM exceptions while using KerasClassifier to sweep large hyperparameter grids with TF backend. No problems with Theano.

Not exactly sure why this issue has been closed.

What can be done to mitigate the growing load time when calling load_model() sequentially?

E.g. with ten different models that all need to be in memory at once, using clear_session() between loads is not an option.

import keras
from keras.models import load_model
keras.backend.clear_session()

files = ['model1.h5', 'model2.h5', 'model3.h5', 'model4.h5', '...']

models = [load_model(f) for f in files]
# each model takes 30 seconds more than the previous one to load
# in particular, models 9 or 10 really take ages to load

do_something_with(models)
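Without clear_session(), each load_model() adds that model’s ops to the same default graph, so per-load bookkeeping (name uniquing, shape inference, and so on) walks an ever larger structure. That is consistent with each load taking roughly a constant amount longer than the previous one: linear growth per load, quadratic overall. A stdlib-only sketch of the effect, with illustrative names:

```python
# Stand-in for the shared default graph that every load extends.
GRAPH = []

def load_model(n_ops=5):
    """Pretend model load: adds this model's ops, then does per-load
    work proportional to the size of the whole accumulated graph."""
    GRAPH.extend(object() for _ in range(n_ops))
    return sum(1 for _ in GRAPH)  # cost scales with total graph size

costs = [load_model() for _ in range(10)]
print(costs)  # each load costs more than the last
```

This is why the ninth and tenth models take so much longer: they pay for all the graphs loaded before them.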

We hit the same problem in a loop for an sklearn k-fold experiment. No problem after switching to Theano.