keras: Memory leak during model.fit()

Hi,

I am trying to train a simple CNN in Keras. During training via model.fit(), the system's free memory keeps decreasing and eventually the process runs out of memory and dies with a “Killed” error. When I train one epoch at a time, I can clearly see the drop in free memory after each epoch. Is this normal?

# imports assumed for this snippet
from keras.models import Sequential
from keras.layers import (Conv2D, Reshape, MaxPooling1D, Flatten,
                          BatchNormalization, Dense, Dropout)

model = Sequential()
model.add(Conv2D(input_shape=(1,30,300), filters=10, kernel_size=(3, 300), padding='valid', 
                        data_format="channels_first", activation='relu'))
model.add(Reshape((10, 28)))
model.add(MaxPooling1D(pool_size=10))

model.add(Flatten())
model.add(BatchNormalization())
model.add(Dense(20, activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(Dense(10, activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(Dense(5, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid')) 

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(X_train, Y_train, batch_size=32, epochs=1, validation_data=(X_test, Y_test), verbose=0)

However, when I set batch_size = 1, I see that it works fine with no memory leaks.

I am running Keras version 2.0.2 with the Theano backend on CPU.

Thanks, Hareesh

Edit: I was not facing this issue with Keras version 1.2.

  • Check that you are up-to-date with the master branch of Keras. You can update with: pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps

  • If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.

  • If running on Theano, check that you are up-to-date with the master branch of Theano. You can update with: pip install git+git://github.com/Theano/Theano.git --upgrade --no-deps

  • Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 11
  • Comments: 60 (10 by maintainers)

Most upvoted comments

I had the same problem (TensorFlow backend, Python 3.5, Keras 2.1.2).

I found the same problem with TF2.2 and tf.keras model.fit()

To solve the ifelse problem, just add import theano.ifelse and from theano.ifelse import IfElse, ifelse at the beginning of theano_backend.py (sketched below).

BTW, I use theano 0.9
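
Concretely, the two lines that this workaround adds near the top of keras/backend/theano_backend.py would be:

# added near the top of keras/backend/theano_backend.py, per the workaround above
import theano.ifelse
from theano.ifelse import IfElse, ifelse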

Python 3.6.1, Keras 2.0.2, TensorFlow 1.0.1, Ubuntu 16.04

I load data using pickle and had a similar memory leak when using model.fit().

Hello everybody,

I’m new to the forum and I also face the same memory leak problem on Ubuntu.

Features: OS Ubuntu 16.04.2 64-bit, Keras 2.0.6, Theano 0.9.0

Solution:

  • I ran the command ‘sudo apt-get install libblas-dev’; it also installs the BLAS library that Theano depends on, and the memory leak did not come back.

I’m using TF 2.3.0 and TF 2.4.0 and see this issue a lot. I dug into the source code and found a race condition that keeps increasing memory. The model.fit function initializes the training data generator only once, and a for loop inside the class then iterates over the data repeatedly, forever. At the end of every full epoch it closes the process pool and the GC starts to work, but not immediately if your data is huge. At the same time, fit immediately creates a new validation generator, even before the GC finishes. This new validation generator creates a new process pool, which inherits the training generator’s leftover memory (Python on Linux by default uses a copy mechanism to inherit shared memory) and copies it into the new process pool; that is one layer of memory leak. Once validation finishes, the same thing happens with the new training generator process pool, which copies the leftover memory from the validation generator again. It is like rolling a snowball, and the memory keeps growing.

I tried adding gc.collect() after every epoch, but again the GC takes time; it makes little difference whether you call it or not.

You can validate this by adding a sleep in on_test_begin and on_test_end, which alleviates the symptom a little. But for some reason some of the memory is never released as long as the OrderedEnqueuer object exists, even after the pool has been closed and I have waited for a very long time. So I finally modified it so that for validation data, after every run, I just return from the OrderedEnqueuer’s run function. Then all the processes created by the validation data generator disappear when I check top. There is still a memory leak going from training to validation, but since the validation-back-to-training path is cut, the memory leak disappeared.
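
For what it’s worth, a rough sketch of the milder mitigation described above (an explicit gc.collect() plus a pause around validation), assuming the tf.keras callback hooks on_test_begin/on_test_end; the class name and sleep length are just placeholders, and as noted above this only alleviates the symptom:

import gc
import time
import tensorflow as tf

class ValidationPauseCallback(tf.keras.callbacks.Callback):
    """Give the previous generator's process pool time to shut down and be collected."""

    def on_test_begin(self, logs=None):
        gc.collect()       # try to reclaim the training enqueuer's leftover memory
        time.sleep(5)      # arbitrary pause before the validation pool is spawned

    def on_test_end(self, logs=None):
        gc.collect()
        time.sleep(5)      # same pause before training resumes

    def on_epoch_end(self, epoch, logs=None):
        gc.collect()       # as noted above, this alone is not enough

# usage (hypothetical generators):
# model.fit(train_gen, validation_data=val_gen, callbacks=[ValidationPauseCallback()])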

I’m also having this issue with TF2.3.1. Using tf.compat.v1.disable_v2_behavior() fixed it.
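
If anyone wants to try that workaround, the call has to happen before any model is built; a minimal sketch:

import tensorflow as tf

tf.compat.v1.disable_v2_behavior()   # workaround reported above for the leak seen on TF 2.3.1

# build, compile and fit the model afterwards as usual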

@joelthchao No, I don’t have any other operations apart from this (just loading the data beforehand). I tried with your input shapes:

X_train = np.random.rand(20000, 1, 30, 300)
Y_train = np.random.randint(0, 2, size=20000)
X_test = np.random.rand(20000, 1, 30, 300)
Y_test = np.random.randint(0, 2, size=20000)

I get the callback output as 14201068, 14201068, 14201068, 14201068, 14201068

However, I also monitor the memory usage with the command free -m from another screen as the model.fit() progresses. The output is:

[screenshot: output of free -m showing the free memory steadily decreasing as training progresses]

As you can see, the free memory keeps decreasing and the process eventually gets killed (if it runs for too many epochs). The final free -m reading was taken after the script completed and the program exited. Note that I am not running any other processes.

Also, as I mentioned, the free memory remains constant with batch_size=1.

Hi @joelthchao I get the following output with batch_size=32

14199908
14199908
15540832
18307928
21075688

With batch_size=1,

14199908
14199908
14199908
14199908
14199908

Moreover, when I monitor memory using free -m on the command line, I see a clear decline in the free memory as the training progresses, for batch sizes larger than 1.

I am running it on a server with AMD Opteron™ Processor 4284.

A similar issue has been raised in #5924.

@justinmulli @yymarcin I suggest calling clear_session before defining any model. Also, call gc.collect() before del model after you are done using it (it works for me, though I do not know why it would not also work after deleting the model).
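
A minimal sketch of that cleanup order (clear_session before building, then gc.collect() followed by del model once you are done); the tiny model and random data here are only stand-ins:

import gc
import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

K.clear_session()                                # clear backend state before defining any model

model = Sequential([Dense(1, input_dim=10, activation='sigmoid')])   # stand-in model
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(np.random.rand(64, 10), np.random.randint(0, 2, size=64),
          epochs=1, verbose=0)

gc.collect()                                     # collect first ...
del model                                        # ... then delete the model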

Thanks for your reply, but as I mentioned in my comment, I already do those things. I will try what yymarcin did to fix it.

@justinmulli If that does not work, let me know whether you actually followed the exact order and instructions I suggested. I do not know what “explicitly deleting objects” means, but using the Python statement del model after a call to gc.collect() could actually be important here.

I followed the exact order and instructions you suggested and it still does not work.


I finally found the issue for me. TensorFlow 1.14 has this memory leak but 1.13 does not.

I had the same problem; solved it by switching to the TensorFlow backend.

I have the same problem. Features: Ubuntu 16.04.2 64-bit, Keras 2.0.6, Theano 0.9.0, Python 3.

I tested the same code under Python 2 and it does not have this issue; it only occurs with Python 3.

@gzapatas His method can solve the leak problem! Thanks…

Run sudo apt-get install libblas-dev python-dev. Take a look at the official Theano website; it requires a “BLAS” installation: http://deeplearning.net/software/theano/install_ubuntu.html

If you run the program on a GPU, other packages are also highly recommended:

  1. libgpuarray
  2. pycuda and skcuda
  3. etc.

The leak was fixed in the master branch of Theano. I would recommend linking to a good BLAS; this will give you a speedup at the same time.

Otherwise, update Theano to the dev version.

I can confirm the fix mentioned by @nouiz . Properly linking to MKL solved the problem. http://deeplearning.net/software/theano/troubleshooting.html#test-blas
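
Not from the thread, but one quick way to check what BLAS is visible (assuming Theano 0.9’s config layout) is something like:

import numpy as np
import theano

np.show_config()                      # shows which BLAS/LAPACK/MKL NumPy is linked against
print(theano.config.blas.ldflags)     # an empty string suggests Theano has no direct BLAS link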

There is a PR to fix this in Theano.

The problem only happens if Theano can’t link directly to BLAS. One workaround that should also speed up computation is to install a good BLAS library that Theano can reuse.

On Mon, Apr 10, 2017 at 10:05, hft7h11 notifications@github.com wrote:

@fchollet https://github.com/fchollet

I have commented on the Theano bug ticket. In the meantime this is a bit of a blocker for Keras 2.0 Theano use. Reverting to Theano 0.8.2 fixes the memory leak; however, certain layers such as MaxPooling2D seem to depend on Theano 0.9 as per #5785 https://github.com/fchollet/keras/issues/5785


I got the same issue: Keras 2.0.2 on Windows with the Theano backend. The memory consumption keeps increasing and finally the program crashed.

Hi @HareeshBahuleyan, I use your model as target to monitor memory usage. Script:

# ENV: Macbook Pro 2012, keras: '2.0.1', theano: '0.9.0.dev-a4126bcced010b4bf0022ebef3e3080878adc480'
import resource
import numpy as np
from keras.callbacks import Callback

class MemoryCallback(Callback):
    def on_epoch_end(self, epoch, log={}):
        # print the peak resident set size of the process so far
        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
# ...
X_train = np.random.rand(20000, 1, 30, 300)
Y_train = np.random.randint(0, 2, size=20000)
X_test = np.random.rand(20000, 1, 30, 300)
Y_test = np.random.randint(0, 2, size=20000)

model.fit(X_train, Y_train, batch_size=32, epochs=10,
          validation_data=(X_test, Y_test), verbose=0, callbacks=[MemoryCallback()])

The result shows no obvious memory leak. Can you add the callback to monitor memory usage on your own data?

3015524352
3019411456
3023294464
3024400384
3024400384
...