keras: Memory leak during model.fit()
Hi,
I am trying to train a simple CNN in Keras. During training via model.fit(), the system's free memory keeps decreasing and eventually the process runs out of memory and is terminated with a "Killed" error. When I train it one epoch at a time, I can clearly see the reduction in free memory after each epoch. Is this normal?
from keras.models import Sequential
from keras.layers import (Conv2D, Reshape, MaxPooling1D, Flatten,
                          BatchNormalization, Dense, Dropout)

model = Sequential()
model.add(Conv2D(input_shape=(1, 30, 300), filters=10, kernel_size=(3, 300), padding='valid',
                 data_format="channels_first", activation='relu'))
model.add(Reshape((10, 28)))
model.add(MaxPooling1D(pool_size=10))
model.add(Flatten())
model.add(BatchNormalization())
model.add(Dense(20, activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(Dense(10, activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(Dense(5, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(X_train, Y_train, batch_size=32, epochs=1, validation_data=(X_test, Y_test), verbose=0)
However, when I set batch_size = 1, I see that it works fine with no memory leaks.
I am running Keras version 2.0.2 with the Theano backend on CPU.
Thanks, Hareesh
Edit: I was not facing this issue with Keras version 1.2.
- Check that you are up-to-date with the master branch of Keras. You can update with: pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps
- If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.
- If running on Theano, check that you are up-to-date with the master branch of Theano. You can update with: pip install git+git://github.com/Theano/Theano.git --upgrade --no-deps
- Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).
About this issue
- State: closed
- Created 7 years ago
- Reactions: 11
- Comments: 60 (10 by maintainers)
I had the same problem (TensorFlow backend, Python 3.5, Keras 2.1.2).
I found the same problem with TF 2.2 and tf.keras model.fit().
To solve the ifelse problem, just add from theano.ifelse import IfElse, ifelse at the beginning of theano_backend.py.
BTW, I use Theano 0.9.
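A sketch of where that import would go, assuming the keras/backend/theano_backend.py layout of Keras 2.0.x (the surrounding imports are illustrative, not an exact copy of the file):

# keras/backend/theano_backend.py (top of the file)
import theano
from theano import tensor as T
from theano.ifelse import IfElse, ifelse  # workaround for the ifelse import problem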
Python 3.6.1, Keras 2.0.2, TensorFlow 1.0.1, Ubuntu 16.04
I load data using pickle and had a similar memory leak when using model.fit().
Hello everybody,
I'm new to the forum and I also face the same memory leak problem on Ubuntu.
Features: OS Ubuntu 16.04.2 64-bit, Keras 2.0.6, Theano 0.9.0
Solve:
I'm using TF 2.3.0 and TF 2.4.0 and see this issue a lot. I dug into the source code and found a race condition that keeps memory growing. For the training data, model.fit() initializes the generator only once, and a for loop inside the class then iterates over the data forever. At the end of every epoch the process pool is closed and garbage collection starts, but it is not immediate if your data is huge. At the same time, fit() creates a new validation generator right away, even before the GC has finished. This new validation generator creates a new process pool that inherits the training generator's leftover memory (Python on Linux by default uses a copy mechanism to share memory with child processes) and copies it into the new pool; that is one layer of the leak. Once validation finishes, the same thing happens with the new training-generator process pool, which again copies the leftover memory from the validation generator. It is like rolling a snowball, and the memory keeps growing.
I tried adding gc.collect() after every epoch, but again the GC takes time, so it does not matter much whether you call it or not. You can validate this by adding a sleep in on_test_begin and on_test_end; it alleviates the symptom a little. But for some reason some of the memory is never released as long as the OrderedEnqueuer object exists, even after the pool has been closed and I have waited for a very long time. So I finally modified it so that for the validation data, after every run, I simply return from the OrderedEnqueuer's run() function. Then all the processes created by the validation data generator disappear when I check top. There is still a memory leak going from training to validation, but since the validation-to-training path is cut, the memory leak disappeared.

I'm also having this issue with TF 2.3.1. Using tf.compat.v1.disable_v2_behavior() fixed it.
@joelthchao No I don’t have any other operations other than this (just loading the data before this). I tried with your input shapes:
I get the callback output as
14201068, 14201068, 14201068, 14201068, 14201068
However, I also monitor the memory usage with free -m from another screen as model.fit() progresses. As you can see from that output, the free memory keeps decreasing and the process eventually gets killed (if it is run for too many epochs). The final free -m reading is taken after the script has completed and the program has exited. Note that I am not running any other processes. Also, as I mentioned, the free memory remains constant with batch_size=1.

Hi @joelthchao, I get the following output with batch_size=32
With batch_size=1,

Moreover, when I monitor memory using free -m on the command line, I see a clear decline in the free memory as training progresses, for batch sizes larger than 1.

I am running it on a server with an AMD Opteron™ Processor 4284.
A similar issue has been raised by #5924
I followed the exact order and instructions you suggested and it still does not work.
@justinmulli @yymarcin I suggest doing clear_session() before defining any model. Also, do gc.collect() before del model after you are done using it (that order works for me, but I do not know why it would not work just as well after deleting the model).

I finally found the issue for me: TensorFlow 1.14 has this memory leak but 1.13 does not.
I had the same problem; solved it by switching to the TensorFlow backend.
I have the same problem. Features: OS Ubuntu 16.04.2 64-bit, Keras 2.0.6, Theano 0.9.0, Python 3.
I tested the same code under Python 2 and it does not have this issue; it only happens with Python 3.
@gzapatas His method can solve the leaking problem! Thanks…
Run sudo apt-get install libblas-dev python-dev

Take a look at the official Theano website; it requires a BLAS installation: http://deeplearning.net/software/theano/install_ubuntu.html

If you run the program on GPU, other packages are also highly recommended:
- libgpuarray
- pycuda and skcuda
The leak was fixed in the master branch of Theano. I would recommend linking to a good BLAS; this will give you a speedup at the same time.
Otherwise, update Theano to the dev version.
I can confirm the fix mentioned by @nouiz . Properly linking to MKL solved the problem. http://deeplearning.net/software/theano/troubleshooting.html#test-blas
There is a PR to fix this in Theano.
The problem only happens if Theano can't link directly to BLAS. One workaround, which should also speed up computation, is to install a good BLAS library that Theano can reuse.
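A quick way to check how the stack is linked, as a sketch; the exact fields printed depend on your installation, and the Theano config attribute shown is assumed from the Theano 0.9 config system:

import numpy
numpy.__config__.show()            # shows which BLAS/LAPACK NumPy was built against

import theano
print(theano.config.blas.ldflags)  # linker flags Theano will use for BLAS; empty often means no direct BLAS link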
I got the same issue. Keras 2.0.2 on Windows with the Theano backend. The memory consumption keeps increasing and eventually the program crashes.
Hi @HareeshBahuleyan, I used your model as the target to monitor memory usage. Script:
The result shows no obvious memory leak. Can you add the callback to monitor memory usage on your own data?
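A minimal sketch of such a memory-monitoring callback (not the original script, which is not reproduced above); it uses resource.getrusage, which reports the peak resident set size in kilobytes on Linux:

import resource
from keras.callbacks import Callback

class MemoryCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        # peak resident set size of this process, in KB on Linux
        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

model.fit(X_train, Y_train, batch_size=32, epochs=5,
          validation_data=(X_test, Y_test),
          callbacks=[MemoryCallback()], verbose=0)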