tensorflow: Memory leak in model.fit
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): minimal working example
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows Server 2016
- TensorFlow installed from (source or binary): conda
- TensorFlow version (use command below): tf 2.1.0
- Python version: 3.7
- CUDA/cuDNN version: CUDA 10.1
- GPU model and memory: K80 - 24 GB
Describe the current behavior
Memory use increases with consecutive training runs; probably related to #35524, #33030, #35124 and #35835. Side note: I do not understand the warning, but this seems to be handled in #37500.
Describe the expected behavior
Memory usage should stay constant across training runs.
Standalone code to reproduce the issue
from tensorflow.keras.datasets import cifar10
import tensorflow.keras.callbacks as callbacks
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.layers import Input, Conv2D, GlobalAveragePooling2D, Activation, Dense
from tensorflow.keras.models import Model
import tensorflow.keras.utils as kutils
import tensorflow as tf
import numpy as np
import psutil
import gc

batch_size = 128
epochs = 5
num_classes = 10

def buildmodel():
    # Small convolutional classifier for CIFAR-10
    img_input = Input(shape=(32, 32, 3))
    x = Conv2D(16, (3, 3), padding='same')(img_input)
    x = Activation("relu")(x)
    x = Conv2D(16, (3, 3), padding='same')(x)
    x = Activation("relu")(x)
    x = GlobalAveragePooling2D()(x)
    prediction = Dense(num_classes, activation='softmax', name='classifier')(x)
    model = Model(inputs=img_input, outputs=prediction)
    return model

# Load and normalize the data
(trainX, trainY), (testX, testY) = cifar10.load_data()
mean = np.mean(trainX, axis=0)
std = np.std(trainX)
trainX = trainX.astype('float32')
trainX = (trainX - mean) / std
testX = testX.astype('float32')
testX = (testX - mean) / std
trainY = kutils.to_categorical(trainY)
testY = kutils.to_categorical(testY)

generator = ImageDataGenerator()
generator.fit(trainX)
val_generator = ImageDataGenerator()

# Train a fresh model ten times in a row and report memory after each run
for i in range(10):
    tf.keras.backend.clear_session()
    model = buildmodel()
    sgd = SGD(lr=0.1, momentum=0.9, nesterov=True)
    model.compile(loss="categorical_crossentropy", optimizer=sgd, metrics=["acc"])
    model.fit(generator.flow(trainX, trainY, batch_size=batch_size),
              epochs=epochs,
              validation_data=val_generator.flow(testX, testY, batch_size=batch_size),
              verbose=0, workers=20)
    print('memory used: ' + str(psutil.virtual_memory().used // 1e6))
    gc.collect()
output (each training run prints the following warning twice, followed by the memory reading):

WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to
  ['...']

memory used: 38012.0
memory used: 38563.0
memory used: 39288.0
memory used: 40005.0
memory used: 40730.0
memory used: 41490.0
memory used: 42216.0
memory used: 42937.0
memory used: 43659.0
memory used: 44403.0
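Note that psutil.virtual_memory().used measures system-wide memory, so other processes on the machine can add noise to the readings. Below is a minimal sketch, assuming the same training loop as above, of a process-local measurement that isolates the Python process's own resident set size (the helper name is illustrative):

import os
import psutil

process = psutil.Process(os.getpid())

def log_process_memory(tag):
    # Resident set size of this Python process only, in MB
    rss_mb = process.memory_info().rss // 10**6
    print(tag + ': process RSS = ' + str(rss_mb) + ' MB')

Calling log_process_memory('iteration ' + str(i)) in place of the virtual_memory() print shows whether the growth is confined to the training process itself.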
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 8
- Comments: 42 (9 by maintainers)
Hey, I was having the same issue. My current workaround is to save the model before deleting it and clearing the backend after each training iteration. Then I reload the model before calling fit again. I’m no longer experiencing the memory leak with this workaround.
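A minimal sketch of that workaround, reusing the model builder, generators, and hyperparameters from the reproduction script above; the checkpoint file name is illustrative, not from the original comment:

import gc
import tensorflow as tf

model = buildmodel()
sgd = SGD(lr=0.1, momentum=0.9, nesterov=True)
model.compile(loss="categorical_crossentropy", optimizer=sgd, metrics=["acc"])

for i in range(10):
    model.fit(generator.flow(trainX, trainY, batch_size=batch_size),
              epochs=epochs, verbose=0)
    # Save the model, then tear down the Keras graph state before the next run
    model.save('model_checkpoint.h5')  # illustrative file name
    del model
    tf.keras.backend.clear_session()
    gc.collect()
    # Reload so the next fit() continues from the saved weights and optimizer state
    model = tf.keras.models.load_model('model_checkpoint.h5')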
@krenerd Unfortunately, I tried your code but the memory leak still exists.
@jvishnuvardhan Thanks for your help and the great work; memory is constant with tf-nightly! As a side note: when I remove either gc.collect() or tf.keras.backend.clear_session(), memory leaks again, so both calls are needed for constant memory usage when performing consecutive training runs.
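A condensed sketch of that pattern, with both cleanup calls at the end of each iteration (same names as in the reproduction script above):

import gc
import tensorflow as tf

for i in range(10):
    model = buildmodel()
    model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["acc"])
    model.fit(generator.flow(trainX, trainY, batch_size=batch_size),
              epochs=epochs, verbose=0)
    del model
    # Per the comment above, removing either of the next two calls
    # makes memory grow again, even on tf-nightly; together they keep usage constant
    tf.keras.backend.clear_session()  # drop the global Keras/TF graph state
    gc.collect()                      # collect lingering Python-side references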
I found my solution: stop using this library altogether. PyTorch forever.
https://fantashit.com/linearly-increasing-memory-with-use-multiprocessing-and-keras-sequence/#comment-254237
Can anyone advise how to check since which stable release this bug has been present? I can't find any info on the releases page: https://github.com/tensorflow/tensorflow/releases?page=1