keras: High memory consumption with model.fit in TF 2.x

Moved from the TensorFlow repository: https://github.com/tensorflow/tensorflow/issues/40942

@gdudziuk opened this issue in TF repo.

System information

  • Have I written custom code: Yes
  • OS Platform and Distribution: CentOS Linux 7
  • Mobile device: Not verified on mobile devices
  • TensorFlow installed from: binary, via pip install tf-nightly
  • TensorFlow version: 2.5.0-dev20200626
  • Python version: 3.6.8
  • CUDA/cuDNN version: 10.1 / 7
  • GPU model and memory: Tesla V100 32 GB

Describe the current behavior

Model training with the Keras API consumes a large amount of system memory. The memory used by model.fit appears to be proportional to the size of the training data provided as numpy arrays, with a proportionality constant of approximately 1. In other words, if the numpy arrays x and y are, say, 8 GB in total, then model.fit(x,y,…) will use another 8 GB (plus some overhead). So the total memory usage during model.fit is twice the data size (plus some overhead).

The same applies to the validation data. If validation data are passed as numpy arrays to model.fit via the validation_data argument, then model.fit appears to duplicate the validation data arrays in memory as well.

The described effect is also present if I wrap the numpy arrays containing the data in TF Datasets.

In the code attached below, one may change the variable K to vary the size of the data and test the behavior described above. It is straightforward to estimate the data size (e.g. with K=5000 the data arrays in the code below should be ca. 7.32 GB in total). The whole Python process associated with this code uses approximately twice this much RAM plus some overhead independent of the data size. One may comment out the line containing model.fit to check that this is the point at which the high memory consumption starts.
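The size estimate above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the original test code: the helper expected_nbytes and the ITEMSIZE table are names introduced here for illustration, and the last line only restates the reported "twice the data size" observation without the overhead term.

```python
K = 5000  # number of images, as in the test code
N = 512   # image side length, as in the test code

# Bytes per element for the two dtypes used in the test code
# (hypothetical lookup table, introduced only for this sketch).
ITEMSIZE = {"uint16": 2, "bool": 1}

def expected_nbytes(shape, dtype_name):
    """Bytes needed to hold a dense array of the given shape and dtype."""
    count = 1
    for dim in shape:
        count *= dim
    return count * ITEMSIZE[dtype_name]

shape = (K, N, N, 1)
sizes = {
    "x_train": expected_nbytes(shape, "uint16"),
    "y_train": expected_nbytes(shape, "bool"),
    "x_val": expected_nbytes(shape, "uint16"),
    "y_val": expected_nbytes(shape, "bool"),
}
total = sum(sizes.values())

for name, nbytes in sizes.items():
    print("{}: {} kB".format(name, nbytes // 1024))
print("total: {} kB = {:.2f} GB".format(total // 1024, total / 2**30))
# Reported behavior: roughly twice the data size, plus a size-independent overhead.
print("reported usage excluding overhead: about {:.2f} GB".format(2 * total / 2**30))
```

Here kB and GB are taken as powers of 1024, which matches the figures quoted in the code comments (2 560 000 kB per uint16 array, 7 680 000 kB in total).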

Describe the expected behavior

It would be reasonable to expect the memory usage of the test code to be approximately the data size plus some overhead independent of the data size (not twice the data size plus overhead).

A bit of history

This is a continuation of issue #35030, concerning TF 2.0 and 2.1. I opened that issue in December 2019, and now @karmel has stated that it has become very long and asked me to test whether the problem persists in TF-nightly and to open a new issue if necessary. So yes, the problem persists, and here I open a new issue.

The problem first appeared in release 2.0.0-rc0. In earlier releases, up to and including 2.0.0-b1, the memory usage of the test code below was ca. the size of the data arrays plus an overhead independent of the data size. Starting from 2.0.0-rc0 it became twice the data size plus overhead, and this remained true at least until 2.1.0.

Next, in 2.2.0, the situation changed a bit:

  • When using numpy arrays to pass data to model.fit, there was a memory leak of about 0.5 x data size in each epoch. In other words, if the size of the data arrays was ca. 8 GB, then the memory usage increased by ca. 4 GB each epoch.
  • When wrapping the data arrays in TF datasets and then passing them to model.fit, the behavior in TF 2.2 was the same as in 2.1 and 2.0, namely the memory usage was twice the data size plus overhead.

Now, in the nightly release 2.5.0-dev20200626, we are back to the previous situation: the memory usage is twice the data size plus overhead, regardless of whether numpy arrays or datasets are used to pass the data to model.fit.
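To tell the two behaviors apart (a per-epoch leak versus a constant doubling), one can look at how the process memory changes from epoch to epoch relative to the data size. The sketch below is only an illustration: the function names are hypothetical, and the readings are synthetic numbers constructed to mimic the two cases, not measurements.

```python
def per_epoch_growth(rss_samples):
    """Differences between consecutive memory readings, in bytes."""
    return [b - a for a, b in zip(rss_samples, rss_samples[1:])]

def growth_ratio(rss_samples, data_bytes):
    """Mean per-epoch growth expressed as a fraction of the data size."""
    deltas = per_epoch_growth(rss_samples)
    return sum(deltas) / len(deltas) / data_bytes

GB = 10**9
data_bytes = 8 * GB

# Synthetic readings (NOT measurements), one per epoch:
# "leaky" grows by 0.5 x data size each epoch, "flat" stays at 2 x data size.
leaky = [16 * GB + i * 4 * GB for i in range(4)]
flat = [16 * GB] * 4

print(growth_ratio(leaky, data_bytes))  # 0.5
print(growth_ratio(flat, data_bytes))   # 0.0
```

A ratio near 0.5 would correspond to the leak described for numpy input in 2.2.0, while a ratio near 0 with a high baseline corresponds to the constant "twice the data size" behavior.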

An important note on reproducibility

The issue turned out not to be reproducible in Colab! In #35030, I reported the issue for my local machine, and some other participants also managed to reproduce it locally. Others tried to reproduce it in Colab without success. Similarly, the results I report now are not from Colab.

Also, for some reason the issue cannot be captured when using libmemusage.so to measure the memory usage. To capture the issue, I use ps au in a Linux terminal or the Python module psutil.
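For readers without psutil, the resident set size of the current process can also be read directly from the proc filesystem. This is a minimal, Linux-only sketch and not the method used for the reported results; rss_bytes is a name introduced here, and it assumes /proc/self/statm is available (its second field is the number of resident pages).

```python
import os

def rss_bytes():
    """Resident set size of the current process in bytes (Linux only)."""
    with open("/proc/self/statm") as f:
        fields = f.read().split()
    resident_pages = int(fields[1])  # second field: resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE")

before = rss_bytes()
# Allocate 64 MiB filled with non-zero bytes so the pages are actually touched.
buf = b"\x01" * (64 * 2**20)
after = rss_bytes()

print("RSS before: {} bytes".format(before))
print("RSS after:  {} bytes".format(after))
print("Difference: {:.1f} MiB".format((after - before) / 2**20))
```

The same before/after pattern can be placed around model.fit to see how much the process grows at that point.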

Standalone code to reproduce the issue

Since this issue is in fact a continuation of #35030, I use the same test code here.

import tensorflow as tf
import numpy as np

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Lambda, Conv2D

print("Tensorflow version: {}".format(tf.__version__),flush=True)

K = 5000 # Number of images
N = 512  # Image size

MAX_SIGNAL = 5000 # The values of the training data range from 0 to this

def build_model():
  '''Create a simple test model.'''
  
  inputs = Input((N,N,1))
  s = Lambda(lambda x: x / MAX_SIGNAL) (inputs)
  s = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(s)
  outputs = s

  return Model(inputs=[inputs], outputs=[outputs])

# Generate some random data
x_train = np.random.randint(MAX_SIGNAL+1,size=(K,N,N,1),dtype=np.uint16) # Should be 2 560 000 kB
y_train = np.random.randint(1+1         ,size=(K,N,N,1),dtype=bool)      # Should be 1 280 000 kB
x_val   = np.random.randint(MAX_SIGNAL+1,size=(K,N,N,1),dtype=np.uint16) # Should be 2 560 000 kB
y_val   = np.random.randint(1+1         ,size=(K,N,N,1),dtype=bool)      # Should be 1 280 000 kB
# In total, the above arrays should be 7 680 000 kB

model = build_model()

optimizer = tf.keras.optimizers.Adam()
loss = tf.keras.losses.BinaryCrossentropy()

model.compile(optimizer=optimizer, loss=loss)
model.fit(x=x_train, y=y_train, validation_data=(x_val,y_val), batch_size=8, epochs=10)
The above is meant to reproduce the issue with data passed to model.fit as numpy arrays. To test the behavior with TF datasets, replace the last line with the following:

ds_train = tf.data.Dataset.from_tensor_slices((x_train,y_train)).batch(8)
ds_val = tf.data.Dataset.from_tensor_slices((x_val,y_val)).batch(8)
model.fit(ds_train, validation_data=ds_val, epochs=10)

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 25 (4 by maintainers)

Most upvoted comments

Thank you very much for doing this. Let me also post a slightly modified version of the test code with built-in memory measurements, which may be more convenient:

import tensorflow as tf
import numpy as np
import psutil
import os

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Lambda, Conv2D
from tensorflow.keras.callbacks import Callback

print("Tensorflow version: {}".format(tf.__version__),flush=True)

K = 5000 # Number of images
N = 512  # Image size

MAX_SIGNAL = 5000 # The values of the training data range from 0 to this

class MemoryUsageCallback(Callback):
  '''Monitor memory usage on epoch begin and end.'''

  def on_epoch_begin(self,epoch,logs=None):
    print('**Epoch {}**'.format(epoch))
    print('Memory usage on epoch begin: {}'.format(psutil.Process(os.getpid()).memory_info().rss))

  def on_epoch_end(self,epoch,logs=None):
    print('Memory usage on epoch end:   {}'.format(psutil.Process(os.getpid()).memory_info().rss))
    
def build_model():
  '''Create a simple test model.'''
  
  inputs = Input((N,N,1))
  s = Lambda(lambda x: x / MAX_SIGNAL) (inputs)
  s = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(s)
  outputs = s

  return Model(inputs=[inputs], outputs=[outputs])

# Generate some random data
x_train = np.random.randint(MAX_SIGNAL+1,size=(K,N,N,1),dtype=np.uint16) # Should be 2 560 000 kB
y_train = np.random.randint(1+1         ,size=(K,N,N,1),dtype=bool)      # Should be 1 280 000 kB
x_val   = np.random.randint(MAX_SIGNAL+1,size=(K,N,N,1),dtype=np.uint16) # Should be 2 560 000 kB
y_val   = np.random.randint(1+1         ,size=(K,N,N,1),dtype=bool)      # Should be 1 280 000 kB
# In total, the above arrays should be 7 680 000 kB

model = build_model()

callbacks = [MemoryUsageCallback()]
optimizer = tf.keras.optimizers.Adam()
loss = tf.keras.losses.BinaryCrossentropy()

model.compile(optimizer=optimizer, loss=loss)
model.fit(x=x_train, y=y_train, validation_data=(x_val,y_val), batch_size=8, epochs=10, callbacks=callbacks, verbose=0)

The above is meant to reproduce the issue with data passed to model.fit as numpy arrays. To test the behavior with TF datasets, replace the last line with the following:

ds_train = tf.data.Dataset.from_tensor_slices((x_train,y_train)).batch(8)
ds_val = tf.data.Dataset.from_tensor_slices((x_val,y_val)).batch(8)
model.fit(ds_train, validation_data=ds_val, epochs=10, callbacks=callbacks, verbose=0)

How is it possible that this issue has been marked as stalled? I have answered @rchao's questions and was waiting for a response.