tensorflow: High memory consumption with model.fit in TF 2.0.0 and 2.1.0-rc0
System information
- Have I written custom code: Yes
- OS Platform and Distribution: Linux Kubuntu 18.04, kernel 5.0
- Mobile device: Not verified on mobile devices
- TensorFlow installed from: binary via pip install tensorflow-gpu
- TensorFlow version: 2.1.0-rc0; however, also affected are 2.0.0, 2.0.0-rc0, 2.0.0-rc1 and 2.0.0-rc2
- Python version: 3.6.9
- CUDA version: 10.1 for TF 2.1.0-rc0; 10.0 for the earlier versions of TF
- cuDNN version: 7
- GPU model and memory: Nvidia GeForce GTX 1050 Ti (4GB)
- CPU model: AMD Ryzen 7 1700
Describe the current behavior
Model training with the Keras API consumes a large amount of system memory with TF 2.0.0 and 2.1.0-rc0, as well as with 2.0.0-rc0, 2.0.0-rc1 and 2.0.0-rc2. It looks like the memory used by model.fit is proportional to the size of the training data provided as numpy arrays, with the proportionality constant being approximately 1. In other words, if the numpy arrays x and y are, say, 8 GB in total, then model.fit(x,y,...) will use another 8 GB (plus some overhead). This may suggest that model.fit creates unnecessary copies of the data arrays. This is in contrast to TF 1.14.0, 2.0.0-a0, 2.0.0-b0 and 2.0.0-b1, where model.fit seems to use an amount of RAM that is independent of the data size (and much less than 8 GB, at least in the test code attached below).
The same applies to the validation data. If validation data are passed as numpy arrays to model.fit via the argument validation_data, then model.fit seems to duplicate the validation data arrays in memory with TF from 2.0.0-rc0 to 2.1.0-rc0.
In the code attached below, one may change the variable K to vary the size of the data and test the behaviour described above. It is straightforward to estimate the data size: e.g. with K=5000 the data arrays in the code below should be ca. 7.32 GB in total. The whole Python process associated with this code uses approximately this much RAM plus some overhead when running with TF 1.14.0, 2.0.0-a0, 2.0.0-b0 or 2.0.0-b1. But with TF from 2.0.0-rc0 to 2.1.0-rc0 the Python process consumes twice as much RAM. One may comment out the line containing model.fit to check that it is the point at which the high memory consumption starts.
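One way to make this check concrete is to print the resident set size of the Python process around the call to model.fit. The sketch below is not part of the original report and assumes the third-party psutil package is available:
import os
import psutil  # assumed available: pip install psutil

def print_rss(tag):
    '''Print the current resident set size of this process in GiB.'''
    rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    print("[{}] resident set size: {:.2f} GiB".format(tag, rss_gib), flush=True)

# Call print_rss("before fit") and print_rss("after fit") around model.fit
# in the reproduction code below to see where the extra memory appears.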
Describe the expected behavior
model.fit should not duplicate the training and validation data passed as numpy arrays in memory. Its memory usage should be more or less independent of the size of the data arrays, similarly as in TF 1.14.0 and in the pre-releases 2.0.0-a0, 2.0.0-b0 and 2.0.0-b1.
Code to reproduce the issue
import tensorflow as tf
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Lambda, Conv2D
print("Tensorflow version: {}".format(tf.__version__),flush=True)
K = 5000 # Number of images
N = 512 # Image size
MAX_SIGNAL = 5000 # The values of the training data range from 0 to this
def build_model():
    '''Create a simple test model.'''
    inputs = Input((N, N, 1))
    s = Lambda(lambda x: x / MAX_SIGNAL)(inputs)
    s = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(s)
    outputs = s
    return Model(inputs=[inputs], outputs=[outputs])
# Generate some random data
x_train = np.random.randint(MAX_SIGNAL+1,size=(K,N,N,1),dtype=np.uint16) # Should be 2 560 000 kB
y_train = np.random.randint(1+1 ,size=(K,N,N,1),dtype=np.bool) # Should be 1 280 000 kB
x_val = np.random.randint(MAX_SIGNAL+1,size=(K,N,N,1),dtype=np.uint16) # Should be 2 560 000 kB
y_val = np.random.randint(1+1 ,size=(K,N,N,1),dtype=np.bool) # Should be 1 280 000 kB
# In total, the above arrays should be 7 680 000 kB
model = build_model()
optimizer = tf.keras.optimizers.Adam()
loss = tf.keras.losses.BinaryCrossentropy()
model.compile(optimizer=optimizer, loss=loss)
model.fit(x=x_train, y=y_train, validation_data=(x_val,y_val), batch_size=8, epochs=10)
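As an optional sanity check (not part of the original script), the sizes quoted in the comments above can be confirmed directly from the arrays:
# Optional check of the array sizes stated in the comments above.
for name, arr in [("x_train", x_train), ("y_train", y_train),
                  ("x_val", x_val), ("y_val", y_val)]:
    print("{}: {:.0f} kB".format(name, arr.nbytes / 1024))
total = sum(a.nbytes for a in (x_train, y_train, x_val, y_val))
print("total: {:.0f} kB (~{:.2f} GiB)".format(total / 1024, total / 1024**3))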
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 17
- Comments: 44 (8 by maintainers)
Commits related to this issue
- trying to work around a bug https://github.com/tensorflow/tensorflow/issues/35030 — committed to lidless-vision/keras-residual-vae-tf2.3 by cameronbergh 4 years ago
An update on the problem: the Lambda layer has no memory usage issue, and it looks like memory_profiler has a wrong mapping between line numbers and memory usage/increment. When inserting pdb.set_trace() before model.fit(), htop (or top) shows no memory increase after model = build_model(), but it increases dramatically in model.fit.
The root cause is the way TF converts numpy arrays to tensors. If you run the following code, you will see an increase of 2.5 GB in memory usage (which is almost the dataset size). Even for yy = tf.convert_to_tensor(x_train), the memory usage increases too. We are working on the fix. Please stay tuned.
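The snippet the maintainer refers to is not reproduced in this excerpt; a standalone sketch of the conversion cost, using the same array shape as the reproduction code above, could look like this:
# Standalone sketch (an illustration, not the snippet referred to above):
# after converting a large uint16 numpy array to a tensor, the process holds
# roughly twice the array size, consistent with a copy being made.
import numpy as np
import tensorflow as tf

x_train = np.random.randint(5001, size=(5000, 512, 512, 1), dtype=np.uint16)  # ~2.44 GiB
yy = tf.convert_to_tensor(x_train)  # resident memory grows by roughly the array size again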
I have the same issue here and found out that it happens when we provide validation data. If you remove validation data from fit(), then memory usage remains constant, without any increase. I also noticed that if we provide validation data, memory usage goes up after each epoch, but after 10 epochs a portion of it gets cleared and then it goes up again until the next 10th epoch. These steps go on until RAM is full and training crashes. I should also mention I am using TF 2.2.
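A sketch of the workaround described above (an illustration based on that comment, not code taken from it); whether a manual evaluation pass avoids the growth is not verified in this thread:
# Drop validation_data from fit() so memory stays constant, and run the
# validation pass manually after each epoch instead.
for epoch in range(10):
    model.fit(x_train, y_train, batch_size=8, epochs=1)
    val_loss = model.evaluate(x_val, y_val, batch_size=8)
    print("epoch {}: val_loss = {:.4f}".format(epoch + 1, val_loss))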
Not exactly. The issue is with the system RAM, not with GPU RAM.
I can see that in both of your gists you get similar usage of system RAM: 9.8 GB for 2.0.0-beta1 and 10.2 GB for 2.1.0-rc1. Yes, this is the expected behavior. That said, I cannot reproduce this expected behavior on my workstation. I obtain 9.6 GB for 2.0.0-beta1 and 18.7 GB for 2.1.0-rc1. The latter is twice as much as the former.
I have just discovered that TF 2.1.0-rc1 was released today. Unfortunately, the issue persists in 2.1.0-rc1.
I am also experiencing continually increasing memory usage with TensorFlow 2.1.0 and 2.0.0. I’m not using Lambda layers, just Dense and Dropout layers. Adding calls to del model, tf.keras.backend.clear_session() and gc.collect() slows the rate of growth, but it still grows endlessly.
I’m using Python 3.6.9 on Linux Mint 19.1. Like others, I’ve been unable to reproduce the issue in Colab.
I’ve been able to reproduce the issue locally with the following packages: tensorflow-2.1.0, tensorflow-cpu-2.1.0, tensorflow-gpu-2.1.0, tensorflow-2.0.0, tensorflow-1.15.0.
This package gives stable memory usage: tensorflow-cpu-1.15.0
Here’s the code I used to test:
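The test code itself is not reproduced in this excerpt; based on the description above (Dense and Dropout layers only, with del model, clear_session() and gc.collect() between runs), it was presumably something along these lines:
import gc
import numpy as np
import tensorflow as tf

x = np.random.rand(10000, 100).astype(np.float32)
y = np.random.randint(2, size=(10000, 1))

for i in range(100):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(100,)),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    model.fit(x, y, epochs=1, batch_size=32, verbose=0)
    # Per the report above, these calls slow the memory growth but do not stop it.
    del model
    tf.keras.backend.clear_session()
    gc.collect()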
Excuse me for the late answer. I’m back to the topic now.
First, let me acknowledge your effort in digging into the issue, which I appreciate. But unfortunately, I cannot confirm that the issue is fixed.
To reiterate briefly, the original problem was as follows. It would be reasonable to expect that the memory usage by the test code for this issue is ca. the size of the data arrays (plus some overhead). But in TF 2.0 and 2.1 the memory usage was twice the size of the data arrays (plus overhead). Hence this issue.
The current state is as follows:
In TF 2.2 things have indeed changed, although it is hard to call the issue fixed. Now, with the test code posted in this issue, I get a memory leak of about 0.5 × the data size in each epoch. In other words, if the size of the data arrays is ca. 8 GB, then the memory usage increases by ca. 4 GB each epoch.
If I wrap the numpy data arrays in datasets (as sketched below), then the behavior in TF 2.2 is the same as in 2.1 and 2.0, namely the memory usage is twice the data size plus overhead.
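The wrapping snippet itself is not reproduced in this excerpt; it was presumably along these lines (the batch size here is illustrative):
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(8)
val_ds = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(8)
model.fit(train_ds, validation_data=val_ds, epochs=10)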
Can you confirm these results, @yhliang2018, @mihaimaruseac?
Coming late to the issue.
That is expected. 1.15, 2.1 and later have a single pip package for both CPU and GPU builds. If you want only the CPU pip, you should install tensorflow-cpu. This is documented in the release notes.
I’m going to try some bisection on nightly, provided I can reproduce the issue.
No problem. I have just reinstalled with pip install --no-cache-dir tensorflow-gpu==2.1.0-rc1. The issue is still present. I have also tried tf-nightly (version 2.1.0-dev20191219) and the results are the same as in 2.1.0-rc1.
The issue apparently is system-dependent. So, to exclude a few possibilities, I have also run tests with some non-GPU versions of TensorFlow (1.14.0, 2.0.0-b1, 2.0.0-rc0 and 2.0.0). My earlier results concerning RAM usage have been reproduced: for 1.14.0 and 2.0.0-b1 I observed the expected behavior, while for 2.0.0-rc0 and 2.0.0 the RAM usage was two times higher.
Concerning 2.1.0-rc1, it turned out that the allegedly non-GPU pip package (tensorflow==2.1.0-rc1) is actually distributed with GPU support (similarly to tensorflow-gpu==2.1.0-rc1). So for this version, I have manually switched off the GPU support by adding a couple of lines to the script (see the sketch below). Result: the memory usage for 2.1.0-rc1 with the GPU disabled this way is the same as with the GPU enabled.
Please let me know what further diagnostic info could be helpful for you. I will try to provide all necessary details.
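The GPU-disabling lines themselves are not reproduced in this excerpt; a common way to do it from within the script (an assumption, not necessarily the reporter's exact lines) is:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # hide the GPU; must run before TensorFlow initializes CUDA
# (alternatively, tf.config.set_visible_devices([], 'GPU') can be used in recent TF 2.x)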