tensorflow: Hang on out of memory error
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.3 LTS
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary):
- TensorFlow version (use command below): 2.1.0 and nightly, used tensorflow/tensorflow:2.1.0-gpu-py3 and tensorflow/tensorflow:nightly-gpu-py3
- Python version:
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version:
- GPU model and memory: P100 and V100; driver 440.33.01; CUDA 10.1 in container
You can collect some of this information using our environment capture script.
You can also obtain the TensorFlow version with:
1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"
Describe the current behavior
TensorFlow hangs after it runs out of memory and dumps the out-of-memory (OOM) message.
Describe the expected behavior
TensorFlow should exit with a non-zero return code on OOM.
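Until the process exits on its own, one way to get the expected behavior from the outside is to run the training script under a timeout. The following is a minimal sketch, not part of TensorFlow: the helper name `run_with_timeout` and the exit code 124 (borrowed from the coreutils `timeout` convention) are assumptions chosen for illustration.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or 124 if it was killed for hanging.

    `cmd` is an argument list such as [sys.executable, "train.py"].
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child and waits for it
        # before re-raising TimeoutExpired.
        return 124  # hypothetical "timed out" code, as used by coreutils `timeout`
    return completed.returncode
```

A caller could pass the result to `sys.exit`, so that a hung run surfaces as a failure in a job scheduler. Note that only the direct child is killed; any processes it spawned would need separate handling.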
Standalone code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.
import tensorflow as tf
from tensorflow.keras import backend as K
import numpy as np

def random_image_generator(batch_size, num_classes, input_shape):
    templates = 2 * num_classes * np.random.random((num_classes,) + input_shape)
    random_data = np.random.normal(loc=0, scale=1., size=input_shape)
    while True:
        y = np.random.randint(0, num_classes, size=(batch_size,))
        x = np.zeros((batch_size,) + input_shape, dtype=np.float32)
        for i in range(batch_size):
            x[i] = templates[y[i]] + random_data
        x_array = np.array(x)
        y_array = tf.keras.utils.to_categorical(y, num_classes)
        yield(x_array, y_array)

def run_model():
    K.set_image_data_format('channels_first')
    image_dim = 5000
    input_shape = (3, image_dim, image_dim)
    num_classes = 15
    batch_size = 1
    model_class = tf.keras.applications.ResNet50
    model = model_class(weights=None, include_top=True, input_shape=input_shape,
                        classes=num_classes)
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
    random_generator = random_image_generator(batch_size, num_classes,
                                              input_shape)
    model.fit(random_generator, steps_per_epoch=10,
              epochs=1)

run_model()
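For a rough sense of why this script exhausts GPU memory even at batch size 1, the sketch below estimates the size of two float32 tensors. The input shape comes from the script above. The shape of the first convolution output (64 channels at half the spatial resolution) is an assumption about the ResNet50 architecture, not something stated in this report, and the helper `tensor_bytes` is hypothetical.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Return the size in bytes of a dense tensor (float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# One input image from the script above: batch 1, channels_first, 5000 x 5000.
input_bytes = tensor_bytes((1, 3, 5000, 5000))

# Assumed shape of the first convolution's output: 64 channels, stride 2.
conv1_bytes = tensor_bytes((1, 64, 2500, 2500))

print(f"input: {input_bytes / 1e9:.1f} GB, first conv output: {conv1_bytes / 1e9:.1f} GB")
```

Under these assumptions, a single early activation is already well over a gigabyte. Training keeps many such activations alive for the backward pass, which would be consistent with the OOM on both 16GB and 32GB GPUs.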
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
This program hangs after dumping the out-of-memory error on 16GB and 32GB GPUs (P100 and V100 tested). The program used to exit on TensorFlow 1.15. The hang happens with both the 2.1.0 and nightly containers on Intel x86 systems.
I originally hit this on a built-from-source TensorFlow 2.1.0 on ppc64le. On that system, I attached gdb and dumped the stacks. The code appears to be hanging in the three thread stacks noted in the attachment (threeThreadStacks.txt).
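As a Python-level complement to the gdb dump, the standard library's `faulthandler` module can print every Python thread's stack if the process is still running after a deadline. The sketch below is only an illustration: the helper names `arm_hang_watchdog` and `disarm_hang_watchdog` are hypothetical, and the output covers Python frames only, not the C++ frames that gdb shows.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s, log_file=None, exit_on_timeout=True):
    """Dump all Python thread stacks if still running after timeout_s seconds.

    With exit_on_timeout=True, faulthandler calls _exit(1) after dumping,
    so a hung process ends with a non-zero return code.
    """
    if log_file is None:
        log_file = sys.stderr  # must be a file with a real file descriptor
    faulthandler.dump_traceback_later(
        timeout_s, repeat=False, file=log_file, exit=exit_on_timeout
    )


def disarm_hang_watchdog():
    """Cancel the pending dump, e.g. after training finished normally."""
    faulthandler.cancel_dump_traceback_later()
```

The idea would be to call `arm_hang_watchdog` before `model.fit` with a deadline comfortably longer than a normal run, and `disarm_hang_watchdog` afterwards.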
About this issue
- State: open
- Created 4 years ago
- Comments: 17 (17 by maintainers)
I don’t think the issue is with ParallelMapIterator. It was moved between 2.0.0 and 2.1.0, but it has always had the logic of waiting for outstanding calls to finish during destruction: https://github.com/tensorflow/tensorflow/blob/v2.0.0/tensorflow/core/kernels/data/parallel_map_iterator.cc#L70-L79
From the stack trace
#6 0x00007fff462e9cd4 in tensorflow::condition_variable::wait
#7 0x00007fff4114079c in tensorflow::data::InstantiatedCapturedFunction::RunWithBorrowedArgs
#8 0x00007fff40d88d3c in tensorflow::data::GeneratorDatasetOp::Dataset::Iterator::GetNextInternal
it looks like we’re getting stuck here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/data/captured_function.cc#L717-L721. We call lib_->Run to invoke the Python function, which is supposed to call Notify() when the Python function completes (whether or not it succeeds). For some reason it looks like that callback never happens. It isn’t clear whether that’s because the Python function itself never completes, or because lib_->Run fails to call Notify on some error-handling code path. If I could reproduce, I would add additional logging to see what happens in lib_->Run.
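The second hypothesis, an error path that never invokes the completion callback, can be illustrated with a small Python analogy. This is not TensorFlow's code: `run_async` is a hypothetical stand-in for an asynchronous run call, written to deliberately skip its callback on failure, and `threading.Event` plays the role of the notification being waited on.

```python
import threading


def run_async(work, done_callback):
    """Hypothetical async runner that calls done_callback only on success."""
    def target():
        try:
            work()
        except Exception:
            # The bug being illustrated: the error path returns
            # without invoking done_callback.
            return
        done_callback()

    threading.Thread(target=target, daemon=True).start()


def run_and_wait(work, timeout_s):
    """Return True if the completion callback fired within timeout_s."""
    done = threading.Event()
    run_async(work, done.set)
    # Without a timeout, this wait would block forever when the
    # callback is skipped, which mirrors the observed hang.
    return done.wait(timeout_s)
```

With a task that raises, `run_and_wait` returns False after the timeout instead of blocking indefinitely. Logging at the point where the callback should fire is one way to tell the two hypotheses apart.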