tensorflow: Hang on out of memory error

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.3 LTS
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
TensorFlow installed from (source or binary):
TensorFlow version (use command below): 2.1.0 and nightly, used tensorflow/tensorflow:2.1.0-gpu-py3 and tensorflow/tensorflow:nightly-gpu-py3
Python version:
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version:
GPU model and memory: P100 and V100, driver 440.33.01, CUDA 10.1 in container


Describe the current behavior

When TensorFlow runs out of GPU memory, it prints the out-of-memory message and then hangs instead of exiting.

Describe the expected behavior

TensorFlow should exit with a non-zero return code on OOM.

Standalone code to reproduce the issue

import tensorflow as tf
from tensorflow.keras import backend as K

import numpy as np


def random_image_generator(batch_size, num_classes, input_shape):
    templates = 2 * num_classes * np.random.random((num_classes,) + input_shape)
    random_data = np.random.normal(loc=0, scale=1., size=input_shape)
    while True:
        y = np.random.randint(0, num_classes, size=(batch_size,))
        x = np.zeros((batch_size,) + input_shape, dtype=np.float32)
        for i in range(batch_size):
            x[i] = templates[y[i]] + random_data
        x_array = np.array(x)
        y_array = tf.keras.utils.to_categorical(y, num_classes)
        yield(x_array, y_array)

def run_model():
    K.set_image_data_format('channels_first')
    image_dim = 5000
    input_shape = (3, image_dim, image_dim)

    num_classes = 15
    batch_size = 1
    model_class = tf.keras.applications.ResNet50
    model = model_class(weights=None, include_top=True, input_shape=input_shape,
                        classes=num_classes)

    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

    random_generator = random_image_generator(batch_size, num_classes,
                                              input_shape)
    model.fit(random_generator, steps_per_epoch=10,
              epochs=1)

run_model()

Other info / logs

This program hangs after dumping the out-of-memory error on 16GB and 32GB GPUs (P100 and V100 tested). The same program used to exit on TensorFlow 1.15. The hang happens in both the 2.1.0 and nightly containers on Intel x86 systems.
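One way to tell a hang apart from a clean failure is to run the reproducer in a child process under a timeout. The following is a minimal sketch, assuming the script above is saved as repro.py (a hypothetical file name) and that the chosen timeout is long enough for it to reach the OOM. run_with_timeout is an illustrative helper, not part of TensorFlow.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd in a child process.

    Returns ("exited", returncode) if the child finishes within timeout_s
    seconds, or ("hung", None) if it has to be killed after the timeout.
    """
    proc = subprocess.Popen(cmd)
    try:
        returncode = proc.wait(timeout=timeout_s)
        return ("exited", returncode)
    except subprocess.TimeoutExpired:
        # The child did not exit on its own: kill it and reap it.
        proc.kill()
        proc.wait()
        return ("hung", None)


# Hypothetical usage (repro.py and the 600 s limit are assumptions):
#   status, code = run_with_timeout([sys.executable, "repro.py"], 600)
#   print(status, code)
```

With the expected behavior, such a check would report an "exited" status with a non-zero code; with the behavior described here, it would report "hung".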

I originally hit this on built-from-source TensorFlow 2.1.0 on ppc64le. On that system, I attached gdb and dumped the stacks. The code seems to be hanging in the three thread stacks noted in the attachment threeThreadStacks.txt.
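For a quick look at where the Python side is blocked without attaching gdb, the standard-library faulthandler module can dump every Python thread's traceback after a deadline and then terminate the process with a non-zero status. This is only a sketch: arm_hang_watchdog and disarm_hang_watchdog are hypothetical helper names, the 600-second default is an arbitrary assumption, and the dump shows Python frames only, not the C++ stacks from the attachment.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=600, file=sys.stderr):
    """Dump all Python thread tracebacks to `file` after timeout_s seconds,
    then terminate the process with exit status 1.

    The timeout value is an assumption; pick one longer than a normal run.
    """
    faulthandler.dump_traceback_later(timeout_s, exit=True, file=file)


def disarm_hang_watchdog():
    """Cancel the pending dump, for example after training finishes normally."""
    faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around the reproducer:
#   arm_hang_watchdog(600)
#   run_model()
#   disarm_hang_watchdog()
```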

About this issue

State: open
Created 4 years ago
Comments: 17 (17 by maintainers)

Most upvoted comments

I don’t think the issue is with ParallelMapIterator - it was moved between 2.0.0 and 2.1.0, but it has always had the logic of waiting for outstanding calls to finish during destruction: https://github.com/tensorflow/tensorflow/blob/v2.0.0/tensorflow/core/kernels/data/parallel_map_iterator.cc#L70-L79

From the stack trace

#6 0x00007fff462e9cd4 in tensorflow::condition_variable::wait
#7 0x00007fff4114079c in tensorflow::data::InstantiatedCapturedFunction::RunWithBorrowedArgs
#8 0x00007fff40d88d3c in tensorflow::data::GeneratorDatasetOp::Dataset::Iterator::GetNextInternal

it looks like we’re getting stuck here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/data/captured_function.cc#L717-L721. We call lib_->Run to invoke the Python function, and lib_->Run is supposed to call Notify() when the Python function completes (whether or not it succeeds). For some reason it looks like that callback never happens. It isn’t clear whether that’s because the Python function itself never completes, or because lib_->Run fails to call Notify on some error-handling code path. If I could reproduce, I would add additional logging to see what happens in lib_->Run.

aaudiber on Mar 2, 2020
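
The failure mode described in the comment above, a caller blocked on a notification that one code path never sends, can be modelled in plain Python. This is an analogy only, not TensorFlow's implementation: run_async, run_and_wait, simulated_oom, and the notify_on_error flag are hypothetical names, and threading.Event stands in for the C++ notification.

```python
import threading


def run_async(fn, done, notify_on_error=True):
    """Run fn on a worker thread and call done(error) when it finishes.

    notify_on_error=False models a hypothetical buggy error path that
    returns without ever invoking the completion callback.
    """
    def worker():
        try:
            fn()
        except Exception as exc:
            if notify_on_error:
                done(exc)
            return  # buggy variant: leaves without notifying the waiter
        done(None)

    threading.Thread(target=worker, daemon=True).start()


def run_and_wait(fn, notify_on_error=True, timeout_s=None):
    """Block until the completion callback fires, like a wait on a notification.

    With timeout_s=None this would block forever if the callback is skipped.
    """
    finished = threading.Event()
    result = {}

    def done(error):
        result["error"] = error
        finished.set()

    run_async(fn, done, notify_on_error)
    if not finished.wait(timeout_s):
        return "timed out waiting for callback"
    if result["error"] is None:
        return "ok"
    return f"error: {result['error']}"


def simulated_oom():
    raise MemoryError("simulated OOM")
```

Calling run_and_wait(simulated_oom, notify_on_error=False) with no timeout would block indefinitely, which mirrors the symptom in the stack trace. The logging proposed in the comment would help distinguish that case from a Python function that never returns at all.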