tensorflow: tf.keras predict stuck with Sequence when using multi-processing


System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.2
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.13.1
  • Python version: 3.6.8
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 10.0
  • GPU model and memory: TITAN


Describe the current behavior

Hi,

When using tf.keras with a custom Sequence, the program hangs during predict when multi-processing is enabled. I was able to reproduce the issue with a simple NN containing a single Dense layer. The hang occurs after setting the layer's weights and then running predict with multi-processing. When the 'set_weights' line is commented out, or when running with multi-threading instead, the program does not hang. The issue also exists in 1.14.0-rc0; the same code works fine with TensorFlow 1.12.0 and 2.0.0a0.

Code to reproduce the issue

import numpy as np
from tensorflow import keras

INPUT_SIZE = 3
DENSE_OUTPUTS = 2
NUM_OF_SAMPLES = 1000
BATCH_SIZE = 2
NUM_OF_BATCHES = 5


class DummySequence(keras.utils.Sequence):

    def __len__(self):
        return NUM_OF_SAMPLES // BATCH_SIZE

    def __getitem__(self, index):
        data = [np.full(shape=(INPUT_SIZE,), fill_value=(index*BATCH_SIZE + i)) for i in range(BATCH_SIZE)]
        labels = [np.full(shape=(DENSE_OUTPUTS,), fill_value=(index*BATCH_SIZE + i))*INPUT_SIZE for i in range(BATCH_SIZE)]
        return np.stack(data), np.stack(labels)



x = keras.layers.Input(shape=(INPUT_SIZE,))
dense_layer = keras.layers.Dense(DENSE_OUTPUTS)
y = dense_layer(x)
model = keras.Model(x, y)

# remove comment in tf 1.12
#model.compile(optimizer="sgd", loss=keras.losses.mean_squared_error)

shapes = [v.shape for v in dense_layer.weights]
dense_layer.set_weights([np.full(shape=shapes[0], fill_value=1.0), np.full(shape=shapes[1], fill_value=0.0)])

seq = DummySequence()

workers = 5
multiprocessing = True
# works with multi-threading
#multiprocessing = False
print("running predict with multiprocessing: {}".format(multiprocessing))
res = model.predict(seq, workers=workers, use_multiprocessing=multiprocessing, steps=NUM_OF_BATCHES)
print("predict # of results: {}\nresults:\n{}".format(len(res), res))

Other info / logs

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 21 (7 by maintainers)

Most upvoted comments

@omalleyt12 I’m facing the exact same issue on v2.1.0.

I plan on giving tf-nightly a try. But will it impact my previously trained models if I upgrade?

I faced the same issue, and it seems to be resolved by adding this:

from tensorflow.python.keras import backend as K
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
K.set_session(session)

It seems to be an unrelated problem, but it works.
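For readers on TF 2.x, where tf.ConfigProto and tf.Session have moved under tf.compat.v1, a sketch of the equivalent memory-growth setting (note: set_memory_growth must be called before any GPU has been initialized):

```python
import tensorflow as tf

# TF 2.x equivalent of allow_growth: let TensorFlow allocate GPU memory
# on demand instead of grabbing all of it up front.
gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```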

Is there any progress on the issue? Thanks

My multi-process case: I can run the same code perfectly on my Windows machine; however, it does not work on Ubuntu. It seems to be caused by the difference in how Windows and Linux create new processes. On Linux, fork() is called by default to create a new process, and you can manually switch to spawn, which creates a new process from scratch instead of inheriting state from the parent.

You can try adding the following code at the top, which forces Linux to use the spawn method:

# note: this uses the third-party `multiprocess` package (from pathos),
# not the stdlib `multiprocessing`
import multiprocess.context as ctx
ctx._force_start_method('spawn')

Reference: (https://stackoverflow.com/questions/40615795/pathos-enforce-spawning-on-linux)