tensorflow: Forked tf script deadlocks unless disabling intra op parallelism

System information

  • Have I written custom code: Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04.2 LTS
  • Mobile device: No
  • TensorFlow installed from: pip
  • TensorFlow version (use command below): v2.6.0-rc2-32-g919f693420e 2.6.0
  • Python version: 3.9.6
  • Bazel/GCC/Compiler version: Not compiling from source
  • CUDA/cuDNN/GPU version: Using CPU only

Describe the current behavior

The R&D team at our company and I are preparing a submission for a standardized benchmark, namely NIST FRVT 1:1 verification. Their benchmark suite uses the following architecture:

  • They run a custom submitter-provided initialize function, where one can prepare the environment and run expensive setup such as loading models and preparing temporary structures;
  • They fork the process with Unix fork() a number of times and run either a create_template or a match function in the forked children. According to the benchmark rules, all processing must be performed on the CPU only; no GPU computation is allowed.

Since the architecture is not under our control and we can’t modify it, we are trying to adapt TensorFlow to work within these rules, but the child processes deadlock while executing some layers. The only workaround we found is setting intra-op parallelism to 1 with tf.config.threading.set_intra_op_parallelism_threads(1), which appears to disable parallelism for operations like matrix multiplications. This workaround will not work in all scenarios, though. Loading the model in the forked children also works around the issue, but it penalizes us in the benchmark, since the loading time is counted toward processing time.

Describe the expected behavior

Since we are enforcing CPU-only processing and there are no resources that require exclusive access, we expect TensorFlow to fit this architecture correctly and be able to run in forked processes.

Standalone code to reproduce the issue

The following minimal Python script mimics the architecture and reproduces the issue. The model and test image used are linked. It is also available as a Colab notebook, which shows very similar behavior to a local machine, with the difference that in the Colab notebook the os.waitpid() call never returns, but that could be an environment limitation.

import os
import tensorflow as tf
import cv2
import numpy

tf.config.set_visible_devices(tf.config.list_physical_devices('CPU'))
#tf.config.threading.set_intra_op_parallelism_threads(1) # Uncomment this line and the child will not deadlock
#tf.debugging.set_log_device_placement(True)             # Uncomment to see op placements

model = None

def initialize():
    global model
    model = tf.keras.models.load_model('cats_vs_dogs_model_86_83.h5')
    print('Model Loaded')

def child():
    print('Child spawned')
    imageSize = 128
    testImage = cv2.resize(src=cv2.imread('cat.jpg'), dsize=(imageSize, imageSize), interpolation=cv2.INTER_LINEAR) / 255
    result = model(testImage.reshape(-1, imageSize, imageSize, 3))[0]
    print('Result: Cat: ' + str(result[0]) + ' | Dog: ' + str(result[1]))
    print('Child finished')

def parent():
    initialize()
    newpid = os.fork()
    if newpid == 0:
        child()
        os._exit(0)  # prevent the child from falling through into the parent's code path
    else:
        pids = (os.getpid(), newpid)
        print("Parent: %d, Waiting for child: %d\n" % pids)
        os.waitpid(newpid, 0)

parent()

Other info / logs

Enabling log device placement reveals that processing hangs in a convolution layer, though in other scenarios the deadlock can occur in different operations.

Child spawned
2021-09-04 12:10:03.207086: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:CPU:0
2021-09-04 12:10:03.209115: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op Cast in device /job:localhost/replica:0/task:0/device:CPU:0
2021-09-04 12:10:03.210155: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:CPU:0
2021-09-04 12:10:03.211392: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op Conv2D in device /job:localhost/replica:0/task:0/device:CPU:0

We are using NumPy 1.19.5 as installed automatically by pip. We also tried forcing an updated NumPy 1.21.2, as suggested here in another fork()-related issue, but that didn’t help.

Contributing

  • Do you want to contribute a PR? (yes/no): No

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

@ceztko, sorry for the late response.

I may have a work-around if you’re still interested. The basic idea is to (a) disable re-use of the global threadpool via an environment variable, and (b) use an internal API to reset the context:

import os

# Import internal context API.
from tensorflow.python.eager import context

# Disable global threadpool re-use.
os.environ["TF_OVERRIDE_GLOBAL_THREADPOOL"] = "1"
#...

def child():
  # Reset context.
  context._reset_context()
  tf.config.threading.set_inter_op_parallelism_threads(1)
  tf.config.threading.set_intra_op_parallelism_threads(1)
  # ...

This should cause most of the threadpools to re-initialize in the child. It works with your example [colab], and hopefully it will work in a real use case as well.
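Below is a minimal sketch that folds this workaround into the fork pattern from the repro script. It relies on two assumptions worth hedging: TF_OVERRIDE_GLOBAL_THREADPOOL must be honored by the installed TensorFlow build, and context._reset_context() is a private, unstable API that may change between releases. A tiny in-memory model stands in for the linked 'cats_vs_dogs_model_86_83.h5', so the sketch is self-contained; whether tensors created before the reset remain usable may also depend on the TF version.

```python
import os

# Must be set before TensorFlow creates its eager context so the child
# can get its own threadpools instead of re-using a global one.
os.environ["TF_OVERRIDE_GLOBAL_THREADPOOL"] = "1"

import numpy as np
import tensorflow as tf
from tensorflow.python.eager import context  # internal, unstable API

model = None

def initialize():
    global model
    # Stand-in for tf.keras.models.load_model('cats_vs_dogs_model_86_83.h5').
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(4, 3, activation='relu',
                               input_shape=(128, 128, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(2, activation='softmax'),
    ])

def child():
    # Re-create the eager context so the child builds fresh threadpools
    # instead of inheriting dead worker threads from the parent.
    context._reset_context()
    tf.config.threading.set_inter_op_parallelism_threads(1)
    tf.config.threading.set_intra_op_parallelism_threads(1)
    batch = np.zeros((1, 128, 128, 3), dtype=np.float32)
    result = model(batch)[0]
    print('Result: Cat: %f | Dog: %f' % (result[0], result[1]))

def parent():
    initialize()
    newpid = os.fork()
    if newpid == 0:
        child()
        os._exit(0)  # do not fall through into the parent's code path
    else:
        os.waitpid(newpid, 0)

if __name__ == '__main__':
    parent()
```

Note that the environment variable is set before importing TensorFlow; setting it after the context has already been initialized may have no effect.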