tensorflow: Forked tf script deadlocks unless disabling intra op parallelism
System information
- Have I written custom code: Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04.2 LTS
- Mobile device: No
- TensorFlow installed from: pip
- TensorFlow version (use command below): v2.6.0-rc2-32-g919f693420e 2.6.0
- Python version: 3.9.6
- Bazel/GCC/Compiler version: Not compiling from source
- CUDA/cuDNN/GPU version: Using CPU only
Describe the current behavior
The R&D team at our company and I are preparing a submission for a standardized benchmark, the NIST FRVT 1:1 verification. Their benchmark suite employs this architecture:
- They run a custom submitter `initialize` function, where one can prepare the environment and run expensive setup such as loading models and preparing temporary structures;
- They fork the process with Unix `fork()` a number of times and run either a `create_template` or a `match` function in the forked children.

According to the rules of the benchmark, the computation must be performed on CPU only; no GPU processing is allowed.
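The submitter architecture described above can be mimicked without TensorFlow. A minimal sketch, where `initialize` and `create_template` are hypothetical stand-ins for the benchmark's real entry points:

```python
import os

def initialize():
    # Expensive one-time setup would happen here (model loading, etc.).
    return {"model": "loaded"}

def create_template(state, job_id):
    # Hypothetical stand-in for the per-child work.
    return f"template-{job_id}-{state['model']}"

def run_benchmark(n_children=3):
    state = initialize()              # runs once, in the parent
    pids = []
    for job_id in range(n_children):
        pid = os.fork()
        if pid == 0:                  # child: do the work, then exit hard
            create_template(state, job_id)
            os._exit(0)
        pids.append(pid)
    # Parent reaps every child and collects exit codes.
    return [os.waitstatus_to_exitcode(os.waitpid(p, 0)[1]) for p in pids]
```

The key property is that all expensive state is built once in the parent and only read in the children, which is exactly the pattern the benchmark rewards and the one that trips up TensorFlow's threadpools.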
Since the architecture is not under our control, we are trying to make TensorFlow work within these rules, but we find that the child processes deadlock while evaluating some layers. The only workaround we found is setting intra-op parallelism to 1 with `tf.config.threading.set_intra_op_parallelism_threads(1)`, which appears to disable parallelism for operations like matrix multiplications. This workaround will not work in all scenarios, though. Loading the model in the forked children also works around the issue, but would penalize us in the benchmark, since the loading time would be counted toward the processing time.
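The deadlock is consistent with standard `fork()` semantics: only the thread that calls `fork()` exists in the child, so any threadpool TensorFlow created in the parent (such as the intra-op pool) is gone, while any locks its threads held at fork time stay locked forever. A minimal pure-Python demonstration of the thread loss (no TensorFlow involved):

```python
import os
import threading

def show_fork_thread_loss():
    # Spawn a worker thread that stays alive in the parent.
    stop = threading.Event()
    worker = threading.Thread(target=stop.wait, daemon=True)
    worker.start()
    parent_threads = threading.active_count()  # at least 2 here

    pid = os.fork()
    if pid == 0:
        # In the child, only the thread that called fork() survives.
        # The worker thread is gone, but any locks it held remain
        # locked -- this is the generic recipe for a post-fork deadlock.
        child_threads = threading.active_count()
        os._exit(0 if child_threads == 1 else 1)

    _, status = os.waitpid(pid, 0)
    stop.set()
    return parent_threads, os.waitstatus_to_exitcode(status)
```

Setting intra-op parallelism to 1 avoids the deadlock precisely because no extra pool threads (and no pool locks) exist at fork time.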
Describe the expected behavior
Since we are enforcing CPU-only computation and there are no resources that require exclusive access, we expect TensorFlow to fit this architecture correctly and run in forked processes.
Standalone code to reproduce the issue
The following minimal python script will mimic the architecture and reproduce the issue. Model and test image used are linked.
It’s also available as a Colab notebook, which shows behavior very similar to a local machine, with the difference that in the Colab notebook the os.waitpid() call never succeeds, but that could be an environment limitation.
import os
import tensorflow as tf
import cv2
import numpy

tf.config.set_visible_devices(tf.config.list_physical_devices('CPU'))
#tf.config.threading.set_intra_op_parallelism_threads(1)  # Uncomment this line and the child will not deadlock
#tf.debugging.set_log_device_placement(True)  # Uncomment to see job placements

model = None

def initialize():
    global model
    model = tf.keras.models.load_model('cats_vs_dogs_model_86_83.h5')
    print('Model Loaded')

def child():
    print('Child spawned')
    imageSize = 128
    testImage = cv2.resize(src=cv2.imread('cat.jpg'), dsize=(imageSize, imageSize), interpolation=cv2.INTER_LINEAR) / 255
    result = model(testImage.reshape(-1, imageSize, imageSize, 3))[0]
    print('Result: Cat: ' + str(result[0]) + ' | Dog: ' + str(result[1]))
    print('Child finished')

def parent():
    initialize()
    newpid = os.fork()
    if newpid == 0:
        child()
    else:
        pids = (os.getpid(), newpid)
        print("Parent: %d, Waiting for child: %d\n" % pids)
        os.waitpid(newpid, 0)

parent()
Other info / logs
Enabling log device placement reveals that processing stalls at a convolution layer, but in other scenarios the deadlock can occur in different operations.
Child spawned
2021-09-04 12:10:03.207086: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:CPU:0
2021-09-04 12:10:03.209115: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op Cast in device /job:localhost/replica:0/task:0/device:CPU:0
2021-09-04 12:10:03.210155: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:CPU:0
2021-09-04 12:10:03.211392: I tensorflow/core/common_runtime/eager/execute.cc:1161] Executing op Conv2D in device /job:localhost/replica:0/task:0/device:CPU:0
We are using NumPy version 1.19.5, as installed automatically by pip. We also tried enforcing the updated NumPy 1.21.2, as suggested here in another fork()-related issue, but that didn’t help.
- Do you want to contribute a PR? (yes/no): No
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 15 (6 by maintainers)
@ceztko, sorry for the late response.
I may have a work-around if you’re still interested. The basic idea is (a) disable use of a global threadpool using an environment variable and (b) use an internal API to reset the context:
This should cause most of the threadpools to re-initialize in the child. It works with your example [colab], and hopefully it will work in a real use-case as well.
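A sketch of part (b) of the pattern described above, assuming the private `_reset_context()` helper in `tensorflow.python.eager.context` is the context-reset API meant (it is internal and may change between releases); the exact environment variable for disabling the global threadpool is not quoted in this excerpt, so it is omitted here:

```python
import os

def run_in_fresh_child(work):
    """Fork, reset TensorFlow's eager context in the child, then run `work`.

    Sketch only: _reset_context() is a private TF API and may change.
    Returns the child's exit code.
    """
    pid = os.fork()
    if pid == 0:
        try:
            from tensorflow.python.eager import context
            context._reset_context()  # re-create eager context / threadpools post-fork
        except ImportError:
            pass  # TensorFlow not installed; nothing to reset
        work()
        os._exit(0)
    return os.waitstatus_to_exitcode(os.waitpid(pid, 0)[1])
```

Resetting in the child forces the pools to be rebuilt with live threads, instead of inheriting parent pool state whose threads no longer exist after `fork()`.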