onnxruntime: `InferenceSession` initialization hangs
Describe the bug I use ONNX to release/distribute production models for modified base detection from Oxford Nanopore sequencing data in the Remora repository. A user has reported an issue where onnxruntime hangs indefinitely when initializing an inference session from one of these released models.
Urgency As soon as possible, as these models are currently in production.
System information @mattloose may be able to provide more information here.
To Reproduce See the details of the issue in this thread (https://github.com/nanoporetech/bonito/issues/216), but the issue can be reproduced with the following snippet (after downloading this model):
import onnxruntime as ort
ort.set_default_logger_severity(0)
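# severity 0 above enables verbose logging; the InferenceSession constructor below is where the reported hang occurs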
model = ort.InferenceSession('modbase_model.onnx', providers=['CPUExecutionProvider'])
Upon running the above snippet, the reporting user sees the following messages, after which the code stalls without completing.
2022-01-03 19:53:44.686663926 [I:onnxruntime:, inference_session.cc:273 operator()] Flush-to-zero and denormal-as-zero are off
2022-01-03 19:53:44.686776610 [I:onnxruntime:, inference_session.cc:280 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
Expected behavior The model loads and code execution continues.
Screenshots Not applicable.
Additional context Not applicable.
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 26 (5 by maintainers)
Using docker (ubuntu16) on kubernetes may cause this problem, but the strange thing is that different node machines have different behavior: some get stuck and some don't.
Using the following settings solves the problem: so.inter_op_num_threads = 1 and so.intra_op_num_threads = 1.
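For reference, a minimal sketch of this workaround applied to the reproduction snippet above; the SessionOptions fields are standard onnxruntime API, and the model filename is taken from the original snippet:

import onnxruntime as ort

so = ort.SessionOptions()
so.inter_op_num_threads = 1   # threads used to run independent graph nodes in parallel
so.intra_op_num_threads = 1   # threads used within a single operator
model = ort.InferenceSession('modbase_model.onnx',
                             sess_options=so,
                             providers=['CPUExecutionProvider'])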
I've run a couple of quick tests, and it seems that setting the inter/intra_op_num_threads values to 1 actually increases the speed of remora in most settings tried thus far. When running megalodon with a single process, 1 thread is marginally slower than 0, but otherwise (several bonito settings and other megalodon settings) the 1-thread setting is either no different from or faster than the 0-thread setting. I will get this into remora and try to get a new release pushed shortly. I think it makes sense to leave this issue open since the underlying problem remains, but when the new remora code is pushed I will close the bonito/remora issue.
I can confirm that any of those parameters work (and the model loads).
So to confirm:
so.inter_op_num_threads = 0 and so.intra_op_num_threads = 0 does not work.
All other permutations DO work.
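A hypothetical way to check all four permutations in one run (not from the thread): load the model in a subprocess with a timeout for each setting, since the 0/0 case hangs rather than raising. The model filename and the 60-second timeout are assumptions.

import itertools
import subprocess
import sys

snippet = (
    "import onnxruntime as ort\n"
    "so = ort.SessionOptions()\n"
    "so.inter_op_num_threads = {inter}\n"
    "so.intra_op_num_threads = {intra}\n"
    "ort.InferenceSession('modbase_model.onnx', sess_options=so, providers=['CPUExecutionProvider'])\n"
)

for inter, intra in itertools.product([0, 1], repeat=2):
    try:
        # each permutation runs in its own interpreter so a hang or segfault cannot take down the loop
        subprocess.run([sys.executable, "-c", snippet.format(inter=inter, intra=intra)],
                       timeout=60, check=True)
        print(f"inter={inter} intra={intra}: model loaded")
    except subprocess.TimeoutExpired:
        print(f"inter={inter} intra={intra}: hung (timed out)")
    except subprocess.CalledProcessError:
        print(f"inter={inter} intra={intra}: crashed")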
Looking forward to trying this out soon.
@benbfly this is very helpful, thanks - it points to onnxruntime/core/platform/posix/env.cc#L180 as the source of the bad sched_setaffinity syscall, and this looks to be the same issue as https://github.com/microsoft/onnxruntime/issues/8313.
@mattloose @benbfly https://github.com/microsoft/onnxruntime/issues/8313 suggests setting inter_op_num_threads and intra_op_num_threads to 1. Can you try:
@pranavsharma is there any extra information we can provide to help with this?
I get the same result. When I change it to anything but 0/0, it completes normally (I tried 1/1, 2/1, and 1/2). When I use 0/0, it either hangs or gives that “pthread_setaffinity_np failed” error and segfaults.
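One hypothetical way to see why behaviour differs between nodes (an assumption on my part, not something tested in this thread): compare the CPUs the container is actually allowed to use with the total core count that the default thread pool is sized from.

import os
import multiprocessing

# on Linux, sched_getaffinity reports the CPUs this process may be pinned to,
# which inside a Kubernetes pod with a restricted cpuset can be far fewer than
# the node's total logical core count
print("CPUs available to this process:", sorted(os.sched_getaffinity(0)))
print("Total logical cores on the node:", multiprocessing.cpu_count())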
@marcus1487 hopefully this can be implemented in Remora without reducing the efficiency. In the #8313 issue, they said it got slower when they set this equal to the number of CPU cores (but my understanding of this issue is quite limited): https://github.com/microsoft/onnxruntime/issues/8313#issuecomment-876885818