onnxruntime: `InferenceSession` initialization hangs
Describe the bug I use ONNX to release/distribute production models for modified base detection from Oxford Nanopore sequencing data in the Remora repository. A user has reported an issue where onnxruntime hangs indefinitely when initializing an inference session from one of these released models.
Urgency As soon as possible, as these models are currently in production.
System information @mattloose may be able to provide more information here.
To Reproduce See the details of the issue in this thread (https://github.com/nanoporetech/bonito/issues/216), but the issue can be reproduced with the following snippet (after downloading this model):
import onnxruntime as ort
ort.set_default_logger_severity(0)
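# severity 0 above enables verbose logging; the InferenceSession constructor below is where the reported hang occurs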
model = ort.InferenceSession('modbase_model.onnx', providers=['CPUExecutionProvider'])
Upon running the above snippet, the reporting user sees the following messages, after which the code stalls without completing.
2022-01-03 19:53:44.686663926 [I:onnxruntime:, inference_session.cc:273 operator()] Flush-to-zero and denormal-as-zero are off
2022-01-03 19:53:44.686776610 [I:onnxruntime:, inference_session.cc:280 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
Expected behavior The model loads and code execution continues.
Screenshots Not applicable.
Additional context Not applicable.
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 26 (5 by maintainers)
Using docker (ubuntu16) on kubernetes may cause this problem, but the strange thing is that different node machines have different behavior: some get stuck and some don't.
Using the following settings solves the problem: so.inter_op_num_threads = 1 and so.intra_op_num_threads = 1.
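For reference, a minimal sketch of this workaround applied to the reproduction snippet above; the SessionOptions fields are standard onnxruntime API, and the model filename is taken from the original snippet:

import onnxruntime as ort

so = ort.SessionOptions()
so.inter_op_num_threads = 1   # threads used to run independent graph nodes in parallel
so.intra_op_num_threads = 1   # threads used within a single operator
model = ort.InferenceSession('modbase_model.onnx',
                             sess_options=so,
                             providers=['CPUExecutionProvider'])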
I've run a couple of quick tests, and it seems that setting the inter/intra_op_num_threads values to 1 actually increases the speed of remora in most settings tried thus far. When running megalodon with a single process, 1 thread is marginally slower than 0, but otherwise (several bonito settings and other megalodon settings) the 1-thread setting is either no different from or faster than the 0-thread setting. I will get this into remora and try to get a new release pushed shortly. I think it makes sense to leave this issue open since the underlying problem remains, but when the new remora code is pushed I will close the bonito/remora issue.
I can confirm that any of those parameters work (and the model loads).
So to confirm:
so.inter_op_num_threads = 0 and so.intra_op_num_threads = 0 does not work.
All other permutations DO work.
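A hypothetical way to check all four permutations in one run (not from the thread): load the model in a subprocess with a timeout for each setting, since the 0/0 case hangs rather than raising. The model filename and the 60-second timeout are assumptions.

import itertools
import subprocess
import sys

snippet = (
    "import onnxruntime as ort\n"
    "so = ort.SessionOptions()\n"
    "so.inter_op_num_threads = {inter}\n"
    "so.intra_op_num_threads = {intra}\n"
    "ort.InferenceSession('modbase_model.onnx', sess_options=so, providers=['CPUExecutionProvider'])\n"
)

for inter, intra in itertools.product([0, 1], repeat=2):
    try:
        # each permutation runs in its own interpreter so a hang or segfault cannot take down the loop
        subprocess.run([sys.executable, "-c", snippet.format(inter=inter, intra=intra)],
                       timeout=60, check=True)
        print(f"inter={inter} intra={intra}: model loaded")
    except subprocess.TimeoutExpired:
        print(f"inter={inter} intra={intra}: hung (timed out)")
    except subprocess.CalledProcessError:
        print(f"inter={inter} intra={intra}: crashed")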
Looking forward to trying this out soon.
@benbfly this is very helpful, thanks - it points to onnxruntime/core/platform/posix/env.cc#L180 as the source of the bad sched_setaffinity syscall, and this looks to be the same issue as https://github.com/microsoft/onnxruntime/issues/8313.
@mattloose @benbfly https://github.com/microsoft/onnxruntime/issues/8313 suggests setting inter_op_num_threads and intra_op_num_threads to 1. Can you try:
@pranavsharma is there any extra information we can provide to help with this?
I get the same result. When I change it to anything but 0/0, it completes normally (I tried 1/1, 2/1, and 1/2). When I use 0/0, it either hangs or gives that “pthread_setaffinity_np failed” error and segfaults.
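One hypothetical way to see why behaviour differs between nodes (an assumption on my part, not something tested in this thread): compare the CPUs the container is actually allowed to use with the total core count that the default thread pool is sized from.

import os
import multiprocessing

# on Linux, sched_getaffinity reports the CPUs this process may be pinned to,
# which inside a Kubernetes pod with a restricted cpuset can be far fewer than
# the node's total logical core count
print("CPUs available to this process:", sorted(os.sched_getaffinity(0)))
print("Total logical cores on the node:", multiprocessing.cpu_count())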
@marcus1487 hopefully this can be implemented in Remora without reducing the efficiency. In the #8313 issue, they said it got slower when they set this equal to the number of CPU cores (but my understanding of this issue is quite limited): https://github.com/microsoft/onnxruntime/issues/8313#issuecomment-876885818