onnxruntime: onnxruntime.InferenceSession hangs

Describe the issue

Hi! I have exported an ONNX model from PyTorch and I am trying to use it for inference, but onnxruntime.InferenceSession hangs without any error. I have tried this on Linux and macOS. When using strace, nothing is shown after the last log message.

The model can be found here: model

To reproduce

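import onnxruntime as ort

# Severity 0 = VERBOSE, so every loading/optimization step is logged
ort.set_default_logger_severity(0)

so = ort.SessionOptions()
# 0 means ONNX Runtime chooses the number of threads itself
print(so.inter_op_num_threads)
print(so.intra_op_num_threads)

print("starting to load")
ort_session = ort.InferenceSession(
    "model_1.onnx",
    sess_options=so,  # use the options whose values were printed above
    providers=["CPUExecutionProvider"],
)
print("finished loading")  # never reached: the constructor hangs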
print("finished loading")

The log gives this:

0
0
starting to load
2024-04-17 14:39:53.167254537 [I:onnxruntime:, inference_session.cc:330 operator()] Flush-to-zero and denormal-as-zero are off
2024-04-17 14:39:53.244762912 [I:onnxruntime:, inference_session.cc:338 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
2024-04-17 14:39:53.244791816 [I:onnxruntime:, inference_session.cc:356 ConstructorCommon] Dynamic block base set to 0
2024-04-17 14:39:53.245374968 [V:onnxruntime:Default, env.cc:248 ThreadMain] pthread_setaffinity_np succeed for thread: 1325925, index: 0, mask: {1, }
2024-04-17 14:39:53.245475139 [V:onnxruntime:Default, env.cc:248 ThreadMain] pthread_setaffinity_np succeed for thread: 1325927, index: 2, mask: {3, }
2024-04-17 14:39:53.245640304 [V:onnxruntime:Default, env.cc:248 ThreadMain] pthread_setaffinity_np succeed for thread: 1325930, index: 5, mask: {6, }
2024-04-17 14:39:53.257577164 [V:onnxruntime:Default, env.cc:248 ThreadMain] pthread_setaffinity_np succeed for thread: 1325926, index: 1, mask: {2, }
2024-04-17 14:39:53.257664768 [V:onnxruntime:Default, env.cc:248 ThreadMain] pthread_setaffinity_np succeed for thread: 1325928, index: 3, mask: {4, }
2024-04-17 14:39:53.257809630 [V:onnxruntime:Default, env.cc:248 ThreadMain] pthread_setaffinity_np succeed for thread: 1325929, index: 4, mask: {5, }
2024-04-17 14:39:53.260755085 [V:onnxruntime:Default, env.cc:248 ThreadMain] pthread_setaffinity_np succeed for thread: 1325932, index: 7, mask: {8, }
2024-04-17 14:39:53.260733432 [V:onnxruntime:Default, env.cc:248 ThreadMain] pthread_setaffinity_np succeed for thread: 1325931, index: 6, mask: {7, }
2024-04-17 14:39:53.264674323 [V:onnxruntime:Default, env.cc:248 ThreadMain] pthread_setaffinity_np succeed for thread: 1325933, index: 8, mask: {9, }
2024-04-17 14:40:02.464322875 [I:onnxruntime:, inference_session.cc:1402 Initialize] Initializing session.
2024-04-17 14:40:02.464374810 [I:onnxruntime:Default, bfc_arena.cc:29 BFCArena] Creating BFCArena for Cpu with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 max_power_of_two_extend_bytes: 1073741824 memory limit: 18446744073709551615 arena_extend_strategy: 0
2024-04-17 14:40:02.495584676 [V:onnxruntime:Default, bfc_arena.cc:66 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2024-04-17 14:40:08.501566408 [I:onnxruntime:, constant_sharing.cc:256 ApplyImpl] Total shared scalar initializer count: 7799
2024-04-17 14:40:23.650234892 [I:onnxruntime:, graph.cc:3556 CleanUnusedInitializersAndNodeArgs] Removing initializer '_val_149'. It is no longer used by any node.
2024-04-17 14:40:23.650378373 [I:onnxruntime:, graph.cc:3556 CleanUnusedInitializersAndNodeArgs] Removing initializer '_val_142'. It is no longer used by any node.
2024-04-17 14:40:23.650385274 [I:onnxruntime:, graph.cc:3556 CleanUnusedInitializersAndNodeArgs] Removing initializer '_val_139'. It is no longer used by any node.
2024-04-17 14:40:23.650391115 [I:onnxruntime:, graph.cc:3556 CleanUnusedInitializersAndNodeArgs] Removing initializer '_val_137'. It is no longer used by any node.
2024-04-17 14:40:23.650403024 [I:onnxruntime:, graph.cc:3556 CleanUnusedInitializersAndNodeArgs] Removing initializer '_val_144'. It is no longer used by any node.
2024-04-17 14:40:23.650410618 [I:onnxruntime:, graph.cc:3556 CleanUnusedInitializersAndNodeArgs] Removing initializer '_val_146'. It is no longer used by any node.
2024-04-17 14:40:23.650428728 [I:onnxruntime:, graph.cc:3556 CleanUnusedInitializersAndNodeArgs] Removing initializer 'ortshared_7_0_1_7'. It is no longer used by any node.
2024-04-17 14:40:23.650434231 [I:onnxruntime:, graph.cc:3556 CleanUnusedInitializersAndNodeArgs] Removing initializer 'ortshared_7_0_1_6'. It is no longer used by any node.
2024-04-17 14:40:23.650441269 [I:onnxruntime:, graph.cc:3556 CleanUnusedInitializersAndNodeArgs] Removing initializer 'ortshared_7_0_1_4'. It is no longer used by any node.
2024-04-17 14:40:23.650446846 [I:onnxruntime:, graph.cc:3556 CleanUnusedInitializersAndNodeArgs] Removing initializer 'ortshared_7_0_1_3'. It is no longer used by any node.
2024-04-17 14:40:23.650456481 [I:onnxruntime:, graph.cc:3556 CleanUnusedInitializersAndNodeArgs] Removing initializer 'ortshared_7_1_1_1'. It is no longer used by any node.
2024-04-17 14:40:23.650463897 [I:onnxruntime:, graph.cc:3556 CleanUnusedInitializersAndNodeArgs] Removing initializer 'ortshared_7_0_1_1'. It is no longer used by any node.
2024-04-17 14:40:23.650473228 [I:onnxruntime:, graph.cc:3556 CleanUnusedInitializersAndNodeArgs] Removing initializer 'ortshared_7_0_1_0'. It is no longer used by any node.
2024-04-17 14:40:23.652591674 [I:onnxruntime:, graph.cc:3556 CleanUnusedInitializersAndNodeArgs] Removing initializer '_val_148'. It is no longer used by any node.
2024-04-17 14:40:23.654060774 [I:onnxruntime:, graph.cc:3556 CleanUnusedInitializersAndNodeArgs] Removing initializer 'ortshared_7_0_1_5'. It is no longer used by any node.
2024-04-17 14:40:23.654923282 [I:onnxruntime:, graph.cc:3556 CleanUnusedInitializersAndNodeArgs] Removing initializer 'ortshared_7_0_1_2'. It is no longer used by any node.
2024-04-17 14:40:27.459510017 [I:onnxruntime:, constant_sharing.cc:256 ApplyImpl] Total shared scalar initializer count: 2
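The timestamps above already show where the time goes: about 9 s pass between the last thread-affinity line and "Initializing session.", about 15 s pass between the first constant-sharing line and the CleanUnusedInitializersAndNodeArgs lines, and output stops after the second constant-sharing line. A small stdlib-only sketch that ranks the pauses between consecutive log lines follows; the regex assumes the `YYYY-MM-DD HH:MM:SS.nnnnnnnnn` prefix seen in this particular log and is not a documented ONNX Runtime log format.

```python
import re
from datetime import datetime

# Matches the prefix seen in this log, e.g.
# "2024-04-17 14:40:02.464322875 [I:onnxruntime:, ...] message".
# The fraction has 9 digits, more than datetime's %f accepts, so it is
# parsed separately.
_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.(\d+)\s+(.*)$")
_EPOCH = datetime(1970, 1, 1)


def parse_line(line):
    """Return (seconds, message) for a timestamped line, else None."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    whole = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    seconds = (whole - _EPOCH).total_seconds() + float("0." + match.group(2))
    return seconds, match.group(3)


def largest_gaps(lines, top=3):
    """Return the `top` largest pauses as (gap_seconds, message_after_gap)."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    gaps = [
        (t1 - t0, message)
        for (t0, _), (t1, message) in zip(entries, entries[1:])
    ]
    # Lines from different threads can be slightly out of order, which yields
    # small negative gaps; sorting in descending order puts them last.
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]
```

Non-log lines such as the printed thread counts are skipped, so the whole captured output (for example a hypothetical `ort.log` file, passed as `largest_gaps(open("ort.log"))`) can be fed in directly.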

Urgency

This is a key step for using our model in production.

Platform

Linux

OS Version

Linux 9.3

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.16.3

ONNX Runtime API

Python

Architecture

X86_64

Execution Provider

Default CPU

Execution Provider Library Version

No response

About this issue

  • State: closed
  • Created 2 months ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

I have optimized the model and now I can start the inference session and run it. Thank you @yuslepukhin and @justinchuby 😃

Awesome! Curious what was done?

The graph had many constants that were created inside functions of the model; I now create them once when the model is initialized instead. There were also some conversion errors: for example, x[..., index_list] is not exported correctly and has to be rewritten with torch.index_select. However, operations like einsum do not seem to support dynamic input shapes (this is for a GNN-like architecture), so that is problematic.
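A minimal PyTorch sketch of the two changes described in that comment, on a hypothetical toy module (class names, weights and indices are illustrative, not taken from the original model): constants move from `forward` into buffers registered once in `__init__`, and list indexing along the last axis is replaced by `torch.index_select`. Both versions compute the same values in eager mode; whether the exported graph gets smaller or exports more cleanly for a given exporter version is an assumption to verify on the real model.

```python
import torch
from torch import nn


class GatherBefore(nn.Module):
    """Hypothetical pattern: constants built and indexed inside forward()."""

    def forward(self, x):
        # Rebuilt on every call; when traced, each call site can add its own
        # constant to the exported graph
        weights = torch.tensor([0.5, 1.0, 2.0])
        return (x * weights)[..., [0, 2]]  # advanced indexing with a list


class GatherAfter(nn.Module):
    """Same computation: constants created once, gather via index_select."""

    def __init__(self):
        super().__init__()
        self.register_buffer("weights", torch.tensor([0.5, 1.0, 2.0]))
        self.register_buffer("index", torch.tensor([0, 2], dtype=torch.long))

    def forward(self, x):
        # Select columns 0 and 2 of the last dimension
        return torch.index_select(x * self.weights, -1, self.index)
```

Registering the values as buffers keeps them on the module (they move with `.to(device)` and appear in `state_dict()`), so they exist once per model rather than once per call.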

Hi @justinchuby, I am wondering whether this could be related to the introduction of the new transformations in 124160. Do you think that could be the case? (Sorry to bother you again.)