onnx-mlir: Segmentation fault when a model was compiled with -O0, -O1

I got a segmentation fault with many models when they were compiled with -O0 or -O1, e.g. with resnet50:

ONNX_MLIR_HOME=/home/tungld/dl/onnx-mlir/build/Debug python ../utils/RunONNXModel.py resnet50.onnx --compile_args="-mcpu=z14 -O0"
Temporary directory has been created at /tmp/tmpjf_k8lqx
Generating random inputs ...
  - 1st input's shape (1, 3, 224, 224)
  done.

Compiling the model ...
Shared library /tmp/tmpjf_k8lqx/model.so has been compiled.
  took  57.782129378058016  seconds.

Running inference ...
Segmentation fault (core dumped)

There was no issue when using -O2 or -O3.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 24

Most upvoted comments

Verified that resnet50-v1-7.onnx can now compile and run successfully at -O0, -O1, -O2, and -O3 on LoP. @tungld Please reopen if there are other models that still need to be looked at.

We can hoist allocations out of loops with --buffer-hoisting.

@tungld Thanks for the suggestion! I found a way to avoid the problem by changing how the alloca instructions are created in the first place.

I have narrowed down the runtime problem at -O0 to an alloca and its users inside a multi-level loop nest, which can cause a segfault when the stack runs out of space. At -O1 and above, InstCombine in opt is able to remove such allocas and all their users, as they are actually dead code. (When I move the alloca to the beginning of the function, the model runs successfully.)
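
For readers unfamiliar with why a stack allocation inside a loop can exhaust the stack, here is a minimal C sketch of the effect. It is not the code onnx-mlir generates; the function names are hypothetical, and it assumes a glibc/Linux toolchain where alloca is declared in <alloca.h>. A block obtained from alloca is only released when the enclosing function returns, so an allocation inside a loop claims fresh stack space on every iteration, while a hoisted allocation is made once and reused.

```c
#include <alloca.h>  /* assumption: glibc/Linux; other platforms declare alloca elsewhere */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: distance in bytes between two addresses. */
static size_t distance(const void *a, const void *b) {
    uintptr_t x = (uintptr_t)a, y = (uintptr_t)b;
    return (size_t)(x > y ? x - y : y - x);
}

/* alloca inside the loop: each iteration claims a new block, and none of
 * them is released until the function returns, so stack usage grows with
 * the trip count. Returns the distance between the first and last block. */
size_t growth_alloca_in_loop(size_t iters, size_t bytes) {
    assert(iters >= 2 && bytes > 0);
    volatile char *first = NULL, *last = NULL;
    for (size_t i = 0; i < iters; ++i) {
        volatile char *buf = alloca(bytes);
        buf[0] = (char)i;  /* touch the block so the allocation is used */
        if (i == 0) first = buf;
        last = buf;
    }
    return distance((const void *)first, (const void *)last);
}

/* alloca hoisted to the start of the function: one block, reused by every
 * iteration, so stack usage does not depend on the trip count. */
size_t growth_alloca_hoisted(size_t iters, size_t bytes) {
    assert(iters >= 2 && bytes > 0);
    volatile char *buf = alloca(bytes);
    volatile char *first = NULL, *last = NULL;
    for (size_t i = 0; i < iters; ++i) {
        buf[0] = (char)i;
        if (i == 0) first = buf;
        last = buf;
    }
    return distance((const void *)first, (const void *)last);
}
```

In the in-loop version, the first and last blocks end up at least (iters - 1) * bytes apart. In the hoisted version the distance is zero. With a deep loop nest and a large trip count, the in-loop pattern can exceed the stack limit, which matches the segfault described above.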