onnxruntime: Segfault in CApiTest.test_custom_op_library on Linux x86_64, CUDA 11.8

Describe the issue

CApiTest.test_custom_op_library passes, but the process then hangs for several minutes and segfaults. It seems to be segfaulting in an exit handler after gtest has finished. The rest of the tests run fine, so I’m not sure what could be causing this. I can recompile and run the test with debug symbols if required. Recompiling libstdc++ with debug symbols would be more difficult, and I would prefer to avoid it.

[ RUN      ] CApiTest.test_custom_op_library
Running inference using custom op shared library
Running simple inference with cuda provider
2024-03-04 22:02:21.457378304 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 1 Memcpy nodes are added to the graph CustomOpTest for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-03-04 22:02:21.457482983 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-03-04 22:02:21.457512237 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
[       OK ] CApiTest.test_custom_op_library (800 ms)
[----------] 1 test from CApiTest (800 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (800 ms total)
[  PASSED  ] 1 test.
Segmentation fault (core dumped)

Here’s a backtrace:

Thread 1 "onnxruntime_sha" received signal SIGSEGV, Segmentation fault.
0x00007f60e1bf9ea0 in ?? ()
(gdb) bt
#0  0x00007f60e1bf9ea0 in ?? ()
#1  0x00007f60e439d896 in (anonymous namespace)::run (p=<optimized out>) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libstdc++-v3/libsupc++/atexit_thread.cc:80
#2  (anonymous namespace)::run () at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libstdc++-v3/libsupc++/atexit_thread.cc:105
#3  0x00007f60e3f59ce9 in __run_exit_handlers () from /lib64/libc.so.6
#4  0x00007f60e3f59d37 in exit () from /lib64/libc.so.6
#5  0x00007f60e3f4255c in __libc_start_main () from /lib64/libc.so.6
#6  0x0000000000431c51 in _start ()
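
Frame #0 has no symbol, so one way to narrow it down is to check whether the faulting address still belongs to any mapped file. While the process is stopped in gdb at the SIGSEGV, /proc/<pid>/maps is still readable. The following is a minimal sketch, not part of onnxruntime or gdb; the helper name find_mapping is hypothetical.

```python
from typing import Optional


def find_mapping(maps_text: str, address: int) -> Optional[str]:
    """Return the pathname of the mapping that contains `address`, or None.

    `maps_text` is the content of /proc/<pid>/maps, whose lines have the form
    'start-end perms offset dev inode [pathname]'.
    """
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start_hex, end_hex = fields[0].split("-")
        start, end = int(start_hex, 16), int(end_hex, 16)
        if start <= address < end:
            # Anonymous mappings have no pathname field.
            return fields[5].strip() if len(fields) > 5 else "[anonymous]"
    return None
```

For example, find_mapping(open("/proc/<pid>/maps").read(), 0x7f60e1bf9ea0), with <pid> replaced by the stopped process ID. A result of None would be consistent with a jump into memory that is no longer mapped (for instance, code from a shared library that has already been unloaded), though it does not prove that on its own.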

I’m running from a conda environment with gcc 11.2 and matching versions of the related libraries (e.g. libstdc++).

To reproduce

Compile onnxruntime and run the unit tests using CMake and the build.py args defined here, on a machine with an Nvidia GPU and the driver spec given below:

(base) [root@3efd49791a0a /]# nvidia-smi
Mon Mar  4 22:58:06 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1E.0 Off |                  328 |
| N/A   34C    P0    39W / 150W |    121MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Urgency

No response

Platform

Linux

OS Version

Amazon Linux 2023

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.17.0

ONNX Runtime API

C

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.8

About this issue

  • State: closed
  • Created 4 months ago
  • Comments: 17 (6 by maintainers)

Most upvoted comments

I saved part of it. build_log.zip

I managed to fix the error by swapping the libstdc++ in the conda environment with the system one. Both binaries were named libstdc++.so.6.0.29.

The conda libstdc++ is from the package

libstdcxx-ng:                         11.2.0-h1234567_1
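
Since both files carry the same name, a quick way to confirm that they really are different builds is to compare their hashes. This is a rough sketch using only the Python standard library; the two paths in the usage note are assumptions about where the conda and system copies might live, not paths taken from this report.

```python
import hashlib


def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large libraries are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_binary(path_a: str, path_b: str) -> bool:
    """True if both files have identical contents."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Usage might look like same_binary("/opt/conda/lib/libstdc++.so.6.0.29", "/usr/lib64/libstdc++.so.6.0.29"), with both paths adjusted to the actual environment. A False result only shows that the builds differ; it does not say which difference matters.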

I’m also going to compile outside conda and see what happens.

Good suggestion. ldd should help. Additionally, the test binaries themselves should also be built with symbols.
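
The ldd check mentioned above could be scripted roughly as follows, to see which libstdc++ the test binary and the custom op library each resolve to. This is only a sketch: the function names are hypothetical, and the binary path must be supplied by the caller.

```python
import subprocess
from typing import Dict


def parse_ldd(output: str) -> Dict[str, str]:
    """Map each library name in `ldd` output to the path it resolved to."""
    resolved = {}
    for line in output.splitlines():
        if "=>" not in line:
            continue  # e.g. the vdso or the dynamic loader line
        name, _, target = line.partition("=>")
        # Drop the trailing load address, e.g. " (0x00007f...)".
        path = target.split(" (")[0].strip()
        if path and path != "not found":
            resolved[name.strip()] = path
    return resolved


def resolved_libs(binary: str) -> Dict[str, str]:
    """Run `ldd` on `binary` and return its resolved libraries."""
    result = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    )
    return parse_ldd(result.stdout)
```

Calling resolved_libs("path/to/test_binary").get("libstdc++.so.6") on both the test executable and the custom op shared library would show whether they pick up the conda copy or the system copy. Note that ldd reflects the current environment (LD_LIBRARY_PATH, rpath), so it should be run in the same shell as the failing test.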