onnxruntime: Segfault in CApiTest.test_custom_op_library on Linux x86_64, CUDA 11.8
Describe the issue
The test named in the title passes, but the process then hangs for some minutes and segfaults. It seems to be segfaulting in the exit handler of gtest. The rest of the tests run fine, so I’m not sure what could be causing this. I can re-compile and run the test with debug symbols if required. Re-compiling libstdc++ with debug symbols would be more difficult, and I’d prefer to avoid it.
[ RUN ] CApiTest.test_custom_op_library
Running inference using custom op shared library
Running simple inference with cuda provider
2024-03-04 22:02:21.457378304 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 1 Memcpy nodes are added to the graph CustomOpTest for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-03-04 22:02:21.457482983 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-03-04 22:02:21.457512237 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
[ OK ] CApiTest.test_custom_op_library (800 ms)
[----------] 1 test from CApiTest (800 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (800 ms total)
[ PASSED ] 1 test.
Segmentation fault (core dumped)
Here’s a backtrace:
Thread 1 "onnxruntime_sha" received signal SIGSEGV, Segmentation fault.
0x00007f60e1bf9ea0 in ?? ()
(gdb) bt
#0 0x00007f60e1bf9ea0 in ?? ()
#1 0x00007f60e439d896 in (anonymous namespace)::run (p=<optimized out>) at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libstdc++-v3/libsupc++/atexit_thread.cc:80
#2 (anonymous namespace)::run () at /opt/conda/conda-bld/gcc-compiler_1654084175708/work/gcc/libstdc++-v3/libsupc++/atexit_thread.cc:105
#3 0x00007f60e3f59ce9 in __run_exit_handlers () from /lib64/libc.so.6
#4 0x00007f60e3f59d37 in exit () from /lib64/libc.so.6
#5 0x00007f60e3f4255c in __libc_start_main () from /lib64/libc.so.6
#6 0x0000000000431c51 in _start ()
I’m running from a conda environment with gcc 11.2 and matching versions of the related libraries (e.g. libstdc++).
To reproduce
Compile onnxruntime and run the unit tests using CMake and the build.py args defined here, on a machine with an NVIDIA GPU and the driver spec given below:
(base) [root@3efd49791a0a /]# nvidia-smi
Mon Mar 4 22:58:06 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla M60 Off | 00000000:00:1E.0 Off | 328 |
| N/A 34C P0 39W / 150W | 121MiB / 7680MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Urgency
No response
Platform
Linux
OS Version
Amazon Linux 2023
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.17.0
ONNX Runtime API
C
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.8
About this issue
- State: closed
- Created 4 months ago
- Comments: 17 (6 by maintainers)
I saved part of it. build_log.zip
I managed to fix the error by swapping the libstdc++ in the conda environment with the system one. Both binaries were named libstdc++.so.6.0.29.
The conda libstdc++ is from the package
I’m also going to compile outside conda and see what happens.
Good suggestion. ldd should help. Additionally, the test binaries themselves should also be built with symbols.
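For anyone hitting the same thing, here is a minimal sketch of the ldd check suggested above. It only parses ldd's usual `name => path (address)` output to show which libstdc++ the dynamic loader actually resolves for a binary. The helper name `resolved_libs` and the example paths are hypothetical and not taken from this thread.

```python
import subprocess
import sys


def resolved_libs(ldd_output, needle="libstdc++"):
    """Return (soname, path) pairs from ldd output whose soname contains needle."""
    hits = []
    for line in ldd_output.splitlines():
        # Typical line: "\tlibstdc++.so.6 => /some/dir/libstdc++.so.6 (0x00007f...)"
        if "=>" not in line:
            continue  # entries such as linux-vdso have no resolved path
        soname, _, rest = line.partition("=>")
        soname = soname.strip()
        if needle not in soname:
            continue
        # Drop the trailing load address in parentheses, keep only the path.
        path = rest.strip().split(" (")[0].strip()
        hits.append((soname, path))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (script name and path are placeholders): python check_libs.py /path/to/test_binary
    out = subprocess.run(
        ["ldd", sys.argv[1]], capture_output=True, text=True, check=True
    ).stdout
    for soname, path in resolved_libs(out):
        print(f"{soname} -> {path}")
```

If the printed path points into the conda environment rather than the system library directory, the test binary is picking up conda's libstdc++, which is the copy that was swapped out in the fix described above.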