onnxruntime: Can't load CUDA provider on Linux due to symbol lookup error

Describe the bug

I am trying to load the ONNX Runtime library with the CUDA provider in a C# application, but I get a symbol lookup error:

dotnet: symbol lookup error: /home/egortech/TestOnnx/net5.0/runtimes/linux-x64/native/libonnxruntime_providers_cuda.so: undefined symbol: Provider_GetHost

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04.3 LTS
ONNX Runtime installed from (source or binary): binary (also tried building from source)
ONNX Runtime version: 1.10
Python version: Not Applicable
Visual Studio version (if applicable): 2019
GCC/Compiler version (if compiling from source): 9.3.0
CUDA/cuDNN version: 11.4/8.2.0.6
GPU model and memory: NVIDIA GeForce RTX 2070, 8 GB

To Reproduce

I am trying to preload the ONNX Runtime native libraries in C# with this code:

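using System.Reflection;
using System.Runtime.InteropServices;

// Core runtime library.
NativeLibrary.Load(
	"onnxruntime.so",
	Assembly.GetEntryAssembly(),
	DllImportSearchPath.UseDllDirectoryForDependencies);

// Shared provider library; Provider_GetHost is defined here.
NativeLibrary.Load(
	"onnxruntime_providers_shared.so",
	Assembly.GetEntryAssembly(),
	DllImportSearchPath.UseDllDirectoryForDependencies);

// CUDA provider library; it references Provider_GetHost but does not
// declare a dependency on the shared provider library above.
NativeLibrary.Load(
	"onnxruntime_providers_cuda.so",
	Assembly.GetEntryAssembly(),
	DllImportSearchPath.UseDllDirectoryForDependencies);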

Additional context

I inspected the ELF binary's dynamic symbol table and saw that Provider_GetHost is undefined (*UND*) and, unlike the neighboring symbols, is not associated with any library:

libonnxruntime_providers_cuda.so:     file format elf64-x86-64

DYNAMIC SYMBOL TABLE:
...
0000000000000000      DF *UND*  0000000000000000  libcudnn.so.8 cudnnCreate
0000000000000000      DF *UND*  0000000000000000  GLIBCXX_3.4.21 _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE6insertEmPKc
0000000000000000      DF *UND*  0000000000000000  GLIBCXX_3.4.15 _ZNSt8__detail15_List_node_base11_M_transferEPS0_S1_
0000000000000000      D  *UND*  0000000000000000              Provider_GetHost
0000000000000000      DF *UND*  0000000000000000  libcudnn.so.8 cudnnFindConvolutionBackwardDataAlgorithmEx
0000000000000000      DF *UND*  0000000000000000  libcudart.so.11.0 cudaGetDeviceProperties
0000000000000000      DF *UND*  0000000000000000  libcudnn.so.8 cudnnSetActivationDescriptor
...

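Patching the ELF to add a dependency on onnxruntime_providers_shared.so (where I believe this function is defined) only resulted in a segmentation fault. The dynamic linker's debug output shows the symbol being bound, followed by the crash: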

   1246229:     symbol=Provider_GetHost;  lookup in file=dotnet [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/libpthread.so.0 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/libdl.so.2 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/libstdc++.so.6 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/libm.so.6 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/libgcc_s.so.1 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/libc.so.6 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/liblttng-ust-tracepoint.so.0 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/liburcu-bp.so.6 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/liburcu-cds.so.6 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/liburcu-common.so.6 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/usr/lib64/dotnet/shared/Microsoft.NETCore.App/5.0.9/libcoreclrtraceptprovider.so [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/liblttng-ust.so.0 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/librt.so.1 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/home/egortech/TestOnnx/net5.0/runtimes/linux-x64/native/libonnxruntime_providers_cuda.so [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/home/egortech/TestOnnx/net5.0/runtimes/linux-x64/native/libonnxruntime_providers_shared.so [0]
   1246229:     binding file /home/egortech/TestOnnx/net5.0/runtimes/linux-x64/native/libonnxruntime_providers_cuda.so [0] to /home/egortech/TestOnnx/net5.0/runtimes/linux-x64/native/libonnxruntime_providers_shared.so [0]: normal symbol `Provider_GetHost'
Segmentation fault (core dumped)

About this issue

State: open
Created 3 years ago
Comments: 16 (5 by maintainers)

Most upvoted comments

As the Stack Overflow link suggests, the issue might actually lie in how RPATH propagation works for a compiled extension: RPATH propagation only works when RPATH is set on the executable. For Python extensions, the RPATH is set only in the extension itself (Python executables generally ship without any RPATH or RUNPATH, at least in all distributions I know). This is true for both the onnxruntime Python bindings and my own. The extension is a library, not an executable, so its RPATH is used only to dlopen its direct dependencies. For the CUDA libraries to be loaded, one would therefore need to set the RPATH of onnxruntime_providers_shared (I believe; I didn't check the dependency tree).

I will try to verify whether this holds in practice when I find the time, and will prepare a PR if it works.

andrea-cimatoribus-pix4d on Sep 25, 2023

Thanks @RyanUnderhill, that helped me understand and resolve the underlying issue. Maybe there is a way to highlight that as the solution.

@andrea-cimatoribus-pix4d, the problem was that I was loading libonnxruntime_providers_cuda.so directly, and it failed with the unresolved symbol “Provider_GetHost”. Provider_GetHost is defined in libonnxruntime_providers_shared.so, but libonnxruntime_providers_cuda.so does not declare a dependency on libonnxruntime_providers_shared.so. That could be considered a bug if loading this library directly were an intended workflow. However, the intended workflow, as suggested, is to load libonnxruntime.so and append the CUDA provider options, so that libonnxruntime.so loads everything it needs, including libonnxruntime_providers_cuda.so and libonnxruntime_providers_shared.so.

parajav on Aug 30, 2023
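For reference, the intended workflow described above (let ONNX Runtime load the provider libraries itself instead of preloading them) looks roughly like this in C#. This is a sketch assuming the Microsoft.ML.OnnxRuntime.Gpu NuGet package; `CudaSessionFactory` is a hypothetical helper name used only for illustration.

```csharp
using Microsoft.ML.OnnxRuntime;

// Hypothetical helper: builds a session that runs on the CUDA execution
// provider. Assumes the Microsoft.ML.OnnxRuntime.Gpu NuGet package, whose
// native libraries ONNX Runtime locates and loads on its own.
static class CudaSessionFactory
{
    public static InferenceSession Create(string modelPath, int deviceId = 0)
    {
        // The native session copies these options, so they can be
        // disposed once the session has been created.
        using var options = new SessionOptions();

        // Registering the CUDA provider is what makes onnxruntime load
        // libonnxruntime_providers_shared.so and libonnxruntime_providers_cuda.so;
        // no NativeLibrary.Load calls are needed for them.
        options.AppendExecutionProvider_CUDA(deviceId);

        return new InferenceSession(modelPath, options);
    }
}
```

Usage: `using var session = CudaSessionFactory.Create("model.onnx");`, where "model.onnx" is a placeholder path. Whether the provider loads still depends on the CUDA and cuDNN libraries being findable by the dynamic linker.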

@parajav See my reply above: https://github.com/microsoft/onnxruntime/issues/9309#issuecomment-940606615

You shouldn’t be loading libonnxruntime_providers_cuda.so yourself; ONNX Runtime loads it as described above.

RyanUnderhill on Aug 29, 2023

Hmm, thank you for debugging further. The process to load it on Linux goes like this:

1. onnxruntime.so dynamically loads (dlopen) onnxruntime_providers_shared.so with RTLD_GLOBAL.
2. onnxruntime.so then dynamically loads onnxruntime_providers_cuda.so with RTLD_LOCAL.

(On Windows there is no global/local distinction; this is Linux-specific.)

Loading with RTLD_GLOBAL should make Provider_GetHost visible to onnxruntime_providers_cuda.so. Can you tell whether it's somehow being preloaded with the wrong setting?

RyanUnderhill on Oct 12, 2021
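RyanUnderhill's question (is the shared library being preloaded with the wrong setting?) can be checked from inside the process. The sketch below assumes Linux with glibc and calls dlopen/dlsym from libdl.so.2 via P/Invoke; `GlobalScopeProbe` is a hypothetical helper, and the flag constants are glibc's values. It only reports whether a symbol is visible in the global lookup scope, which is the only place the dynamic linker can find Provider_GetHost for libonnxruntime_providers_cuda.so, since that library does not declare a dependency on libonnxruntime_providers_shared.so.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical diagnostic helper. Assumes Linux with glibc; the constants
// below are glibc's values for the dlopen flags and RTLD_DEFAULT.
static class GlobalScopeProbe
{
    const int RTLD_NOW = 0x00002;
    const int RTLD_GLOBAL = 0x00100;
    static readonly IntPtr RTLD_DEFAULT = IntPtr.Zero;

    [DllImport("libdl.so.2")]
    static extern IntPtr dlopen(string file, int mode);

    [DllImport("libdl.so.2")]
    static extern IntPtr dlsym(IntPtr handle, string symbol);

    [DllImport("libdl.so.2")]
    static extern IntPtr dlerror();

    // Loads a library with RTLD_GLOBAL, the mode described above for
    // libonnxruntime_providers_shared.so.
    public static IntPtr LoadGlobal(string path)
    {
        IntPtr handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
        if (handle == IntPtr.Zero)
        {
            throw new DllNotFoundException(Marshal.PtrToStringAnsi(dlerror()));
        }
        return handle;
    }

    // Returns true if the symbol is found in the global scope (the executable,
    // its dependencies, and libraries loaded with RTLD_GLOBAL).
    public static bool IsGloballyVisible(string symbol)
    {
        dlerror(); // clear any previous error state
        return dlsym(RTLD_DEFAULT, symbol) != IntPtr.Zero;
    }
}
```

For example, if `GlobalScopeProbe.IsGloballyVisible("Provider_GetHost")` returns false after your own preloading code, libonnxruntime_providers_shared.so is not in the global scope. This is a diagnostic sketch, not a supported way to load providers; the maintainers' advice is still to let onnxruntime load them.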