onnxruntime: Can't load CUDA provider on Linux due to symbol lookup error

Describe the bug

I am trying to load the ONNX Runtime library with the CUDA provider in a C# application, but I get a symbol lookup error:

dotnet: symbol lookup error: /home/egortech/TestOnnx/net5.0/runtimes/linux-x64/native/libonnxruntime_providers_cuda.so: undefined symbol: Provider_GetHost

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04.3 LTS
  • ONNX Runtime installed from (source or binary): binary but trying built from source as well
  • ONNX Runtime version: 1.10
  • Python version: Not Applicable
  • Visual Studio version (if applicable): 2019
  • GCC/Compiler version (if compiling from source): 9.3.0
  • CUDA/cuDNN version: 11.4/8.2.0.6
  • GPU model and memory: NVidia GeForce 2070 8GB

To Reproduce

I am trying to preload the ONNX Runtime libraries in C# with this code:

NativeLibrary.Load(
	"onnxruntime.so",
	Assembly.GetEntryAssembly(),
	DllImportSearchPath.UseDllDirectoryForDependencies);

NativeLibrary.Load(
	"onnxruntime_providers_shared.so",
	Assembly.GetEntryAssembly(),
	DllImportSearchPath.UseDllDirectoryForDependencies);

NativeLibrary.Load(
	"onnxruntime_providers_cuda.so",
	Assembly.GetEntryAssembly(),
	DllImportSearchPath.UseDllDirectoryForDependencies);

Additional context

I inspected the ELF binary and saw that the Provider_GetHost symbol is undefined and not associated with any library:

libonnxruntime_providers_cuda.so:     file format elf64-x86-64

DYNAMIC SYMBOL TABLE:
...
0000000000000000      DF *UND*  0000000000000000  libcudnn.so.8 cudnnCreate
0000000000000000      DF *UND*  0000000000000000  GLIBCXX_3.4.21 _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE6insertEmPKc
0000000000000000      DF *UND*  0000000000000000  GLIBCXX_3.4.15 _ZNSt8__detail15_List_node_base11_M_transferEPS0_S1_
0000000000000000      D  *UND*  0000000000000000              Provider_GetHost
0000000000000000      DF *UND*  0000000000000000  libcudnn.so.8 cudnnFindConvolutionBackwardDataAlgorithmEx
0000000000000000      DF *UND*  0000000000000000  libcudart.so.11.0 cudaGetDeviceProperties
0000000000000000      DF *UND*  0000000000000000  libcudnn.so.8 cudnnSetActivationDescriptor
...

Patching the ELF to add a dependency on onnxruntime_providers_shared.so (where I think this function is defined) only gave me a segmentation fault:

   1246229:     symbol=Provider_GetHost;  lookup in file=dotnet [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/libpthread.so.0 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/libdl.so.2 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/libstdc++.so.6 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/libm.so.6 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/libgcc_s.so.1 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/libc.so.6 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/liblttng-ust-tracepoint.so.0 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/liburcu-bp.so.6 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/liburcu-cds.so.6 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/liburcu-common.so.6 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/usr/lib64/dotnet/shared/Microsoft.NETCore.App/5.0.9/libcoreclrtraceptprovider.so [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/liblttng-ust.so.0 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/lib64/librt.so.1 [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/home/egortech/TestOnnx/net5.0/runtimes/linux-x64/native/libonnxruntime_providers_cuda.so [0]
   1246229:     symbol=Provider_GetHost;  lookup in file=/home/egortech/TestOnnx/net5.0/runtimes/linux-x64/native/libonnxruntime_providers_shared.so [0]
   1246229:     binding file /home/egortech/TestOnnx/net5.0/runtimes/linux-x64/native/libonnxruntime_providers_cuda.so [0] to /home/egortech/TestOnnx/net5.0/runtimes/linux-x64/native/libonnxruntime_providers_shared.so [0]: normal symbol `Provider_GetHost'
Segmentation fault (core dumped)

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 16 (5 by maintainers)

Most upvoted comments

Going by the Stack Overflow link, the point might actually be how RPATH propagation works in the context of a compiled extension: RPATH propagation only works when RPATH is set on the executable. For Python extensions, the RPATH is only set in the extension itself (Python executables generally come without any RPATH or RUNPATH set, at least in all distributions I know). This is true for both the onnxruntime Python bindings and my own. The extension is a library, not an executable, so its RPATH will only be used to dlopen its direct dependencies. For the CUDA libraries to be loaded, one thus needs to set the RPATH of onnxruntime_providers_shared (I believe; I didn’t check the dependency tree).

I will try to verify whether this is actually true in practice. I just need to find the time, and I will try to prepare a PR if it works.

Thanks @RyanUnderhill, that helped me understand and resolve the underlying issue. Maybe there is a way to highlight that as a solution.

@andrea-cimatoribus-pix4d, the problem is that I was loading the library libonnxruntime_providers_cuda.so directly, and it complained about the unresolved symbol “Provider_GetHost”. Provider_GetHost is defined in libonnxruntime_providers_shared.so. However, libonnxruntime_providers_cuda.so doesn’t have a dependency on libonnxruntime_providers_shared.so, which could be a bug if loading this library directly is an intended workflow. The intended workflow, as suggested, is to load libonnxruntime.so and append the CUDA provider options, so that libonnxruntime.so loads everything required, including libonnxruntime_providers_cuda.so and libonnxruntime_providers_shared.so.
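
For readers landing here, the intended workflow described in this comment might look roughly like the following in C#. This is a minimal sketch, not code from the thread: it assumes the Microsoft.ML.OnnxRuntime.Gpu NuGet package is referenced, that CUDA and cuDNN can be found by the dynamic loader, and that `model.onnx` is a hypothetical placeholder for a real model file. No provider library is loaded by hand.

```csharp
using System;
using Microsoft.ML.OnnxRuntime;

class Program
{
    static void Main(string[] args)
    {
        // Hypothetical model path (an assumption for this sketch);
        // pass a real ONNX model as the first argument.
        string modelPath = args.Length > 0 ? args[0] : "model.onnx";

        // Ask ONNX Runtime to append the CUDA execution provider for GPU 0.
        // No NativeLibrary.Load calls are made here: per the comments in
        // this thread, ONNX Runtime loads the provider libraries itself.
        using SessionOptions options = SessionOptions.MakeSessionOptionWithCudaProvider(0);

        // Creating the session is where the model is loaded with the
        // configured execution providers.
        using var session = new InferenceSession(modelPath, options);

        Console.WriteLine($"Session created with {session.InputMetadata.Count} input(s).");
    }
}
```

If session creation fails at this point, the error comes from ONNX Runtime itself, which is usually easier to act on than a raw symbol lookup error from the dynamic linker.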

@parajav See my reply above: https://github.com/microsoft/onnxruntime/issues/9309#issuecomment-940606615

You shouldn’t be loading libonnxruntime_providers_cuda.so. Onnxruntime loads this as described above.

Hmm, thank you for debugging further. The process to load it on Linux goes like this:

onnxruntime.so dynamically loads (dlopen) onnxruntime_providers_shared.so with RTLD_GLOBAL. Then onnxruntime.so dynamically loads onnxruntime_providers_cuda.so (with RTLD_LOCAL). (On Windows there’s no global/local stuff; this is Linux specific.)

The RTLD_GLOBAL should make it see Provider_GetHost from onnxruntime_providers_cuda.so. Can you tell if it’s getting preloaded somehow with the wrong setting?
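
To make the two flags concrete, here is a small C# sketch that calls dlopen through P/Invoke with the same visibility settings as the load order described above. It is an illustration only, not ONNX Runtime's code and not a recommended way to use the library: the `Dl` helper class is hypothetical, the use of RTLD_NOW is an assumption, the flag values are the glibc ones, and `libdl.so.2` is assumed to be present. `NativeLibrary.Load`, as used in the report, takes no dlopen flags, which is why this goes through dlopen directly. Note that the report above shows a segmentation fault even after the symbol binds, so loading the CUDA provider outside ONNX Runtime may still crash.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical helper for this sketch; not part of any ONNX Runtime API.
static class Dl
{
    // Flag values from glibc's <dlfcn.h> on Linux.
    public const int RTLD_NOW = 0x002;
    public const int RTLD_LOCAL = 0x000;
    public const int RTLD_GLOBAL = 0x100;

    [DllImport("libdl.so.2")]
    private static extern IntPtr dlopen(string fileName, int flags);

    [DllImport("libdl.so.2")]
    private static extern IntPtr dlerror();

    // Opens a shared library with the given flags and throws on failure.
    public static IntPtr Open(string path, int flags)
    {
        IntPtr handle = dlopen(path, flags);
        if (handle == IntPtr.Zero)
        {
            string message = Marshal.PtrToStringAnsi(dlerror()) ?? "unknown dlopen error";
            throw new DllNotFoundException($"{path}: {message}");
        }
        return handle;
    }
}

class Program
{
    static void Main(string[] args)
    {
        // Directory holding the native libraries (assumption: first argument).
        string dir = args.Length > 0 ? args[0] : ".";

        // RTLD_GLOBAL: symbols such as Provider_GetHost become visible to
        // libraries that are loaded afterwards.
        Dl.Open(Path.Combine(dir, "libonnxruntime_providers_shared.so"),
                Dl.RTLD_NOW | Dl.RTLD_GLOBAL);

        // RTLD_LOCAL: the provider's own symbols stay private to it.
        Dl.Open(Path.Combine(dir, "libonnxruntime_providers_cuda.so"),
                Dl.RTLD_NOW | Dl.RTLD_LOCAL);

        Console.WriteLine("Both libraries loaded.");
    }
}
```

If the first library were opened without RTLD_GLOBAL, the second load would be expected to fail with the same undefined symbol error as in the report, since nothing in libonnxruntime_providers_cuda.so names libonnxruntime_providers_shared.so as a dependency.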