onnxruntime: Help finding a weird memory leak.
The situation is a little complicated. I'm using a detection model to detect objects in a video stream. The program is written in Java, and onnxruntime is used for model inference through a simple JNI wrapper that I wrote myself on top of onnxruntime's C++ API.
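For context, the wrapper is roughly shaped like the sketch below. This is a simplified reconstruction, not the real code: the class name `Detector`, the input/output names, and the input shape are placeholders.

```cpp
// Hypothetical JNI wrapper sketch; names, shapes, and model path are illustrative.
#include <jni.h>
#include <array>
#include <vector>
#include <onnxruntime_cxx_api.h>

// One environment and session for the lifetime of the process.
static Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "detector"};
static Ort::Session* session = nullptr;

extern "C" JNIEXPORT void JNICALL
Java_Detector_init(JNIEnv*, jclass) {
    Ort::SessionOptions opts;
    OrtCUDAProviderOptions cuda_opts{};          // CUDA EP on device 0
    opts.AppendExecutionProvider_CUDA(cuda_opts);
    session = new Ort::Session(env, "model.onnx", opts);
}

extern "C" JNIEXPORT jfloatArray JNICALL
Java_Detector_infer(JNIEnv* jni, jclass, jfloatArray data) {
    jsize len = jni->GetArrayLength(data);
    std::vector<float> input(len);
    jni->GetFloatArrayRegion(data, 0, len, input.data());

    // CPU-resident input; ORT copies it to the GPU internally.
    Ort::MemoryInfo mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    std::array<int64_t, 4> shape{1, 3, 416, 416};  // illustrative NCHW shape
    Ort::Value tensor = Ort::Value::CreateTensor<float>(
        mem, input.data(), input.size(), shape.data(), shape.size());

    const char* in_names[] = {"input"};            // placeholder I/O names
    const char* out_names[] = {"output"};
    auto outputs = session->Run(Ort::RunOptions{nullptr},
                                in_names, &tensor, 1, out_names, 1);

    // Copy the output tensor back into a Java float[].
    float* out = outputs[0].GetTensorMutableData<float>();
    size_t out_len = outputs[0].GetTensorTypeAndShapeInfo().GetElementCount();
    jfloatArray result = jni->NewFloatArray(static_cast<jsize>(out_len));
    jni->SetFloatArrayRegion(result, 0, static_cast<jsize>(out_len), out);
    return result;
}
```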
The program only leaks memory when running on a server with T4 cards; no significant leak shows up on a server with 1080 cards. The program runs inside Docker, so the binaries are exactly the same on both machines, and the same driver version is installed.
When running with T4 cards, the program leaks fast, about 100 MB per second as reported by top.
onnxruntime is built from source with the following configuration command:

```bash
python3 ./tools/ci_build/build.py \
    --build_dir ${CMAKE_BINARY_DIR}/3rdparty_onnxruntime-prefix/build \
    --config RelWithDebInfo --skip_submodule_sync --use_cuda \
    --cudnn_home /usr/local/cuda --cuda_home /usr/local/cuda \
    --skip_tests --build_shared_lib --parallel --cmake_path ${CMAKE_COMMAND} \
    --cmake_extra_defines CMAKE_C_COMPILER=${CMAKE_C_COMPILER} \
        CMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER} \
        CMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
        CMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}/3rdparty_prefix
```
I had no luck reproducing the leak by stripping the program down to minimal code. valgrind also showed no significant leaks (nothing close to 100 MB/s) in its log file. Note that the program may take a different execution path under valgrind because of valgrind's heavy slowdown.
Does onnxruntime take a different execution path on T4 cards? Any suggestion is welcome.
About this issue
- State: closed
- Created 3 years ago
- Comments: 17 (10 by maintainers)
Same question in ort-gpu==1.10, same T4 GPU: #10095
I do not have a T4 handy. Try taking `memory_info` and `input` out of the loop and see what changes. Also, can you try running the same code on CPU, please?
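Something like the sketch below (untested; the input/output names and the shape are placeholders):

```cpp
// Minimal sketch of the suggested change: create the MemoryInfo and the input
// Ort::Value once, outside the per-frame loop, instead of rebuilding them on
// every iteration. Names and dimensions here are hypothetical.
#include <array>
#include <vector>
#include <onnxruntime_cxx_api.h>

void run_stream(Ort::Session& session, size_t frames) {
    std::array<int64_t, 4> shape{1, 3, 416, 416};   // illustrative shape
    std::vector<float> buffer(1 * 3 * 416 * 416);   // reused frame buffer

    // Hoisted out of the loop: previously recreated per frame.
    Ort::MemoryInfo memory_info =
        Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value input = Ort::Value::CreateTensor<float>(
        memory_info, buffer.data(), buffer.size(), shape.data(), shape.size());

    const char* in_names[] = {"input"};             // placeholder names
    const char* out_names[] = {"output"};

    for (size_t i = 0; i < frames; ++i) {
        // ... fill `buffer` with the next decoded frame ...
        auto outputs = session.Run(Ort::RunOptions{nullptr},
                                   in_names, &input, 1, out_names, 1);
        // `outputs` is destroyed at the end of each iteration, releasing
        // the output tensors; only the hoisted objects persist.
    }
}
```

If the resident set stays flat with this change, the per-iteration tensor/MemoryInfo churn is the likely culprit; if it still grows only on T4, that points at the execution provider rather than the calling code.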