server: free() invalid pointer

Description When I shut down Triton Inference Server, it prints one line: free(): invalid pointer

Triton Information What version of Triton are you using? 21.12

Are you using the Triton container or did you build it yourself? Here’s the dockerfile:

FROM nvcr.io/nvidia/tritonserver:21.12-py3
LABEL maintainer="NVIDIA"
LABEL repository="tritonserver"

RUN apt-get update && apt-get -y install swig && apt-get -y install python3-dev && apt-get install -y cmake
RUN pip3 install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio==0.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
RUN pip3 install -v kaldifeat

Here’s the model.py.

import kaldifeat

class TritonPythonModel:

    def initialize(self, args):
        pass

    def execute(self, requests):
        pass

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is OPTIONAL. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')

config.pbtxt

name: "model"
backend: "python"
max_batch_size: 64

input [
  {
    name: "wav"
    data_type: TYPE_FP32
    dims: [-1]
  },
  {
    name: "wav_lens"
    data_type: TYPE_INT32
    dims: [1]
  }
]

output [
  {
    name: "speech"
    data_type: TYPE_FP16
    dims: [-1, 80]  # 80
  },
  {
    name: "speech_lengths"
    data_type: TYPE_INT32
    dims: [1]
  }
]

dynamic_batching {
  preferred_batch_size: [ 16, 32 ]
}

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

To Reproduce

  1. Build the Docker image from the Dockerfile above.
  2. Start Triton with a model repository that contains the model.py above.
  3. Shut down Triton with Ctrl-C.

Expected behavior No such line is printed on shutdown.

I tested on two different machines. Both print this message (error or warning, I am not sure which). One machine does not generate a core file; the other does.
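One possible reason the two machines differ is the core file size limit of the process, although this thread does not confirm that. The sketch below only reads the current RLIMIT_CORE values with Python's standard `resource` module; the helper names are made up for illustration, and kernel settings such as /proc/sys/kernel/core_pattern can also affect whether and where a core file appears.

```python
import resource


def core_dump_limit():
    """Return the (soft, hard) RLIMIT_CORE values for the current process."""
    return resource.getrlimit(resource.RLIMIT_CORE)


def describe_limit(value):
    """Render an rlimit value in a readable form."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes"


def soft_limit_allows_core_file():
    """A soft limit of 0 stops the kernel from writing a core file to disk.

    Assumption: core_pattern names a file. A piped core handler is not
    covered by this check.
    """
    soft, _ = core_dump_limit()
    return soft != 0


if __name__ == "__main__":
    soft, hard = core_dump_limit()
    print("soft:", describe_limit(soft), "hard:", describe_limit(hard))
    print("soft limit allows a core file:", soft_limit_allows_core_file())
```

Running this inside each container (or comparing `ulimit -c` in the shell that starts Triton) would show whether the limits differ between the machines.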

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 23 (12 by maintainers)

Most upvoted comments

The module kaldifeat has lots of leaks and invalid read/writes on import.

This can be verified using:

valgrind python3 -c "import kaldifeat; print(kaldifeat.__version__)"

However, we do not see the free() invalid pointer error in this case. Running Triton in valgrind with --trace-children=yes gives more details about the invalid free:

==16111== Invalid free() / delete / delete[] / realloc()
==16111==    at 0x483CFBF: operator delete(void*) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==16111==    by 0x13E724: pybind11::finalize_interpreter() (in /tmp/host/model_repo/test_model/triton_python_backend_stub)
==16111==    by 0x11C363: main (in /tmp/host/model_repo/test_model/triton_python_backend_stub)
==16111==  Address 0x44eedf48 is 24 bytes inside a block of size 65 alloc'd
==16111==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==16111==    by 0x5219378: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==16111==    by 0x521A271: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==16111==    by 0x521A327: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==16111==    by 0x521A5E1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==16111==    by 0x4B03EF77: ??? (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
==16111==    by 0x4B041147: ??? (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
==16111==    by 0x4B0301D8: ??? (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
==16111==    by 0x4B02941F: PyInit__kaldifeat (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
==16111==    by 0x4D7C095: _PyImport_LoadDynamicModuleWithSpec (in /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0)
==16111==    by 0x4D7E104: ??? (in /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0)
==16111==    by 0x4E34526: ??? (in /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0)
==16111== 

The trace demonstrates that the free() invalid pointer originates in the pybind11::finalize_interpreter() clean-up. The issue comes up when kaldifeat is imported inside a pybind11 embedded interpreter. A simple reproducer is described below:

main.cpp :

#include <pybind11/embed.h> // everything needed for embedding
#include <iostream>

namespace py = pybind11;

int main() {
    py::scoped_interpreter guard{}; // start the interpreter and keep it alive
    py::module_ kaldifeat = py::module_::import("kaldifeat");
    std::cerr << "Module Loaded" << std::endl;
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.17)
project(example)

include(FetchContent)

FetchContent_Declare(
  pybind11
  GIT_REPOSITORY "https://github.com/pybind/pybind11"
  GIT_TAG "v2.6"
  GIT_SHALLOW ON
)
FetchContent_MakeAvailable(pybind11)


add_executable(example main.cpp)
target_link_libraries(example PRIVATE pybind11::embed)

In the directory containing these files, run the following commands:

cmake .
make example
./example

Running the example produces the following output:

./example 
Module Loaded
free(): invalid pointer
Aborted (core dumped)
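To check for this abort from a script rather than by eye, a small wrapper can inspect how the process ended. This is a minimal sketch using only the Python standard library; `run_and_classify` is a hypothetical helper, and `./example` is assumed to be the binary built above.

```python
import signal
import subprocess
import sys


def run_and_classify(cmd):
    """Run cmd and describe how it ended.

    On POSIX, subprocess reports a negative return code when the child
    was terminated by a signal (for example -6 for SIGABRT).
    """
    proc = subprocess.run(cmd)
    if proc.returncode < 0:
        sig = signal.Signals(-proc.returncode)
        return f"killed by {sig.name}"
    return f"exited with code {proc.returncode}"


if __name__ == "__main__":
    # Default to the reproducer binary; any other command can be passed.
    cmd = sys.argv[1:] or ["./example"]
    print(run_and_classify(cmd))
```

The same wrapper can be pointed at a build of the reproducer that imports a different module, which makes it easier to compare modules one at a time.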

The valgrind backtrace for the invalid free in the example:

==16174== Invalid free() / delete / delete[] / realloc()
==16174==    at 0x483CFBF: operator delete(void*) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==16174==    by 0x129EA3: void __gnu_cxx::new_allocator<std::_Fwd_list_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::destroy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) (in /tmp/host/py_invalid_free/example)
==16174==    by 0x1251CC: void std::allocator_traits<std::allocator<std::_Fwd_list_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::destroy<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::allocator<std::_Fwd_list_node<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) (in /tmp/host/py_invalid_free/example)
==16174==    by 0x120B06: std::_Fwd_list_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_M_erase_after(std::_Fwd_list_node_base*, std::_Fwd_list_node_base*) (in /tmp/host/py_invalid_free/example)
==16174==    by 0x11CF41: std::_Fwd_list_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::~_Fwd_list_base() (in /tmp/host/py_invalid_free/example)
==16174==    by 0x11C9DB: std::forward_list<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::~forward_list() (in /tmp/host/py_invalid_free/example)
==16174==    by 0x11246C: pybind11::detail::internals::~internals() (in /tmp/host/py_invalid_free/example)
==16174==    by 0x11C05E: pybind11::finalize_interpreter() (in /tmp/host/py_invalid_free/example)
==16174==    by 0x11C14B: pybind11::scoped_interpreter::~scoped_interpreter() (in /tmp/host/py_invalid_free/example)
==16174==    by 0x10E5A5: main (in /tmp/host/py_invalid_free/example)
==16174==  Address 0x20f9cce8 is 24 bytes inside a block of size 65 alloc'd
==16174==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==16174==    by 0x4EA2378: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==16174==    by 0x4EA3271: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==16174==    by 0x4EA3327: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==16174==    by 0x4EA35E1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==16174==    by 0x47D5AF77: ??? (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
==16174==    by 0x47D5D147: ??? (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
==16174==    by 0x47D4C1D8: ??? (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
==16174==    by 0x47D4541F: PyInit__kaldifeat (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
==16174==    by 0x4A05095: _PyImport_LoadDynamicModuleWithSpec (in /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0)
==16174==    by 0x4A07104: ??? (in /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0)
==16174==    by 0x4ABD526: ??? (in /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0)

As you can see, the free() invalid pointer is raised even when running outside the Triton Python backend. In both cases, inside and outside Triton, it comes from pybind11::finalize_interpreter(). I also tried the latest pybind11, v2.9.0, and it gives the same issue.

Closing the issue because it is reproducible outside Triton and is shown to manifest when kaldifeat is imported within a pybind11 embedded interpreter.
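When comparing long valgrind logs like the two above, it can help to split each "Invalid free()" record into the stack that freed the block and the stack that allocated it. The following is a rough sketch, not a general valgrind parser: it assumes frames of the form `at/by 0x...: symbol (in /path/to/object)` as they appear in this thread, and the sample is an abbreviated copy of the Triton trace.

```python
import re

# Matches frames such as:
#   ==16111==    by 0x13E724: pybind11::finalize_interpreter() (in /path/stub)
# Frames that carry a source location instead of "(in ...)" are skipped.
FRAME = re.compile(r"^==\d+==\s+(?:at|by) 0x[0-9A-Fa-f]+: (.*) \(in (.+)\)$")


def split_invalid_free(record):
    """Return (free_stack, alloc_stack) as lists of (symbol, object_path)."""
    free_stack, alloc_stack = [], []
    current = free_stack
    for line in record.splitlines():
        if "alloc'd" in line:
            # Frames after the "Address ... alloc'd" line are the allocation.
            current = alloc_stack
            continue
        match = FRAME.match(line.strip())
        if match:
            current.append((match.group(1), match.group(2)))
    return free_stack, alloc_stack


def objects(stack):
    """Unique object paths in a stack, in first-seen order."""
    seen = []
    for _, path in stack:
        if path not in seen:
            seen.append(path)
    return seen


SAMPLE = """\
==16111== Invalid free() / delete / delete[] / realloc()
==16111==    at 0x483CFBF: operator delete(void*) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==16111==    by 0x13E724: pybind11::finalize_interpreter() (in /tmp/host/model_repo/test_model/triton_python_backend_stub)
==16111==    by 0x11C363: main (in /tmp/host/model_repo/test_model/triton_python_backend_stub)
==16111==  Address 0x44eedf48 is 24 bytes inside a block of size 65 alloc'd
==16111==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==16111==    by 0x5219378: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28)
==16111==    by 0x4B03EF77: ??? (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
==16111==    by 0x4B02941F: PyInit__kaldifeat (in /usr/local/lib/python3.8/dist-packages/_kaldifeat.cpython-38-x86_64-linux-gnu.so)
"""

if __name__ == "__main__":
    free_stack, alloc_stack = split_invalid_free(SAMPLE)
    print("freed in:")
    for symbol, _ in free_stack:
        print("  ", symbol)
    print("allocated via objects:")
    for path in objects(alloc_stack):
        print("  ", path)
```

For the sample, the free side lists the pybind11 clean-up in the stub binary, while the allocation side ends in the kaldifeat extension module, which matches the description above.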

@csukuangfj Thank you!

Did you see the free() invalid pointer issue when using the code below?

py::module_ kaldifeat = py::module_::import("torch");

It’s quite strange: when we import torch in Triton, we don’t see this issue, but when we import kaldifeat, it occurs.

Yes, it is reproducible.


Hello, I would like to ask a question. I am using the Triton 22.04-py3 Docker image. With the python backend, the same free() error also occurs when the model is unloaded. Is it because of kaldifeat?

I just created a GitHub repo to reproduce the core dump issue by changing import kaldifeat to import torch. Please see https://github.com/csukuangfj/memory-leak-example

You can see the output from GitHub actions at https://github.com/csukuangfj/memory-leak-example/runs/5179267107?check_suite_focus=true

A screenshot of the output was attached to the original comment.

kaldifeat uses the PyTorch C++ API, and it is the responsibility of PyTorch to manage the memory.


[edited]: So memory issues with kaldifeat should be reproducible by replacing kaldifeat with torch.