dali_backend: Segfault when max_batch_size > 1

Hi everybody,

I am facing issues when enabling the dynamic batching scheduler with a max_batch_size greater than 1: submitting requests results in a segfault. The main README says that DALI requires homogeneous batch sizes. How would I achieve that when using the Triton C API directly? In the tests introduced with the PR that enabled dynamic batching, I can’t find anything enforcing homogeneous batch sizes. Am I missing something?

We are using the C API of the Triton r21.06 release with a DALI pipeline that is created with a batch size of 64, while max_batch_size in the Triton config.pbtxt is set to 32 for all elements of the ensemble model.
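For concreteness, the two settings side by side (num_threads and device_id here are illustrative, and the pipeline body is omitted):

# DALI pipeline, created with a batch size of 64
@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def pipeline():
    ...

# config.pbtxt of every element of the ensemble
max_batch_size: 32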

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

@MaxHuerlimann ,

we’ve narrowed down the issue and fixed it. Here’s the PR: https://github.com/NVIDIA/DALI/pull/4043

The change will be released in Triton 22.08.

@MaxHuerlimann ,

that’s a challenging one to debug, but I’m working on it right now. Hopefully I’ll have some conclusions in a day or two 😃

I have used the perf_analyzer tool with this data: repro_data.zip, sending each request with a batch size of 1 and testing different concurrency values. Which value doesn’t really matter, as the crash happens every time.
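For reference, the invocation was along these lines (model name as in the config further down; the concurrency range is just an example):

perf_analyzer -m dali_test -b 1 --input-data repro_data --concurrency-range 1:8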

I can check whether I can reproduce the issue with your repro client and will get back to you.

Hello again!

I have come back to this issue now, as we are experimenting with the Docker deployment of Triton (22.05), and we are still facing it. I have managed to pinpoint it to the crop operator: if I try to feed it a batch of crop windows (we are detecting objects in an image and want to crop them on a per-image basis), the Triton process crashes with

Signal (11) received.
 0# 0x0000558BBBD771B9 in tritonserver
 1# 0x00007F886FFD80C0 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# float dali::OpSpec::GetArgumentImpl<float, float>(std::string const&, dali::ArgumentWorkspace const*, long) const in /opt/tritonserver/backends/dali/dali/libdali_operators.so
 3# 0x00007F86D2B4826E in /opt/tritonserver/backends/dali/dali/libdali_operators.so
 4# 0x00007F86D25D1F76 in /opt/tritonserver/backends/dali/dali/libdali_operators.so
 5# 0x00007F86D2597B12 in /opt/tritonserver/backends/dali/dali/libdali_operators.so
 6# void dali::Executor<dali::AOT_WS_Policy<dali::UniformQueuePolicy>, dali::UniformQueuePolicy>::RunHelper<dali::DeviceWorkspace>(dali::OpNode&, dali::DeviceWorkspace&) in /opt/tritonserver/backends/dali/dali/libdali.so
 7# dali::Executor<dali::AOT_WS_Policy<dali::UniformQueuePolicy>, dali::UniformQueuePolicy>::RunGPUImpl() in /opt/tritonserver/backends/dali/dali/libdali.so
 8# dali::Executor<dali::AOT_WS_Policy<dali::UniformQueuePolicy>, dali::UniformQueuePolicy>::RunGPU() in /opt/tritonserver/backends/dali/dali/libdali.so
 9# 0x00007F884537E228 in /opt/tritonserver/backends/dali/dali/libdali.so
10# 0x00007F88453F78BC in /opt/tritonserver/backends/dali/dali/libdali.so
11# 0x00007F88459DAB6F in /opt/tritonserver/backends/dali/dali/libdali.so
12# 0x00007F88715D7609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
13# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

Is there a recommended way to feed a batch of crop windows with which to crop a batch of images?

A minimal example that should reproduce it:

import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def


@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def pipeline():
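    # external inputs; the Triton dali backend feeds these from the request tensors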
    images = fn.external_source(device="cpu", name="IMAGE")
    crop_x = fn.external_source(device="cpu", name="CROP_X")
    crop_y = fn.external_source(device="cpu", name="CROP_Y")
    crop_width = fn.external_source(device="cpu", name="CROP_WIDTH")
    crop_height = fn.external_source(device="cpu", name="CROP_HEIGHT")

    images = fn.decoders.image(images, device="mixed")
    images = fn.crop(
        images,
        crop_pos_x=crop_x,
        crop_pos_y=crop_y,
        crop_w=crop_width,
        crop_h=crop_height
    )
    images = fn.resize(
        images,
        resize_x=288,
        resize_y=384,
        mode="not_larger",
    )
    images = fn.pad(images, fill_value=128, axes=(0, 1), shape=(384, 288))
    return images


def main():
    pipeline().serialize(filename='1/model.dali')


if __name__ == "__main__":
    main()
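The pipeline can also be driven locally, outside Triton, via feed_input. A minimal sketch, reusing the pipeline function above and assuming any JPEG saved as sample.jpg (the path and the crop values are arbitrary):

import numpy as np


def run_locally():
    pipe = pipeline()
    pipe.build()
    batch = 32
    # encoded bytes of a JPEG; "sample.jpg" is a placeholder path
    encoded = np.fromfile("sample.jpg", dtype=np.uint8)
    pipe.feed_input("IMAGE", [encoded] * batch)
    # one scalar per sample: positions are normalized, sizes are in pixels
    pipe.feed_input("CROP_X", [np.array(0.25, dtype=np.float32)] * batch)
    pipe.feed_input("CROP_Y", [np.array(0.25, dtype=np.float32)] * batch)
    pipe.feed_input("CROP_WIDTH", [np.array(100.0, dtype=np.float32)] * batch)
    pipe.feed_input("CROP_HEIGHT", [np.array(100.0, dtype=np.float32)] * batch)
    (images,) = pipe.run()
    print(images.as_cpu().at(0).shape)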

The corresponding config.pbtxt:

name: "dali_test"
backend: "dali"
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 32 ]
  max_queue_delay_microseconds: 500
}
instance_group [
        {
                count: 1
                kind: KIND_GPU
        }
]
input [
        {
                name: "IMAGE"
                data_type: TYPE_UINT8
                dims: [ -1 ]
                allow_ragged_batch: true
        },
        {
                name: "CROP_X"
                data_type: TYPE_FP32
                dims: [ 1 ]
        },
        {
                name: "CROP_Y"
                data_type: TYPE_FP32
                dims: [ 1 ]
        },
        {
                name: "CROP_WIDTH"
                data_type: TYPE_FP32
                dims: [ 1 ]
        },
        {
                name: "CROP_HEIGHT"
                data_type: TYPE_FP32
                dims: [ 1 ]
        }
]
output [
        {
                name: "PREPROCESSED_IMAGE"
                data_type: TYPE_FP32
                dims: [ 3, 384, 288 ]
        }
]
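For completeness, each request against this model carries one crop window per image, which the dynamic batcher then assembles into the batch that crashes. A sketch of a single request with the Python gRPC client (endpoint, image path, and crop values are illustrative):

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")

# one encoded image; the leading dim is the per-request batch (1)
img = np.fromfile("sample.jpg", dtype=np.uint8)[None, :]
image_in = grpcclient.InferInput("IMAGE", list(img.shape), "UINT8")
image_in.set_data_from_numpy(img)

inputs = [image_in]
# one [1, 1] FP32 tensor per crop parameter, matching dims: [ 1 ] above
for name, val in [("CROP_X", 0.25), ("CROP_Y", 0.25),
                  ("CROP_WIDTH", 100.0), ("CROP_HEIGHT", 100.0)]:
    arr = np.array([[val]], dtype=np.float32)
    inp = grpcclient.InferInput(name, list(arr.shape), "FP32")
    inp.set_data_from_numpy(arr)
    inputs.append(inp)

result = client.infer("dali_test", inputs)
preprocessed = result.as_numpy("PREPROCESSED_IMAGE")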

I will close this for now, as I don’t have the capacity to reproduce this with extra code (as can be seen from the long inactivity), and the inference latency does not seem to be drastically impacted. I will reopen this once I can tackle the issue again.