TensorRT: copyPackedKernel is taking too long; how can it be optimized?

Description

I have a model that uses a slice operator for feature crossing. Profiling shows that the slice operator launches the copyPackedKernel kernel, which accounts for almost all of the GPU kernel time. I also re-implemented the slice operator myself, but got the same result. I don’t know when copyPackedKernel runs or how to optimize it.

nsys profile -o test --stats=true python infer.py -e test.plan

output:

Time(%)      Total Time   Instances         Average         Minimum         Maximum  Name
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------------------------------------------
   99.9         9863089        2835          3479.0            3423            3872  void genericReformat::copyPackedKernel<float, float, true, true, genericReformat::IdentityCoordMapper<4>, 4>(unsigned int, unsigned int, void const*, genericReformat::ArrayN<4>, genericReformat::ArrayNWithReducedDivisors<4>, genericReformat::ArrayN<4>, int, int, int, float const*, void*, genericReformat::ArrayN<4>, genericReformat:
    0.1            7264           3          2421.3            2304            2656  slice(float const*, float*, int, int, int, int)

Environment

TensorRT Version: 7.2.2.1
NVIDIA GPU: T4
NVIDIA Driver Version: 450.51.06
CUDA Version: 11.1
CUDNN Version:
Operating System:
Python Version (if applicable): 3.8
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version): Container 20.12

I need help, thank you very much.

About this issue

State: open
Created 3 years ago
Comments: 36

Most upvoted comments

Hello @zhaohb, I just checked: those kernels are not really reformats, they are the slice implementation. Slice is memory-bound, so to optimize this we need to consider other approaches. I have two questions:

What is the input of the slice in your real network? Is it constant? If so, we can precompute the slices offline and replace them with constants.
What consumes the output of the slice in your real network? If it is a plugin, can we adjust the plugin implementation, for example by calculating the address/offset inside the plugin, so that the slice can be removed?
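
To make the second question concrete, here is a minimal sketch of the address/offset idea in plain Python. It is not TensorRT or plugin API code. The function names, the row-major [rows, cols] layout, and the weighted-sum consumer are all assumptions made for illustration. It only shows that a consumer which indexes the original buffer at `r * cols + start + j` gets the same result as one that reads a separately copied slice, so the copy can be skipped.

```python
# Hypothetical illustration only: none of these names come from TensorRT.
# Assumes a row-major [rows, cols] float buffer and a column slice.

def slice_copy(src, rows, cols, start, width):
    """Materialize columns [start, start + width) into a packed buffer."""
    out = []
    for r in range(rows):
        base = r * cols + start
        out.extend(src[base:base + width])
    return out


def consume_packed(sliced, rows, width, weights):
    """Downstream op that expects an already-sliced [rows, width] buffer."""
    return [
        sum(sliced[r * width + j] * weights[j] for j in range(width))
        for r in range(rows)
    ]


def consume_with_offset(src, rows, cols, start, width, weights):
    """Same op, but it computes the offset itself and reads the original
    buffer directly, so no separate slice copy is needed."""
    return [
        sum(src[r * cols + start + j] * weights[j] for j in range(width))
        for r in range(rows)
    ]


rows, cols = 3, 8
src = [float(i) for i in range(rows * cols)]
start, width = 2, 4
weights = [1.0, 0.5, 0.25, 0.125]

via_copy = consume_packed(
    slice_copy(src, rows, cols, start, width), rows, width, weights
)
via_offset = consume_with_offset(src, rows, cols, start, width, weights)
print(via_copy == via_offset)
```

In a real plugin the same index arithmetic would live inside the CUDA kernel, with the slice start and the original row stride passed in as plugin parameters. Whether that is practical depends on what actually consumes the slice output in the network.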

thanks!

ttyio on Mar 17, 2021