TensorRT: CopyPackedKernel is taking too long, and how to optimize it

Description

I have a model that uses a slice operator for feature crossing. Profiling shows that the slice operator ends up calling CopyPackedKernel, which accounts for almost all of the GPU kernel time. I also re-implemented the slice operator myself, but the result was the same. I don't know when CopyPackedKernel is invoked or how to optimize it.

nsys profile -o test --stats=true python infer.py -e test.plan

output:

Time(%)      Total Time   Instances         Average         Minimum         Maximum  Name
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------------------------------------------
   99.9         9863089        2835          3479.0            3423            3872  void genericReformat::copyPackedKernel<float, float, true, true, genericReformat::IdentityCoordMapper<4>, 4>(unsigned int, unsigned int, void const*, genericReformat::ArrayN<4>, genericReformat::ArrayNWithReducedDivisors<4>, genericReformat::ArrayN<4>, int, int, int, float const*, void*, genericReformat::ArrayN<4>, genericReformat:
    0.1            7264           3          2421.3            2304            2656  slice(float const*, float*, int, int, int, int)
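As a quick sanity check on the summary above, here is a minimal sketch that recomputes the per-call average and the share of kernel time from the reported totals. It assumes the times are in nanoseconds (the header does not state a unit), and the helper names `per_call_average` and `share_percent` are made up for illustration.

```python
# Totals and instance counts copied from the nsys summary above.
# Assumption: times are in nanoseconds; the table header gives no unit.
KERNELS = {
    "copyPackedKernel": {"total": 9863089, "instances": 2835},
    "slice": {"total": 7264, "instances": 3},
}


def per_call_average(total, instances):
    """Average time of one kernel launch."""
    return total / instances


def share_percent(total, grand_total):
    """Share of the summed kernel time, in percent."""
    return 100.0 * total / grand_total


grand_total = sum(k["total"] for k in KERNELS.values())
for name, k in KERNELS.items():
    avg = per_call_average(k["total"], k["instances"])
    share = share_percent(k["total"], grand_total)
    print(f"{name}: {k['instances']} calls, {avg:.1f} per call, {share:.1f}% of kernel time")
```

Under that assumption, each copyPackedKernel launch takes only a few microseconds, but it is launched 2835 times versus 3 launches of the custom slice kernel, so the launch count is what makes up the total.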

Environment

TensorRT Version: 7.2.2.1
NVIDIA GPU: T4
NVIDIA Driver Version: 450.51.06
CUDA Version: 11.1
CUDNN Version:
Operating System:
Python Version (if applicable): 3.8
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version): Container 20.12

I need help, thank you very much.

Most upvoted comments

Hello @zhaohb, I just checked: those kernels are not really a reformat, they are the slice implementation. Slice is memory bound, so to optimize this we need to consider other approaches. I have two questions:

  • What is the input of the slice in your real network? Is it constant? If so, we can precompute the slice results and replace them with constants.
  • What consumes the output of the slice in your real network? If the consumers are plugins, can we adjust the plugin implementation, for example by calculating the address/offset inside the plugin, so that the slice can be removed?
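To make the two ideas above concrete, here is a minimal pure-Python sketch. It is not TensorRT code: the buffer layout (row-major, one flat list), the sum reduction standing in for the consumer, and all function names are assumptions chosen for illustration. It shows only the address arithmetic a plugin could do in place of a separate slice layer, and the one-time precomputation that applies when the slice input is constant.

```python
def slice_then_consume(flat, rows, row_stride, start, length):
    """Baseline: copy the slice into a new buffer, then reduce the copy."""
    sliced = []
    for i in range(rows):
        base = i * row_stride + start          # offset of the slice in row i
        sliced.extend(flat[base:base + length])  # extra pass over memory
    return [sum(sliced[i * length:(i + 1) * length]) for i in range(rows)]


def consume_with_offset(flat, rows, row_stride, start, length):
    """Alternative: the consumer reads the original buffer at computed offsets."""
    out = []
    for i in range(rows):
        base = i * row_stride + start
        out.append(sum(flat[base + j] for j in range(length)))
    return out


def fold_constant_slice(flat, rows, row_stride, start, length):
    """Constant-input case: run once offline and store the result as a constant."""
    folded = []
    for i in range(rows):
        base = i * row_stride + start
        folded.extend(flat[base:base + length])
    return folded


# Hypothetical 4x6 row-major tensor; slice columns 2..4 of every row.
data = [float(v) for v in range(24)]
baseline = slice_then_consume(data, rows=4, row_stride=6, start=2, length=3)
direct = consume_with_offset(data, rows=4, row_stride=6, start=2, length=3)
assert baseline == direct

constant = fold_constant_slice(data, rows=4, row_stride=6, start=2, length=3)
print(direct)    # per-row sums computed without a separate slice copy
print(constant)  # values that could be baked into the network as a constant
```

Both paths give the same result; the second one simply never materializes the sliced tensor, which is the point of moving the offset calculation into the consumer.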

thanks!