server: inference of TorchScript model much slower with Triton than in Python environment

**Description**
A clear and concise description of what the bug is.
I converted models from an open-source OCR project (EasyOCR) with torch.jit.trace() and deployed them on Triton for inference, but the inference speed is much slower than when I run the traced models directly in a Python environment.

**Triton Information**
What version of Triton are you using? 21.04 (py3)
Are you using the Triton container or did you build it yourself? Container.

**To Reproduce**
Steps to reproduce the behavior:

`pip install easyocr`

```python
import easyocr
import torch

reader = easyocr.Reader(['ch_sim', 'en'], quantize=False, gpu=False)

imgH = 640
imgW = 352
max_length = 36
batch_size = 8

image = torch.ones([1, 3, imgH, imgW])
image = torch.autograd.Variable(image).cuda()
img = torch.ones([batch_size, 1, 64, 256]).cuda()
text = torch.ones([1, int(max_length + 1)]).cuda()

detector = reader.detector.cuda()
scripted_detector = torch.jit.trace(detector, image)
scripted_detector.save('scripted_detector.pt')

recognizer = reader.recognizer.cuda()
scripted_recognizer = torch.jit.trace(recognizer, (img, text))
scripted_recognizer.save('scripted_recognizer.pt')
```

Detector model configuration (`config.pbtxt`):

```
name: "detector"
platform: "pytorch_libtorch"
max_batch_size: 1
input [
  {
    name: "input_0"
    data_type: TYPE_FP32
    dims: [3, 352, 640]
  }
]
output [
  {
    name: "output_0"
    data_type: TYPE_FP32
    dims: [-1, -1]
  },
  {
    name: "output_1"
    data_type: TYPE_FP32
    dims: [-1, -1]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]
  }
]
```
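For reference, the Triton-side latency can be measured with a simple client loop such as the sketch below. This is a minimal illustration, assuming the detector configuration above and a server listening on the default HTTP port 8000; the iteration count and warm-up are arbitrary choices, not the exact client used in this issue:

```python
import time

import numpy as np
import tritonclient.http as httpclient

# Connect to a locally running Triton server (default HTTP port assumed).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Shape includes the batch dimension because max_batch_size > 0.
data = np.ones([1, 3, 352, 640], dtype=np.float32)
inp = httpclient.InferInput("input_0", [1, 3, 352, 640], "FP32")
inp.set_data_from_numpy(data)

# Warm up once, then time repeated requests.
client.infer("detector", inputs=[inp])
n = 100
start = time.perf_counter()
for _ in range(n):
    client.infer("detector", inputs=[inp])
print(f"avg end-to-end latency (ms): {(time.perf_counter() - start) / n * 1000:.2f}")
```

The perf_analyzer tool that ships with the Triton client/SDK container can also be used to collect comparable latency and throughput numbers.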

Recognizer model configuration (`config.pbtxt`):

```
name: "recognizer"
platform: "pytorch_libtorch"
max_batch_size: 1
input [
  {
    name: "input_0"
    data_type: TYPE_FP32
    dims: [1, 64, 256]
  },
  {
    name: "input_1"
    data_type: TYPE_INT64
    dims: [1, 37]
  }
]
output [
  {
    name: "output_0"
    data_type: TYPE_FP32
    dims: [-1, -1]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]
  }
]
```

**Describe the models** (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
A VGG-based CNN (detector) and an RCNN (recognizer), both exported from EasyOCR as TorchScript.

**Expected behavior**
A clear and concise description of what you expected to happen.
Inference time similar to or faster than running the EasyOCR project directly in Python on GPU.
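For comparison, the Python-environment baseline can be measured by timing the traced model directly. Below is a minimal sketch assuming the traced recognizer saved above; the warm-up and iteration counts are arbitrary, and the inputs simply mirror the trace inputs:

```python
import time

import torch

# Load the traced recognizer saved above and run it directly in Python.
scripted_recognizer = torch.jit.load('scripted_recognizer.pt').cuda().eval()
img = torch.ones([8, 1, 64, 256]).cuda()
text = torch.ones([1, 37]).cuda()

with torch.no_grad():
    # Warm up so one-time CUDA/JIT optimization cost doesn't skew the timing.
    for _ in range(10):
        scripted_recognizer(img, text)
    torch.cuda.synchronize()

    n = 100
    start = time.perf_counter()
    for _ in range(n):
        scripted_recognizer(img, text)
    torch.cuda.synchronize()
    print(f"avg latency (ms): {(time.perf_counter() - start) / n * 1000:.2f}")
```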

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 37 (19 by maintainers)

Most upvoted comments

It looks like it is fixed.

Our EfficientNet latency is now comparable to what we saw on Triton 20.07.

@koval reran his repro script, and performance for 21.11 with autocast enabled looks good:

```
# python3 perf_conv2d.py 1000
Measuring latency with autocast_enabled=False
Avg time per zeropad2d + conv2d (ms): 0.2254070260005392

Measuring latency with autocast_enabled=True
Avg time per zeropad2d + conv2d (ms): 0.22318216899839172
```
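For context, the numbers above come from timing a zeropad2d + conv2d pair with and without autocast. A hypothetical reconstruction of such a microbenchmark is sketched below; the layer sizes, input shape, and iteration structure are assumptions, not the actual perf_conv2d.py script:

```python
import sys
import time

import torch


def measure(autocast_enabled, iters):
    # Simple zeropad2d + conv2d pipeline, timed on the GPU.
    pad = torch.nn.ZeroPad2d(1).cuda()
    conv = torch.nn.Conv2d(64, 64, kernel_size=3).cuda()
    x = torch.randn(8, 64, 112, 112, device="cuda")

    with torch.no_grad(), torch.cuda.amp.autocast(enabled=autocast_enabled):
        for _ in range(10):  # warm-up
            conv(pad(x))
        torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(iters):
            conv(pad(x))
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return elapsed / iters * 1000  # ms per iteration


if __name__ == "__main__":
    iters = int(sys.argv[1]) if len(sys.argv) > 1 else 1000
    for enabled in (False, True):
        print(f"Measuring latency with autocast_enabled={enabled}")
        print(f"Avg time per zeropad2d + conv2d (ms): {measure(enabled, iters)}")
```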

We have one EfficientNet-based model still showing a 2x latency hit (vs. 10x previously), but this may be a separate issue. We’ll do some profiling and open a fresh ticket if we can narrow down the cause.

Thank you for resolving this!

@Tabrizian All my investigation was done on an NVIDIA T4 GPU in a g4dn.xlarge AWS instance. The numbers for the official PyTorch releases come from the nvcr.io/nvidia/tritonserver:21.07-py3 container, where I installed PyTorch via pip (both the stable 1.9 version and the nightly 1.10 and 1.11 versions).