server: Python backend on CPU is slower when serving a PyTorch model

Description
I have a Python model that uses a pre-trained RoBERTa model for inference. I added this model to Triton and serve it with the Python backend. We also serve the exact same Python code/model with a FastAPI application, and both run on hardware with the same specs. When I compared the two on CPU, the latency with Triton is much higher. I used the PyTorch profiler to debug what is causing the higher latencies with Triton. The screenshots below show the profiler output.

Triton-CPU: (PyTorch profiler output screenshot)

FastAPI-CPU: (PyTorch profiler output screenshot)

Based on the screenshots, I can see that native_layer_norm in particular takes significantly longer under Triton than in the same model running in our FastAPI application. native_layer_norm is part of the pre-trained RoBERTa model.
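
For context, a minimal, self-contained sketch of how such an operator-level CPU profile can be captured (torch 1.6 exposes this as torch.autograd.profiler; a small LayerNorm stack stands in for the actual RoBERTa model):

import torch
from torch.autograd import profiler

# A small stand-in model; LayerNorm shows up as native_layer_norm in the profile,
# just like in the RoBERTa traces above.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.LayerNorm(768),
)
model.eval()
x = torch.randn(1, 256, 768)

with torch.no_grad():
    with profiler.profile(record_shapes=True) as prof:
        model(x)

# Per-operator table sorted by self CPU time, similar to the screenshots above.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))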

Triton Information
What version of Triton are you using? Version: 21.07

Are you using the Triton container or did you build it yourself? I built the image myself based on r21.07, but I have also tested serving the model with the official Triton containers (r21.07 and r21.08) and the issue remains the same.

To Reproduce
Steps to reproduce the behavior.

Describe the models (framework, inputs, outputs); ideally include the model configuration file (if using an ensemble, include the model configuration file for that as well).

Dependencies: torch==1.6.0 transformers==3.5.1

config.pbtxt

name: "sample-model"
backend: "python"
max_batch_size: 8

input [
  {
    name: "INPUT0"
    data_type: TYPE_STRING
    dims: [1]
  }
]

output [
  {
    name: "OUTPUT0"
    data_type: TYPE_STRING
    dims: [1]
  }
]

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "<path to execution env>"}
}

instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
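
For reference, the EXECUTION_ENV_PATH parameter points the Python backend at a conda-pack'd environment archive containing the model's dependencies; a minimal sketch using conda-pack's Python API (the environment name and output filename are placeholders, not values from this issue):

# Sketch: package a conda environment (with torch==1.6.0 and transformers==3.5.1
# installed) into the tarball referenced by EXECUTION_ENV_PATH.
# "triton-python-env" and the output filename are placeholders.
import conda_pack

conda_pack.pack(
    name="triton-python-env",     # existing conda environment to package
    output="python_env.tar.gz",   # archive that EXECUTION_ENV_PATH should point to
)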

Expected behavior
Ideally, the performance should be similar when the same model is run with Triton.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 29 (16 by maintainers)

Most upvoted comments

@tanmayv25 In my initial testing the results look good; the performance is greatly improved. Below is a summary from the initial testing.

Before Fix

Inferences/Second vs. Client Average Batch Latency
Concurrency: 5, throughput: 3.05 infer/sec, latency 1655480 usec
Concurrency: 10, throughput: 3.17 infer/sec, latency 3153540 usec
Concurrency: 15, throughput: 3.21 infer/sec, latency 4687196 usec

After Fix

Inferences/Second vs. Client Average Batch Latency
Concurrency: 5, throughput: 17.72 infer/sec, latency 282101 usec
Concurrency: 10, throughput: 17.75 infer/sec, latency 562845 usec
Concurrency: 15, throughput: 17.95 infer/sec, latency 836044 usec

I have some more testing pending. I will update here once I am done with the complete testing.

@tanmayv25 ok, thank you very much.

@tanmayv25 Thank you for running the tests and sharing the results. For my testing, I used JMeter in non-GUI mode. Also, the tests run on an instance separate from the one where Triton and the FastAPI app are running, so they shouldn't affect Triton's performance. Let me re-run the tests on my end; I will share the results and also share the FastAPI script.

Ok… Let me run with a concurrency of 10 and share that with you.

@tanmayv25 Unfortunately, I can't share the actual model, but I tried to reproduce the issue using a different model. It is not as slow as our model, but as the request load increases it performs slower and slower. Please find the required files below. You can download the model files from here: https://drive.google.com/drive/folders/1nzC2_GFh27mt8KP4dfGxewFP8BkEQEHH?usp=sharing

I built this model using the notebook below, saved the model state_dict, and used it for inference: https://colab.research.google.com/github/DhavalTaunk08/NLP_scripts/blob/master/sentiment_analysis_using_roberta.ipynb

Triton Model.py

import numpy as np
import json
import triton_python_backend_utils as pb_utils

import torch

from transformers import RobertaModel, RobertaTokenizer


class RobertaClass(torch.nn.Module):
    def __init__(self):
        super(RobertaClass, self).__init__()
        self.l1 = RobertaModel.from_pretrained("roberta-base")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.3)
        self.classifier = torch.nn.Linear(768, 5)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

class TritonPythonModel:

    def initialize(self, args):

        # You must parse model_config. JSON string is not parsed here
        self.model_config = model_config = json.loads(args['model_config'])

        # Get OUTPUT0 configuration
        output0_config = pb_utils.get_output_config_by_name(
            model_config, "OUTPUT0")

        # Convert Triton types to numpy types
        self.output0_dtype = pb_utils.triton_string_to_numpy(
            output0_config['data_type'])
        self.model = RobertaClass()
        self.model.load_state_dict(torch.load('/models/roberta_test/1/files/pytorch_roberta_sentiment.bin', map_location=torch.device('cpu')))
        self.model.eval()
        self.tokenizer = RobertaTokenizer.from_pretrained('/models/roberta_test/1/files/', truncation=True, do_lower_case=True)

    def preprocess_data(self, sentence):
        inputs = self.tokenizer.encode_plus(
            sentence,
            None,
            add_special_tokens=True,
            max_length=256,
            pad_to_max_length=True,
            return_token_type_ids=True
        )

        ids = [inputs['input_ids']]
        mask = [inputs['attention_mask']]
        token_type_ids = inputs["token_type_ids"]
        data = {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long)
        }

        return data

    def execute(self, requests):

        output0_dtype = self.output0_dtype
        responses = []
        for request in requests:
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            payload = json.loads(in_0.as_numpy()[0][0].decode("utf-8"))
            sentence = payload["data"]
            data = self.preprocess_data(sentence)
            with torch.no_grad():
                outputs = self.model(data['ids'], data['mask'], data['token_type_ids']).squeeze()
            result = torch.argmax(outputs).item()
            out_tensor_0 = pb_utils.Tensor("OUTPUT0",
                                           np.array(str(result), dtype='object').astype(output0_dtype))

            inference_response = pb_utils.InferenceResponse(
                output_tensors=[out_tensor_0])
            responses.append(inference_response)

        return responses

    def finalize(self):
        print('Cleaning up...')

Payload for Triton:

{
  "inputs": [
    {
      "name": "INPUT0",
      "shape": [ 1, 1 ],
      "datatype": "BYTES",
      "data": [
        ["{\"data\":\"A series of escapades demonstrating the adage that what is good for the goose\"}"]
      ]
    }
  ]
}
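
For completeness, this payload can be POSTed to Triton's KServe v2 HTTP endpoint; a small sketch using the requests library (the model name roberta_test is inferred from the paths in the Triton model.py, and localhost:8000 assumes Triton's default HTTP port):

import json
import requests

# Build the same BYTES payload as shown above.
payload = {
    "inputs": [
        {
            "name": "INPUT0",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": [[json.dumps({"data": "A series of escapades demonstrating the adage that what is good for the goose"})]],
        }
    ]
}

# Model name and endpoint are assumptions (Triton's default HTTP port is 8000).
resp = requests.post("http://localhost:8000/v2/models/roberta_test/infer", json=payload)
print(resp.json())  # OUTPUT0 holds the predicted class label as a string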

python app.py

import torch
from transformers import RobertaModel, RobertaTokenizer


class RobertaClass(torch.nn.Module):
    def __init__(self):
        super(RobertaClass, self).__init__()
        self.l1 = RobertaModel.from_pretrained("roberta-base")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.3)
        self.classifier = torch.nn.Linear(768, 5)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output


class SentimentModel:

    def __init__(self):
        self.model = RobertaClass()
        self.model.load_state_dict(torch.load('files/pytorch_roberta_sentiment.bin', map_location=torch.device('cpu')))
        self.model.eval()
        self.tokenizer = RobertaTokenizer.from_pretrained('files', truncation=True, do_lower_case=True)

    def predict(self, request):
        sentence = request["data"]
        data = self.preprocess_data(sentence)
        with torch.no_grad():
            outputs = self.model(data['ids'], data['mask'], data['token_type_ids']).squeeze()
            result = torch.argmax(outputs).item()

        return {"result": str(result)}

    def preprocess_data(self, sentence):
        inputs = self.tokenizer.encode_plus(
            sentence,
            None,
            add_special_tokens=True,
            max_length=256,
            pad_to_max_length=True,
            return_token_type_ids=True
        )

        ids = [inputs['input_ids']]
        mask = [inputs['attention_mask']]
        token_type_ids = inputs["token_type_ids"]
        data = {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long)
        }

        return data

if __name__ == '__main__':
    sm = SentimentModel()
    data = {"data": "A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story ."}
    print(sm.predict(data))
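
The FastAPI application used for the comparison is not included in the issue; a minimal sketch of how SentimentModel above could be served with FastAPI (the /predict route and request schema are assumptions, not the original script):

# Hypothetical FastAPI wrapper around SentimentModel from app.py above.
from fastapi import FastAPI
from pydantic import BaseModel

from app import SentimentModel

class PredictRequest(BaseModel):
    data: str

app = FastAPI()
model = SentimentModel()

@app.post("/predict")
def predict(request: PredictRequest):
    # Same preprocessing and inference path as the Triton model.py above.
    return model.predict({"data": request.data})

# Run with, e.g.: uvicorn server:app --host 0.0.0.0 --port 8080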

@SaratM34 Was the same version of PyTorch used in both cases? The slowdown appears to be framework specific and not from inside Triton. cc @Tabrizian