server: TorchScript model loads successfully but fails with "CUDA error: CUBLAS_STATUS_NOT_INITIALIZED" when called for inference

Description I converted a PyTorch model to TorchScript using the following script: https://gist.github.com/keskarnitish/1061cbd101ab186e2d80c7877517e7ee#file-saved_pytorch_model-py

I tested the model using

import torch
model = torch.jit.load('model.pt')
example_outputs = model(example_inputs['input_ids'])

and it worked as expected.

I then deployed tritonserver:20.03-py3 on GKE, on a node with a T4 GPU.

I ran nvidia-smi on the node and got:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P0    32W /  70W |   3163MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
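The memory row above already shows 3163MiB in use while the process table lists nothing. As a minimal sketch of how that row can be read programmatically, the snippet below pulls the used and total figures out of the table text with a regular expression. The names `parse_gpu_memory`, `MEMORY_PATTERN`, and `SAMPLE` are hypothetical and only for illustration; the sample row is copied from the output above.

```python
import re

# Memory row copied from the nvidia-smi output above.
SAMPLE = "| N/A   62C    P0    32W /  70W |   3163MiB / 15079MiB |      0%      Default |"

# Matches "<used>MiB / <total>MiB"; the power column ("32W / 70W") does not match.
MEMORY_PATTERN = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def parse_gpu_memory(text):
    """Return (used_mib, total_mib) from nvidia-smi table text, or None if absent."""
    match = MEMORY_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


used, total = parse_gpu_memory(SAMPLE)
free = total - used
print(f"used={used} MiB, total={total} MiB, free={free} MiB")
```

This only handles the first GPU row it finds; a multi-GPU node would need `finditer` instead of `search`.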

The Triton server successfully loaded the model on the node. I checked the API status, which reported the model as ready.

But when I ran perf_client, I got the following in the server logs:

I0525 05:24:42.733448 1 libtorch_backend.cc:538] Running bert with 1 request payloads
I0525 05:24:42.734669 1 pinned_memory_manager.cc:131] pinned memory allocation: size 256, addr 0x7f8a20000090
I0525 05:24:43.009041 1 libtorch_backend.cc:804] CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
The above operation failed in interpreter.
Traceback (most recent call last):
Serialized   File "code/__torch__.py", line 9
    _0 = self.model
    input_ids = torch.to(data, dtype=4, layout=0, device=torch.device("cuda"), pin_memory=False, non_blocking=False, copy=False, memory_format=None)
    return ((_0).forward(input_ids, ),)
             ~~~~~~~~~~~ <--- HERE
Serialized   File "code/__torch__/transformers/modeling_bert.py", line 10, in forward
    input_ids: Tensor) -> Tensor:
    _0 = self.classifier
    _1 = (self.dropout).forward((self.bert).forward(input_ids, ), )
                                 ~~~~~~~~~~~~~~~~~~ <--- HERE
    return (_0).forward(_1, )
class BertModel(Module):
Serialized   File "code/__torch__/transformers/modeling_bert.py", line 35, in forward
    _12 = torch.to(extended_attention_mask, 6, False, False, None)
    attention_mask0 = torch.mul(torch.rsub(_12, 1., 1), CONSTANTS.c0)
    _13 = (_3).forward((_4).forward(input_ids, input, ), attention_mask0, )
           ~~~~~~~~~~~ <--- HERE
    return (_2).forward(_13, )
class BertEmbeddings(Module):
Serialized   File "code/__torch__/transformers/modeling_bert.py", line 73, in forward
    attention_mask: Tensor) -> Tensor:
    _26 = getattr(self.layer, "1")
    _27 = (getattr(self.layer, "0")).forward(argument_1, attention_mask, )
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _28 = getattr(self.layer, "2")
    _29 = (_26).forward(_27, attention_mask, )
Serialized   File "code/__torch__/transformers/modeling_bert.py", line 107, in forward
    _49 = self.output
    _50 = self.intermediate
    _51 = (self.attention).forward(argument_1, attention_mask, )
           ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _52 = (_49).forward((_50).forward(_51, ), _51, )
    return _52
Serialized   File "code/__torch__/transformers/modeling_bert.py", line 119, in forward
    attention_mask: Tensor) -> Tensor:
    _53 = self.output
    _54 = (self.self).forward(argument_1, attention_mask, )
           ~~~~~~~~~~~~~~~~~~ <--- HERE
    return (_53).forward(_54, argument_1, )
class BertSelfAttention(Module):
Serialized   File "code/__torch__/transformers/modeling_bert.py", line 134, in forward
    _56 = self.value
    _57 = self.key
    _58 = (self.query).forward(argument_1, )
           ~~~~~~~~~~~~~~~~~~~ <--- HERE
    _59 = (_57).forward(argument_1, )
    _60 = (_56).forward(argument_1, )
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py(1612): linear
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/linear.py(87): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(216): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(314): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(368): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(407): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(734): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py(1142): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
<ipython-input-2-afc347149dec>(9): forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(534): _slow_forward
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py(548): __call__
/usr/local/lib/python3.6/dist-packages/torch/jit/__init__.py(1027): trace_module
/usr/local/lib/python3.6/dist-packages/torch/jit/__init__.py(875): trace
<ipython-input-2-afc347149dec>(13): <module>
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py(2882): run_code
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py(2822): run_ast_nodes
/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py(2718): run_cell
/usr/local/lib/python3.6/dist-packages/ipykernel/zmqshell.py(537): run_cell
/usr/local/lib/python3.6/dist-packages/ipykernel/ipkernel.py(208): do_execute
/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py(399): execute_request
/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py(233): dispatch_shell
/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py(283): dispatcher
/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py(277): null_wrapper
/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py(438): _run_callback
/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py(486): _handle_recv
/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py(456): _handle_events
/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py(277): null_wrapper
/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py(888): start
/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py(499): start
/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py(664): launch_instance
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py(16): <module>
/usr/lib/python3.6/runpy.py(85): _run_code
/usr/lib/python3.6/runpy.py(193): _run_module_as_main
Serialized   File "code/__torch__/torch/nn/modules/linear.py", line 9, in forward
    argument_1: Tensor) -> Tensor:
    _0 = self.bias
    output = torch.matmul(argument_1, torch.t(self.weight))
             ~~~~~~~~~~~~ <--- HERE
    return torch.add_(output, _0, alpha=1)

I0525 05:24:43.009080 1 pinned_memory_manager.cc:158] pinned memory deallocation: addr 0x7f8a20000090
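Note that the CUDA error above is logged with the `I` (info) severity prefix, so filtering the log for error-level lines alone would miss it. The following is a rough sketch, assuming the glog-style prefix seen in these lines (severity letter, MMDD date, time, thread id, `file:line]`), of how the failing line can be picked out by message text instead. `LOG_PATTERN` and `find_cuda_errors` are hypothetical names, and the sample lines are the prefixed lines from the log above.

```python
import re

# glog-style prefix: severity, MMDD, time, thread id, source file:line, then the message.
LOG_PATTERN = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<thread>\d+) (?P<source>[\w.]+):(?P<line>\d+)\] (?P<message>.*)$"
)

# Prefixed lines copied from the server log above (traceback body omitted).
SAMPLE_LOG = """\
I0525 05:24:42.733448 1 libtorch_backend.cc:538] Running bert with 1 request payloads
I0525 05:24:42.734669 1 pinned_memory_manager.cc:131] pinned memory allocation: size 256, addr 0x7f8a20000090
I0525 05:24:43.009041 1 libtorch_backend.cc:804] CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
I0525 05:24:43.009080 1 pinned_memory_manager.cc:158] pinned memory deallocation: addr 0x7f8a20000090
"""


def find_cuda_errors(log_text):
    """Return (source_file, line_number, message) for each line mentioning a CUDA error."""
    results = []
    for raw_line in log_text.splitlines():
        match = LOG_PATTERN.match(raw_line)
        # Match on the message text, not the severity letter, since the error is logged as "I".
        if match and "CUDA error" in match.group("message"):
            results.append(
                (match.group("source"), int(match.group("line")), match.group("message"))
            )
    return results


errors = find_cuda_errors(SAMPLE_LOG)
for source, line_number, message in errors:
    print(f"{source}:{line_number}: {message}")
```

Traceback lines without the prefix are skipped by the pattern, so the sketch reports only where the backend logged the failure, not the TorchScript call chain.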

Triton Information What version of Triton are you using? 20.03

Are you using the Triton container or did you build it yourself? Triton container

To Reproduce Steps to reproduce the behavior.

See description.

Expected behavior The server should not return any error.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

Thanks for the detailed bug report, we will take a look.