server: python backend error: c_python_backend_utils.TritonModelException: Tensor is stored in GPU and cannot be converted to NumPy
Description
I am currently using the Python backend's BLS feature: I call another TensorRT model through the pb_utils.InferenceRequest interface and the call succeeds, but the result tensor is stored on the GPU, and I can't find an interface to copy it back from the GPU.
Triton Information 22.01
Are you using the Triton container or did you build it yourself? no
Expected behavior
Can the Python backend copy InferenceRequest results directly to the CPU?
Here is my debugging information:
(Pdb)
> /fas_repo/bls_model/1/fas_pipe.py(60)face_detect()
-> inputs=[images],
(Pdb)
> /fas_repo/bls_model/1/fas_pipe.py(61)face_detect()
-> requested_output_names=self.outputs_0)
(Pdb)
> /fas_repo/bls_model/1/fas_pipe.py(58)face_detect()
-> infer_request = pb_utils.InferenceRequest(
(Pdb)
> /fas_repo/bls_model/1/fas_pipe.py(62)face_detect()
-> infer_response = infer_request.exec()
(Pdb)
> /fas_repo/bls_model/1/fas_pipe.py(65)face_detect()
-> confs = pb_utils.get_output_tensor_by_name(infer_response, 'class')
(Pdb)
> /fas_repo/bls_model/1/fas_pipe.py(66)face_detect()
-> locs = pb_utils.get_output_tensor_by_name(infer_response, 'bbox')
(Pdb) p confs
<c_python_backend_utils.Tensor object at 0x7f08f1716130>
(Pdb) dir(confs)
['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'as_numpy', 'from_dlpack', 'is_cpu', 'name', 'to_dlpack', 'triton_dtype']
(Pdb) p confs.is_cpu()
False
(Pdb) p confs.as_numpy()
*** c_python_backend_utils.TritonModelException: Tensor is stored in GPU and cannot be converted to NumPy.
(Pdb)
This is the code that sends the request:
......
import pdb
pdb.set_trace()
images = pb_utils.Tensor("images", preprocessed_imgs)
infer_request = pb_utils.InferenceRequest(
    model_name=self.model_name0,
    inputs=[images],
    requested_output_names=self.outputs_0)
infer_response = infer_request.exec()
# if infer_response.has_error():
#     return False
confs = pb_utils.get_output_tensor_by_name(infer_response, 'class')
locs = pb_utils.get_output_tensor_by_name(infer_response, 'bbox')
......
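One possible workaround, sketched below under the assumption that a CUDA-aware library such as PyTorch is available in the Python backend environment (CuPy would work the same way): use the to_dlpack() method visible in the pdb session above to hand the GPU tensor to that library, and let it perform the device-to-host copy. This is a sketch, not an official Triton API beyond the is_cpu/as_numpy/to_dlpack methods shown in the dir() output.

```python
def tensor_to_numpy(tensor):
    """Copy a Triton output tensor to host memory as a NumPy array.

    `tensor` is assumed to expose the pb_utils.Tensor interface shown
    in the pdb session above: is_cpu(), as_numpy(), to_dlpack().
    """
    if tensor.is_cpu():
        # Already on the host: as_numpy() works directly.
        return tensor.as_numpy()
    # GPU tensor: export via DLPack and let PyTorch (an assumption;
    # any DLPack-aware CUDA library would do) copy it to the host.
    import torch.utils.dlpack as torch_dlpack
    return torch_dlpack.from_dlpack(tensor.to_dlpack()).cpu().numpy()
```

With this helper, the snippet above would become `confs_np = tensor_to_numpy(confs)` instead of calling as_numpy() directly.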
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (5 by maintainers)
Any update on this? I am currently blocked by it.
Looks like the DLPack protocol has changed a bit since we designed this interface in Python backend and Numpy is using a newer version. I’ll file a ticket for improving the DLPack support in Python backend.
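For CPU-resident tensors, newer NumPy (1.22+, which is what the maintainer's comment about NumPy using a newer DLPack version refers to) can consume DLPack producers directly via np.from_dlpack. A minimal illustration of the protocol, using a NumPy array as both producer and consumer:

```python
import numpy as np

# NumPy arrays implement __dlpack__, so from_dlpack returns a
# zero-copy view over the same buffer rather than a new allocation.
src = np.arange(6, dtype=np.float32).reshape(2, 3)
dst = np.from_dlpack(src)

assert np.shares_memory(src, dst)  # same underlying buffer, no copy
assert dst.shape == (2, 3)
```

GPU tensors are exactly the case this zero-copy path cannot cover: NumPy has no device memory, so a DLPack-aware CUDA library has to do the host copy first.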
Hi, any update now? Blocked by this issue too.
numpy version: 1.23.3
tritonserver version: 22.09
we can use
Sorry for the delay. We are working on this ticket and hopefully it should be available soon.
There are many circumstances that can lead to a BLS output being on the GPU. For example, the backend itself decides which device to place its output tensors on. The Python backend deliberately provides outputs on the same device it received them, to avoid unnecessary data movement.
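For readers hitting this on newer releases: to the best of my knowledge, later versions of the Python backend (not 22.01; an upgrade is required) added a preferred_memory argument to pb_utils.InferenceRequest that asks Triton to place BLS outputs in CPU memory, so as_numpy() works directly. A hypothetical sketch, written as a function that takes the pb_utils module explicitly so it can be read outside a model.py:

```python
def build_cpu_request(pb_utils, model_name, inputs, output_names):
    """Build a BLS request asking Triton to place outputs on the CPU.

    Assumes a python_backend recent enough to expose PreferredMemory
    and TRITONSERVER_MEMORY_CPU; device id 0 is a placeholder.
    """
    return pb_utils.InferenceRequest(
        model_name=model_name,
        inputs=inputs,
        requested_output_names=output_names,
        preferred_memory=pb_utils.PreferredMemory(
            pb_utils.TRITONSERVER_MEMORY_CPU, 0),
    )
```

Note that preferred_memory is a preference, not a guarantee, so checking is_cpu() on each output tensor before calling as_numpy() remains good practice.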