pytriton: Error deploying model on Vertex AI

Description

Hi! I’m trying to deploy a Stable Diffusion model on GCP Vertex AI using the PyTriton backend. The code works on my local machine, where I’ve been able to send requests and receive inference responses.

The problem arises when I try to create an endpoint with Vertex AI. The server fails to start with this error:

WARNING - pytriton.server.triton_server: Triton Inference Server exited with failure. Please wait.

And then:

failed to start Vertex AI service: Invalid argument - Expect the model repository contains only a single model if default model is not specified
...
raise PyTritonClientTimeoutError("Waiting for server to be ready timed out.")

I don’t know whether the Vertex AI service error is caused by the server crashing first, or vice versa.

To reproduce

Attaching my server code:

# server
import argparse
import logging
import os
import time
from urllib.parse import urlparse

import numpy as np
import torch
from model import ModelWrapper
from pytriton.decorators import batch
from pytriton.model_config import DynamicBatcher, ModelConfig, Tensor
from pytriton.triton import Triton, TritonConfig

LOGGER = logging.getLogger("StableDiffusion_Img2Img.server")

# DEVICE, DTYPE and PORT, as well as the helpers _decode_img, parse_path and
# download_blob, are defined elsewhere in the project and omitted here.

class _InferFuncWrapper:
    """
    Class wrapper of inference func for triton. Used to also store the model variable
    """

    def __init__(self, model: torch.nn.Module):
        self._model = model

    @batch
    def __call__(self, **inputs) -> dict:
        """
        Main inference function for the Triton backend. Called with a batch of
        requests collected by the @batch decorator. Decodes the inputs, calls
        the model and returns the outputs.

        Args:
            prompts: Batch of strings with the user prompts
            init_images: Batch of initial image to run the diffusion

        Returns
            image: Batch of generated images
        """
        (prompts, init_images) = inputs.values()
        # decode prompts and images
        prompts = [np.char.decode(p.astype("bytes"), "utf-8").item() for p in prompts]
        init_images = [
            np.char.decode(enc_img.astype("bytes"), "utf-8").item()
            for enc_img in init_images
        ]
        init_images = [_decode_img(enc_img) for enc_img in init_images]
        # transform image arrays to tensors and adjust dims to torch usage
        images_tensors = torch.tensor(init_images, dtype=torch.float32).permute(
            0, 3, 1, 2
        )
        LOGGER.debug(f"Prompts: {prompts}")
        LOGGER.debug(f"{len(init_images)} images size: {init_images[0].shape}")
        LOGGER.info("Generating images...")
        # call diffusion model
        outputs = self._model.run(prompts, images_tensors)
        LOGGER.debug(f"Prepared batch response of size: {len(outputs)}")
        return {"image": np.array(outputs)}


def _parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--verbose",
        "-v",
        action="store_true",
        help="Enable verbose logging in debug mode.",
        default=True,
    )

    parser.add_argument(
        "--vertex",
        "-s",
        action="store_true",
        help="Enable copying model files from storage for vertex deployment",
        default=False,
    )

    return parser.parse_args()

def main():
    """Initialize server with model."""
    args = _parse_args()

    # initialize logging
    log_level = logging.DEBUG if args.verbose else logging.INFO
    logging.basicConfig(
        level=log_level, format="%(asctime)s - %(levelname)s - %(name)s: %(message)s"
    )

    if args.vertex:
        LOGGER.debug("Vertex: Loading pipeline from Vertex Storage")
        storage_path = os.environ["AIP_STORAGE_URI"]
    else:
        LOGGER.debug("Loading pipeline locally")
        storage_path = ("") # Path to local files
    
    bucket_name, subdirectory = parse_path(storage_path)
    LOGGER.debug(f"Downloading files... Started at: {time.strftime('%X')}")
    download_blob(bucket_name, subdirectory)
    LOGGER.debug(f"Files downloaded! Finished at: {time.strftime('%X')}")
    folder_path = os.path.join("src", subdirectory)

    LOGGER.debug(f"Running on device: {DEVICE}, dtype: {DTYPE}, triton_port:{PORT}")
    LOGGER.info("Loading pipeline...")
    model = ModelWrapper(logger=LOGGER, folder_path=folder_path)
    LOGGER.info("Pipeline loaded!")

    log_verbose = 1 if args.verbose else 0

    config = TritonConfig(http_port=8015, exit_on_error=True, log_verbose=log_verbose)

    with Triton(config=config) as triton:
        # bind the model with its inference call and configuration
        triton.bind(
            model_name="StableDiffusion_Img2Img",
            infer_func=_InferFuncWrapper(model=model),
            inputs=[
                Tensor(name="prompt", dtype=np.bytes_, shape=(1,)),
                Tensor(name="init_image", dtype=np.bytes_, shape=(1,)),
            ],
            outputs=[
                Tensor(name="image", dtype=np.bytes_, shape=(1,)),
            ],
            config=ModelConfig(
                max_batch_size=4,
                batcher=DynamicBatcher(
                    max_queue_delay_microseconds=100,
                ),
            ),
            strict=True,
        )
        # serve the model for inference
        triton.serve()


if __name__ == "__main__":
    main()

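For context, this is roughly how requests were sent when testing locally (a minimal sketch, not the exact client used; the base64 image encoding and payload shapes are assumptions):

# local client sketch (assumed): send one prompt and one base64-encoded image
import base64

import numpy as np
from pytriton.client import ModelClient

with open("input.png", "rb") as f:
    encoded_image = base64.b64encode(f.read())

with ModelClient("localhost:8015", "StableDiffusion_Img2Img") as client:
    result = client.infer_batch(
        prompt=np.array([["a watercolor landscape".encode("utf-8")]], dtype=np.bytes_),
        init_image=np.array([[encoded_image]], dtype=np.bytes_),
    )
    print(result["image"].shape)
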
When creating the Vertex endpoint, the server predict route is configured as: /v2/models/StableDiffusion_Img2Img/infer

The server health route is configured as: /v2/health/live

The Vertex port is set to 8015, the same as the HTTP port in the Triton configuration.
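
For reference, the model upload and endpoint creation looked roughly like this (a sketch using the google-cloud-aiplatform SDK; project, image URI, display name and machine type are placeholders, not the exact values used):

# Vertex AI model upload / deploy sketch (all values are placeholders)
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="europe-west4")

model = aiplatform.Model.upload(
    display_name="stable-diffusion-img2img",
    serving_container_image_uri="europe-west4-docker.pkg.dev/my-project/repo/sd-pytriton:latest",
    serving_container_predict_route="/v2/models/StableDiffusion_Img2Img/infer",
    serving_container_health_route="/v2/health/live",
    serving_container_ports=[8015],
)

endpoint = model.deploy(
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)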

Observed results and expected behavior

As stated, the server runs on my local machine but fails to initialize the endpoint in Vertex AI. During the Vertex build, the model files are downloaded correctly and the model pipeline is loaded, so the error probably occurs around the triton.bind() call. Attaching the relevant log output:

DEBUG - StableDiffusion_Img2Img.server: Files downloaded! Finished at: 18:28:01
DEBUG - StableDiffusion_Img2Img.server: Running on device: cuda, dtype: torch.float16, triton_port:8015
INFO - StableDiffusion_Img2Img.server: Loading pipeline..
INFO - StableDiffusion_Img2Img.server: Pipeline loaded!

...
2023-11-23 18:29:10,322 - DEBUG - pytriton.triton: Triton Inference Server binaries ready in /root/.cache/pytriton/workspace_y7vpgv3x/tritonserver
2023-11-23 18:29:10,322 - DEBUG - pytriton.utils.distribution: Obtained pytriton module path: /usr/local/lib/python3.10/dist-packages/pytriton
2023-11-23 18:29:10,323 - DEBUG - pytriton.utils.distribution: Obtained nvidia_pytriton.libs path: /usr/local/lib/python3.10/dist-packages/nvidia_pytriton.libs
2023-11-23 18:29:10,323 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://127.0.0.1:8015 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-23 18:29:10,323 - DEBUG - pytriton.client.client: Creating InferenceServerClient for http://127.0.0.1:8015 with {'network_timeout': 60.0, 'connection_timeout': 60.0}
2023-11-23 18:29:10,323 - DEBUG - pytriton.triton: Starting Triton Inference
2023-11-23 18:29:10,324 - DEBUG - pytriton.server.triton_server: Triton Server binary /root/.cache/pytriton/workspace_y7vpgv3x/tritonserver/bin/tritonserver. Environment:
{
...
}
2023-11-23 18:29:10,449 - DEBUG - pytriton.client.utils: Waiting for server to be ready (timeout=119.99996042251587)
2023-11-23 18:29:12,954 - WARNING - pytriton.server.triton_server: Triton Inference Server exited with failure. Please wait
2023-11-23 18:29:12,954 - DEBUG - pytriton.server.triton_server: Triton Inference Server exit code 1
2023-11-23 18:29:12,954 - DEBUG - pytriton.triton: Got callback that tritonserver process finished
2023-11-23 15:31:10.655 Traceback (most recent call last):
2023-11-23 15:31:10.655 File "/home/app/src/server.py", line 200, in <module>
2023-11-23 18:31:10,655 - DEBUG - pytriton.triton: Cleaning model manager, tensor store and workspace.
failed to start Vertex AI service: Invalid argument - Expect the model repository contains only a single model if default model is not specified
2023-11-23 18:31:10,655 - DEBUG - pytriton.utils.workspace: Cleaning workspace dir /root/.cache/pytriton/workspace_y7vpgv3x
raise PyTritonClientTimeoutError("Waiting for server to be ready timed out.")
pytriton.client.exceptions.PyTritonClientTimeoutError: Waiting for server to be ready timed out.

Additional steps taken

Following the timeout error raised by PyTriton, we tried increasing the timeout by setting monitoring_period_s to an arbitrarily high value.
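
Concretely, the change looked roughly like this (a sketch; passing monitoring_period_s to serve() is an assumption about the exact call site, and the value is arbitrary):

# sketch: same bind() as in the server script, but waiting longer while serving
with Triton(config=config) as triton:
    triton.bind(
        model_name="StableDiffusion_Img2Img",
        infer_func=_InferFuncWrapper(model=model),
        inputs=[
            Tensor(name="prompt", dtype=np.bytes_, shape=(1,)),
            Tensor(name="init_image", dtype=np.bytes_, shape=(1,)),
        ],
        outputs=[Tensor(name="image", dtype=np.bytes_, shape=(1,))],
        config=ModelConfig(max_batch_size=4),
        strict=True,
    )
    triton.serve(monitoring_period_s=600)  # assumed parameter; arbitrarily high value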

We’ve also tried adapting the server configuration to Vertex AI with:

TritonConfig(http_port=8015, exit_on_error=True, log_verbose=log_verbose, allow_vertex_ai=True, vertex_ai_port=8080)

but we get the same error.

Environment

Docker base image: nvcr.io/nvidia/pytorch:23.10-py3

Requirements:

torch @ https://download.pytorch.org/whl/cu116/torch-1.12.1%2Bcu116-cp310-cp310-linux_x86_64.whl
diffusers==0.7.2
transformers==4.21.3
ftfy==6.1.1
importlib-metadata==4.13.0
nvidia-pytriton==0.4.1
Pillow==9.5
google-cloud-storage==2.10.0

Any help is appreciated!!

About this issue

  • State: closed
  • Created 7 months ago
  • Reactions: 4
  • Comments: 16

Most upvoted comments

Thanks @sricke! Let us review that and get back to you.

PyTriton 0.5.2 introduced support for Vertex AI. See the example for more details.