server: Docker fails to register CUDA shared memory

Description

The Triton client is unable to register CUDA shared memory when the script is launched via the Docker command (non-interactively), although it works when the container is run in interactive mode.

Triton Information

What version of Triton are you using?

Server: nvcr.io/nvidia/tritonserver:21.03-py3
Client: nvcr.io/nvidia/tritonserver:21.03-py3-sdk

Are you using the Triton container or did you build it yourself?
I am using the Triton container.

Dockerfile.client

FROM nvcr.io/nvidia/tritonserver:21.03-py3-sdk

RUN apt update && apt install -y libb64-dev ffmpeg

docker-compose.yml

version: '2.3'

services:
  triton-server:
    container_name: triton-server
    image: nvcr.io/nvidia/tritonserver:21.03-py3
    privileged: true
    runtime: nvidia
    shm_size: '2gb'
    ports:
      - "8000:8000"
      - "8001:8001"
      - "8002:8002"
    ipc: host
    ulimits:
      stack: 67108864
      memlock: -1
    environment:
      - LD_PRELOAD=/plugins/liblayerplugin.so
      - log-verbose=4
    command: bash -c "tritonserver --model-repository=/models --strict-model-config=false --grpc-infer-allocation-pool-size=16"

  triton-client:
    container_name: triton-client
    build:
      context: .
    network_mode: 'host'
    working_dir: /app/src
    depends_on:
      - triton-server
    environment:
      - log-verbose=4
    privileged: true
    runtime: nvidia
    shm_size: '2gb'
    command: bash -c "python3 simple_grpc_cudashm_client.py --verbose"

To Reproduce

Build the client container and then run docker-compose up. The triton-client container will execute the script simple_grpc_cudashm_client.py.
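In practice the steps are roughly the following (service name triton-client as defined in the compose file above):

docker-compose build triton-client
docker-compose up

The script then fails with the following error: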

unregister_system_shared_memory, metadata ()
triton-client    | 
triton-client    | Unregistered all system shared memory regions
triton-client    | unregister_cuda_shared_memory, metadata ()
triton-client    | 
triton-client    | Unregistered all cuda shared memory regions
triton-client    | register_cuda_shared_memory, metadata ()
triton-client    | name: "output0_data"
triton-client    | raw_handle: "\260iu\001\000\000\000\000\001\000\000\000\000\000\000\000\000\0002\000\000\000\000\000 \003\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\0001\000\000\000\000\000\000\000\242\000\320\301\216\000\000\\\000\000\000\000"
triton-client    | byte_size: 3276800
triton-client    | 
triton-client    | Traceback (most recent call last):
triton-client    |   File "simple_grpc_cudashm_client.py", line 61, in <module>
triton-client    |     triton_client.register_cuda_shared_memory(
triton-client    |   File "/usr/local/lib/python3.8/dist-packages/tritonclient/grpc/__init__.py", line 906, in register_cuda_shared_memory
triton-client    |     raise_error_grpc(rpc_error)
triton-client    |   File "/usr/local/lib/python3.8/dist-packages/tritonclient/grpc/__init__.py", line 61, in raise_error_grpc
triton-client    |     raise get_error_grpc(rpc_error) from None
triton-client    | tritonclient.utils.InferenceServerException: [StatusCode.INVALID_ARGUMENT] failed to register CUDA shared memory region 'output0_data'

The curious thing happens when I run the script from inside the container. If you run the container with docker-compose run triton-client bash and then, from the terminal inside the container, execute `python3 simple_grpc_cudashm_client.py --verbose`, the client works as expected without errors. This is the output generated in that case:

unregister_system_shared_memory, metadata ()

Unregistered all system shared memory regions
unregister_cuda_shared_memory, metadata ()

Unregistered all cuda shared memory regions
register_cuda_shared_memory, metadata ()
name: "output0_data"
raw_handle: " 3\225\001\000\000\000\000\t\000\000\000\000\000\000\000\000\0002\000\000\000\000\000 \003\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\0001\000\000\000\000\000\000\000\304\000\320\301\216\000\000\\\000\000\000\000"
byte_size: 3276800

Registered cuda shared memory with name 'output0_data'
register_cuda_shared_memory, metadata ()
name: "input0_data"
raw_handle: " 3\225\001\000\000\000\000\t\000\000\000\000\000\000\000\000\0002\000\000\000\000\000 \003\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\0002\000\000\000\000\000\000\000\304\000\320\301\220\000\000\\\000\000\000\000"
byte_size: 3276800

Registered Cuda shared memory with name 'input0_data'

It’s important to note that running `docker-compose run triton-client python3 simple_grpc_cudashm_client.py --verbose` also produces the same error.

Attachments

Script simple_grpc_cudashm_client.py

import os
import json
import sys
import argparse

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient import utils
import tritonclient.utils.cuda_shared_memory as cudashm

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-v',
                        '--verbose',
                        action="store_true",
                        required=False,
                        default=False,
                        help='Enable verbose output')

    parser.add_argument('-u',
                        '--url',
                        type=str,
                        required=False,
                        default='localhost:8001',
                        help='Inference server URL. Default is localhost:8001.')

    FLAGS = parser.parse_args()

    try:
        triton_client = grpcclient.InferenceServerClient(url=FLAGS.url,
                                                         verbose=FLAGS.verbose)
    except Exception as e:
        print("channel creation failed: " + str(e))
        sys.exit(1)


    triton_client.unregister_system_shared_memory()
    triton_client.unregister_cuda_shared_memory()

    model_name = "test"
    model_version = "latest"

    input_byte_size = 3276800  # 1600 x 512 x 4 bytes
    output_byte_size = input_byte_size

    shm_op0_handle = cudashm.create_shared_memory_region(
        "output0_data", output_byte_size, 0)
    
    triton_client.register_cuda_shared_memory(
        "output0_data", cudashm.get_raw_handle(shm_op0_handle), 0,
        output_byte_size)
    
    shm_ip0_handle = cudashm.create_shared_memory_region(
        "input0_data", input_byte_size, 0)

    triton_client.register_cuda_shared_memory(
        "input0_data", cudashm.get_raw_handle(shm_ip0_handle), 0,
        input_byte_size)
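
For reference, the script registers the regions but never releases them; a minimal cleanup sketch that could be appended at the end (not part of the attached script, shown only for illustration) would be:

    # Illustrative cleanup, not in the original script: unregister the
    # regions on the server and free the CUDA allocations on the client.
    triton_client.unregister_cuda_shared_memory(name="input0_data")
    triton_client.unregister_cuda_shared_memory(name="output0_data")
    cudashm.destroy_shared_memory_region(shm_ip0_handle)
    cudashm.destroy_shared_memory_region(shm_op0_handle)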

Does anyone have an idea why this happens when launching the triton-client with docker-compose up?

Thanks

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 42 (12 by maintainers)

Most upvoted comments

I was able to root cause this issue. The problem is that cudaIpcOpenMemHandle returns an invalid-context error if the handle’s source PID matches the destination process PID. With Docker, each container has its own process ID namespace, which allows the same process IDs to be reused across containers. When interactive mode is used, the PIDs happen to differ and there is no issue.

The fix is to add the --pid host flag to both of the containers. I was able to reproduce the issue locally and confirm that adding the flag fixes it.
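For the docker-compose setup above, the equivalent is adding pid: 'host' to both services. A minimal sketch of the relevant lines (everything else stays as in the original compose file):

services:
  triton-server:
    # share the host PID namespace so the IPC handle's source PID and the
    # destination process PID are never the same
    pid: 'host'
    ...
  triton-client:
    pid: 'host'
    ...

For plain docker run, as in the reproduction commands later in this thread, the same effect comes from passing --pid host to both containers.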

Hi @rmccorm4

I’ve been able to reproduce this issue using the ‘simple’ model, running the docker image nvcr.io/nvidia/tritonserver:22.04-py3. It happens when I run both the server and client in non-interactive mode.

To reproduce, cd to server/docs/examples/model_repository (I ran fetch_models.sh instead of deleting the rest)

Start the triton-server

docker run --gpus all --network host -v $PWD:/mnt nvcr.io/nvidia/tritonserver:22.04-py3 tritonserver --model-repository /mnt

Start the triton-client test

docker run --gpus all --network host nvcr.io/nvidia/tritonserver:22.04-py3-sdk python3 /workspace/install/python/simple_grpc_cudashm_client.py

This fails with the error message:

Traceback (most recent call last):
  File "/workspace/install/python/simple_grpc_cudashm_client.py", line 93, in <module>
    triton_client.register_cuda_shared_memory(
  File "/usr/local/lib/python3.8/dist-packages/tritonclient/grpc/__init__.py", line 1133, in register_cuda_shared_memory
    raise_error_grpc(rpc_error)
  File "/usr/local/lib/python3.8/dist-packages/tritonclient/grpc/__init__.py", line 62, in raise_error_grpc
    raise get_error_grpc(rpc_error) from None
tritonclient.utils.InferenceServerException: [StatusCode.INVALID_ARGUMENT] failed to register CUDA shared memory region 'output0_data'

If I start either the server or the triton client in interactive mode, the test passes. For example, I could start the server like this:

docker run -ti --gpus all --network host -v $PWD:/mnt nvcr.io/nvidia/tritonserver:22.04-py3

# Now in docker container
root@<NAME>:/opt/tritonserver# tritonserver --model-repository=/mnt

I have verified this behavior on two machines I am working with.

If it helps, here’s my configuration:

  • NVIDIA driver: 510.73.08, CUDA version 11.6
  • nvidia-docker2: 2.10.0-1
  • OS: Ubuntu 20.04

Hi @Tabrizian, I am able to reproduce this even with the server and the client using the `--ipc host` flag.

Server:

docker run --gpus all --ipc host --network host -v $PWD:/mnt nvcr.io/nvidia/tritonserver:22.04-py3 tritonserver --model-repository /mnt

Client:

docker run --gpus all --ipc host --network host nvcr.io/nvidia/tritonserver:22.04-py3-sdk python3 /workspace/install/python/simple_grpc_cudashm_client.py

This happens even though I have the latest versions of the drivers and Docker; my OS is Ubuntu 18.04.

I suspect something along the lines of cgroups, but would you mind sharing, @Tabrizian, what tools or method you used to debug this?

I have looked at journalctl -f and dmesg and found no issues whatsoever. Any help?

@Tabrizian So I would need to run both the Triton server and the client on a Linux machine. Docker containers and/or WSL will not work?

I’m running the Triton server nvcr.io/nvidia/tritonserver:23.04-py3 and the client nvcr.io/nvidia/tritonserver:23.04-py3-sdk as Docker containers on my Windows machine. When I try to run simple_grpc_cudashm_client.py, I get this error:

Traceback (most recent call last):
  File "yolov7/simple_grpc_cudashm_client.py", line 88, in <module>
    triton_client.register_cuda_shared_memory(
  File "/usr/local/lib/python3.8/dist-packages/tritonclient/grpc/__init__.py", line 1300, in register_cuda_shared_memory
    raise_error_grpc(rpc_error)
  File "/usr/local/lib/python3.8/dist-packages/tritonclient/grpc/__init__.py", line 75, in raise_error_grpc
    raise get_error_grpc(rpc_error) from None
tritonclient.utils.InferenceServerException: [StatusCode.INVALID_ARGUMENT] failed to register CUDA shared memory region 'output0_data': failed to open CUDA IPC handle: invalid resource handle

I haven’t changed any of the code. Here are my docker startup commands:

docker run --gpus all --rm --ipc=host --pid=host --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v C:\Users\media\Documents\models\triton-model-repository:/models nvcr.io/nvidia/tritonserver:23.04-py3 tritonserver --model-repository=/models --log-verbose 5
docker run -it --rm -v C:\Users\media\Documents\triton-inference:/code --gpus all nvcr.io/nvidia/tritonserver:23.04-py3-sdk

The server reports the simple model as ready.

Any help would be appreciated.