serve: TorchServe Docker image fails to run on an existing mar file

Context

I ran torch-model-archiver on a different machine to create a mar file with a custom handler for a transformer model, using this command:

torch-model-archiver --model-name TranslationClassifier --version 1.0 --serialized-file /home/ayush/transformer_model/pytorch_model.bin --handler ./translation_model/text_handler.py --extra-files "./transformer_model/config.json,./transformer_model/special_tokens_map.json,./transformer_model/tokenizer_config.json,./transformer_model/sentencepiece.bpe.model"

It took about 20 minutes and the mar file was created correctly. I was able to verify locally that TorchServe indeed works on that system using the following command:

torchserve --start --model-store model_store --models my_tc=TranslationClassifier.mar
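
As a side note for anyone checking the same thing: a mar file is just a zip archive, so the handler entry it records can be inspected directly. A quick check (assuming the archive sits in model_store/) would be:

unzip -p model_store/TranslationClassifier.mar MAR-INF/MANIFEST.json

The handler field under model should point to the custom text_handler rather than one of the built-in ts.torch_handler names.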

Expected Behavior

In order to run this on Kubernetes, I took the pre-existing pytorch/torchserve:latest-gpu image from Docker Hub, so that I could run in a different environment by leveraging the mar file directly, using this command:

sudo docker run -p 8080:8080 -p 8081:8081 -p 8082:8082 -p 7070:7070 -p 7071:7071 --mount type=bind,source=/home/ayush,target=/home/ayush/model_store pytorch/torchserve:latest-gpu torchserve --model-store /home/ayush/model_store --models my_tc=TranslationClassifier.mar
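
One note on the mount: with type=bind,source=/home/ayush,target=/home/ayush/model_store, the mar file has to sit directly under /home/ayush on the host for --model-store to see it. A variant that mounts the model-store directory itself might look like this (only a sketch; it assumes the mar was copied to /home/ayush/model_store on the host, and the container-side path is an arbitrary choice):

sudo docker run -p 8080:8080 -p 8081:8081 --mount type=bind,source=/home/ayush/model_store,target=/home/model-server/model-store pytorch/torchserve:latest-gpu torchserve --model-store /home/model-server/model-store --models my_tc=TranslationClassifier.mar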

Current Behavior

The execution fails when running that docker container with the following error logs:

2021-03-12 21:13:43,128 [INFO ] W-9002-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 182, in <module>
2021-03-12 21:13:43,128 [INFO ] W-9002-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     worker.run_server()
2021-03-12 21:13:43,128 [INFO ] W-9002-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 154, in run_server
2021-03-12 21:13:43,128 [INFO ] W-9002-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     self.handle_connection(cl_socket)
2021-03-12 21:13:43,128 [INFO ] W-9002-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 116, in handle_connection
2021-03-12 21:13:43,129 [INFO ] W-9002-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     service, result, code = self.load_model(msg)
2021-03-12 21:13:43,129 [INFO ] W-9002-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 89, in load_model
2021-03-12 21:13:43,129 [INFO ] W-9002-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     service = model_loader.load(model_name, model_dir, handler, gpu, batch_size, envelope)
2021-03-12 21:13:43,129 [INFO ] W-9002-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/ts/model_loader.py", line 83, in load
2021-03-12 21:13:43,130 [INFO ] W-9002-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     module = self._load_default_handler(handler)
2021-03-12 21:13:43,130 [INFO ] W-9002-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/usr/local/lib/python3.6/dist-packages/ts/model_loader.py", line 120, in _load_default_handler
2021-03-12 21:13:43,130 [INFO ] W-9002-my_tc_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     module = importlib.import_module(module_name, 'ts.torch_handler')
2021-03-12 21:13:43,131 [INFO ] epollEventLoopGroup-5-7 org.pytorch.serve.wlm.WorkerThread - 9002 Worker disconnected. WORKER_STARTED

Steps to Reproduce

  1. Run torch-model-archiver on a model and copy the resulting mar file to a different machine
  2. Run the Docker image for TorchServe with that existing mar file. It looks like it is unable to find the custom handler that was used when running torch-model-archiver. My understanding is that the mar file should have captured this information, so that running torchserve --start --model-store model_store --models my_tc=TranslationClassifier.mar in a different environment should work out of the box and not fail to recognize the custom handler … (a quick sanity check for this is sketched right after this list)
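
A quick sanity check for step 2 (only a sketch, reusing the mount from the command above): open a shell in the image and confirm the mar file is actually visible at the path passed to --model-store, and list the archive contents on the host to confirm text_handler.py was packaged:

sudo docker run --rm -it --entrypoint /bin/bash --mount type=bind,source=/home/ayush,target=/home/ayush/model_store pytorch/torchserve:latest-gpu
ls /home/ayush/model_store                # run inside the container shell
unzip -l TranslationClassifier.mar       # run on the host, next to the mar file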

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 10
  • Comments: 15

Most upvoted comments

@ayushch3 @kqhuynguyen please add install_py_dep_per_model=true in config.properties if your model needs to install packages, and then copy or mount that config.properties into your Docker container.
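
For reference, a minimal sketch of what that might look like. The config.properties would contain the line below (this assumes the model's Python dependencies were listed in a requirements file passed to torch-model-archiver via --requirements-file; otherwise there is nothing for TorchServe to install):

install_py_dep_per_model=true

The container could then be started with that file mounted and passed via --ts-config (the host paths here are hypothetical):

sudo docker run -p 8080:8080 -p 8081:8081 --mount type=bind,source=/home/ayush/model_store,target=/home/model-server/model-store --mount type=bind,source=/home/ayush/config.properties,target=/home/model-server/config.properties pytorch/torchserve:latest-gpu torchserve --model-store /home/model-server/model-store --ts-config /home/model-server/config.properties --models my_tc=TranslationClassifier.mar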

@dhanainme Just to clarify, all these issues only occur when trying to run TorchServe inside a Docker image; there are no issues when running on a standalone Ubuntu system. The Docker image is necessary to deploy this as a microservice, but torchserve just fails without emitting any failure logs, so it’s hard to debug what’s going wrong.
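
In case it helps with the missing-logs part: even when nothing useful shows up on stdout, the worker and frontend logs can usually still be pulled out of the running container, for example (the container ID is a placeholder, and the log location assumes the default logging setup of the official image):

sudo docker logs <container_id>
sudo docker exec -it <container_id> ls /home/model-server/logs
sudo docker exec -it <container_id> cat /home/model-server/logs/model_log.log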