serve: Worker thread stuck in die state

šŸ› Describe the bug

We encountered a problem similar to https://github.com/pytorch/serve/issues/1531, and it happens quite often.

See logs below. We have two workers (9000, 9001) for this model. After worker 9000 hit an exception, it got stuck in an unknown state: it didn't terminate itself, so no replacement worker could be added automatically, but at the same time it stopped receiving incoming traffic. That essentially means we only have one working worker (9001) now.

The problem is that this worker is stuck: it did not destroy itself (successfully), yet it cannot receive any traffic. It is still counted as an active worker, so TorchServe won't add more workers (because the current worker count is already 2). Normally the worker would die and TorchServe would retry it (e.g. we would see Retry worker: 9001 in 1 seconds. in the log).

If I curl the management API, it still shows both workers as healthy.
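For reference, a describe-model call along these lines (assuming the default management port 8081 and the model name easyocr) still lists both workers as healthy even though 9000 no longer serves traffic:

# Assumption: default management port 8081, model registered as "easyocr"
curl http://localhost:8081/models/easyocr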

Error logs

Worker 9000 died because of an exception. It produced no log output after 2022-08-25 21:21:14.056 PDT. Selected logs:

[WARN ] W-9000-easyocr_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker thread exception.

io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
    at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
    at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
    at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
    at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) ~[model-server.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]

[INFO ] W-9000-easyocr_1.0-stdout MODEL_LOG - Frontend disconnected.
[INFO ] W-9000-easyocr_1.0 ACCESS_LOG - /127.0.0.1:40592 "gRPC org.pytorch.serve.grpc.inference.InferenceAPIsService/Predictions HTTP/2.0" 13 109
[INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED

[INFO ] W-9000-easyocr_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-easyocr_1.0-stderr


Installation instructions

N/A unrelated

Model Packaging

N/A unrelated

config.properties

No response

Versions

Used the 0.6.0-gpu Docker image.

Repro instructions

N/A

Possible Solution

No response

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 30 (5 by maintainers)

Most upvoted comments

@msaroufim @lxning

I can successfully repro it locally with the following setup.

setup

  • in config.properties, set min/max workers to 2 (a config sketch follows the setup steps below)
  • start TorchServe locally and wait until it is fully up
  • use a simple Python gRPC client:
import argparse
from datetime import datetime

import grpc

import inference_pb2
import inference_pb2_grpc


parser = argparse.ArgumentParser()
parser.add_argument("--timeout", required=True, help="per-request gRPC deadline in seconds")
parser.add_argument("--port", required=True, help="TorchServe gRPC inference port")
parser.add_argument("-n", required=True, help="number of requests to send")


if __name__ == '__main__':
    args, _ = parser.parse_known_args()
    port = args.port
    num_runs = int(args.n)

    # Build one prediction request from a local test image and reuse it for every call.
    with open('image.jpg', 'rb') as f:
        data = f.read()
    input_data = {'data': data}
    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input=input_data)

    channel = grpc.insecure_channel(f'localhost:{port}')
    stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)

    # Fire the same request repeatedly; with a short --timeout the client gives up
    # (gRPC cancels the call) before the backend finishes, which is what triggers the bug.
    for i in range(num_runs):
        try:
            start = datetime.now()
            response = stub.Predictions(request, timeout=float(args.timeout))
            print("request time:", datetime.now() - start)
            print(response)
        except Exception as e:
            print(e)

  • in one shell tab, run the client with a normal timeout (2 s) and a very large number of requests, simulating a consistent stream of normal requests:
python3 torchserve_grpc_client.py --port=17100 --timeout=2 -n=100000000
  • in another shell tab, run the client with a very short timeout (0.02 s), simulating heavy/bad requests that will eventually time out:
python3 torchserve_grpc_client.py --port=17100 --timeout=0.02 -n=100
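For completeness, a minimal config.properties in the spirit of the setup above could look like the sketch below. The gRPC port matches the commands above, but the other keys and values (model store path, mar file name) are assumptions rather than the exact file used:

# sketch only -- assumed model store path and mar name, not the original file
grpc_inference_port=17100
model_store=/home/model-server/model-store
load_models=easyocr.mar
models={\
  "easyocr": {\
    "1.0": {\
      "marName": "easyocr.mar",\
      "minWorkers": 2,\
      "maxWorkers": 2\
    }\
  }\
}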

Result

Sometimes it takes TorchServe several seconds to several minutes to resolve the issue, after which the tab 1 simulation's output goes back to normal; but it is also likely that all workers get stuck forever and tab 1 sees a flood of errors.

@hgong-snap Confirming we can finally repro; will get back to you with a solution soon.

@hgong-snap Great, thank you for the verification.

@lxning Verified on pytorch/torchserve-nightly:latest-gpu and the issue seems to be solved!

@hgong-snap I verified the master branch on both the local host and the Docker nightly build.

You can fetch torchserve nightly build at https://hub.docker.com/r/pytorch/torchserve-nightly/tags.

  • docker pull pytorch/torchserve-nightly:latest-gpu
  • docker pull pytorch/torchserve-nightly:latest-cpu

For the repro, I guess the best way is to cancel the request on the client side, or to set a short enough gRPC timeout so that gRPC cancels the request internally?
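For example, on the Python side both variants come down to something like the sketch below (it reuses the generated stubs from the repro script above; the port and file names are the same assumptions as before):

import grpc

import inference_pb2
import inference_pb2_grpc

channel = grpc.insecure_channel("localhost:17100")  # assumed gRPC inference port from the repro
stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)

with open("image.jpg", "rb") as f:
    request = inference_pb2.PredictionsRequest(model_name="easyocr", input={"data": f.read()})

# Variant 1: set a deadline shorter than the model latency; gRPC cancels the call for us.
try:
    stub.Predictions(request, timeout=0.02)
except grpc.RpcError as e:
    print(e.code())  # DEADLINE_EXCEEDED on the client; the server sees the call as cancelled

# Variant 2: start the call asynchronously and cancel it explicitly from the client side.
future = stub.Predictions.future(request)
future.cancel()
print(future.cancelled())

Either way the backend ends up responding to an already-cancelled call, which matches the CANCELLED: call already cancelled stack trace in the logs above.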

With that said,

  1. I can't tell whether a worker is in this dead/stuck state or not.
  2. Even if I know a worker has died, I can't tell which one died, nor kill/restart that specific worker.

So basically there's really no way to catch this "worker died but stuck" error?
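One possible external workaround (not a TorchServe feature, just a rough sketch with assumed default ports and model name) is a watchdog that probes the model with a real inference call under a deadline and, when probes keep failing even though the management API still reports healthy workers, bounces the whole worker pool through the scale-worker management call. It cannot target the specific stuck worker, for exactly the reason in point 2 above:

import time

import requests

INFER = "http://localhost:8080"   # assumed default inference port
MGMT = "http://localhost:8081"    # assumed default management port
MODEL = "easyocr"

def probe_ok() -> bool:
    # Liveness probe: a cheap real prediction with a short deadline.
    try:
        with open("image.jpg", "rb") as f:
            return requests.post(f"{INFER}/predictions/{MODEL}", data=f.read(), timeout=5).ok
    except requests.RequestException:
        return False

failures = 0
while True:
    failures = 0 if probe_ok() else failures + 1
    if failures >= 3:
        # The management API still claims the workers are healthy, so force a restart
        # by scaling the pool down and back up (assumes scaling to 0 workers is permitted).
        requests.put(f"{MGMT}/models/{MODEL}", params={"min_worker": "0", "synchronous": "true"})
        requests.put(f"{MGMT}/models/{MODEL}", params={"min_worker": "2", "synchronous": "true"})
        failures = 0
    time.sleep(10)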