serve: Model not loaded - MODEL_LOG - what(): std::bad_alloc - Torchserve version: 0.5.2

šŸ› Describe the bug

torch-model-archiver: Name: torch-model-archiver, Version: 0.6.0

The command used to create the MAR file:

torch-model-archiver --model-name bodypartrec --version 1.0 --serialized-file model.pt --handler model_handler.py --requirements-file requir/requirements.txt
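The requirements file passed via --requirements-file is not shown in the report; based on the comments further below it contains the MONAI dependency, so an assumed sketch of requir/requirements.txt would be:

monai==0.8.1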

KServe InferenceService manifest:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  namespace: "kserve-test"
  name: "my-model"
spec:
  predictor:
    serviceAccountName: kserve-sa
    pytorch:
      protocolVersion: "v2"
      storageUri: "s3://…/kserve/"
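After applying this manifest, the deployment status can be checked with standard kubectl commands (name and namespace taken from the manifest above):

kubectl get inferenceservice my-model -n kserve-test
kubectl get pods -n kserve-test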

When TorchServe is started by KServe, the worker fails as shown in the logs below: essentially, "what(): std::bad_alloc" is thrown. The logs give no clue on how to proceed. I saw that this issue is labeled as a bug, but I don't know what triggers it. Any help is really appreciated!

Error logs

Defaulted container "kserve-container" out of: kserve-container, queue-proxy, storage-initializer (init)
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
/usr/local/lib/python3.8/dist-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via pip install 'ray[default]'. Please update your install command.
  warnings.warn(
2022-06-01T13:52:17,164 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2022-06-01T13:52:17,276 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.5.2
TS Home: /usr/local/lib/python3.8/dist-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 24
Max heap size: 512 M
Python executable: /usr/bin/python
Config file: /mnt/models/config/config.properties
Inference address: http://0.0.0.0:8085
Management address: http://0.0.0.0:8085
Metrics address: http://0.0.0.0:8082
Model Store: /mnt/models/model-store
Initial Models: N/A
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 4
Netty client threads: 0
Default workers per model: 24
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /mnt/models/model-store
Model config: N/A
2022-06-01T13:52:17,283 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
2022-06-01T13:52:17,321 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Started restoring models from snapshot {"name":"startup.cfg","modelCount":1,"models":{"my-model":{"1.0":{"defaultVersion":true,"marName":"bodypartrec.mar","minWorkers":1,"maxWorkers":5,"batchSize":1,"responseTimeout":120,"maxBatchDelay":50}}}}
2022-06-01T13:52:17,328 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Validating snapshot startup.cfg
2022-06-01T13:52:17,329 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Snapshot startup.cfg validated successfully
[I 220601 13:52:17 main:75] Wrapper : Model names ['my-model'], inference address http://0.0.0.0:8085, management address http://0.0.0.0:8085, model store /mnt/models/model-store
[I 220601 13:52:17 TorchserveModel:54] kfmodel Predict URL set to 0.0.0.0:8085
[I 220601 13:52:17 TorchserveModel:56] kfmodel Explain URL set to 0.0.0.0:8085
[I 220601 13:52:17 TSModelRepository:30] TSModelRepo is initialized
[I 220601 13:52:17 kfserver:150] Registering model: my-model
[I 220601 13:52:17 kfserver:120] Setting asyncio max_workers as 28
[I 220601 13:52:17 kfserver:127] Listening on port 8080
[I 220601 13:52:17 kfserver:129] Will fork 1 workers
2022-06-01T13:52:17,671 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model my-model
2022-06-01T13:52:17,671 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model my-model
2022-06-01T13:52:50,088 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model my-model
2022-06-01T13:52:50,088 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model my-model loaded.
2022-06-01T13:52:50,088 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: my-model, count: 1 2022-06-01T13:52:50,096 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000] 2022-06-01T13:52:50,097 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel. 2022-06-01T13:52:50,158 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8085 2022-06-01T13:52:50,159 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel. 2022-06-01T13:52:50,160 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082 Model server started. 2022-06-01T13:52:50,363 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet. 2022-06-01T13:52:50,415 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:40.0|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570 2022-06-01T13:52:50,417 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:94.23827743530273|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570 2022-06-01T13:52:50,417 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:774.5930213928223|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570 2022-06-01T13:52:50,418 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:89.2|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570 2022-06-01T13:52:50,418 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:23591.0078125|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570 2022-06-01T13:52:50,418 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:8041.05078125|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570 2022-06-01T13:52:50,419 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:26.5|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570 2022-06-01T13:52:50,766 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:52:50,767 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]50 2022-06-01T13:52:50,767 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started. 2022-06-01T13:52:50,767 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0 2022-06-01T13:52:50,768 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change null -> WORKER_STARTED 2022-06-01T13:52:50,774 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:52:50,784 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. 2022-06-01T13:52:50,787 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. 
to backend at: 1654091570787 2022-06-01T13:52:50,819 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1 2022-06-01T13:52:50,888 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of ā€˜std::bad_alloc’ 2022-06-01T13:52:50,889 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc 2022-06-01T13:52:58,066 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr 2022-06-01T13:52:58,066 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout 2022-06-01T13:52:58,067 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED 2022-06-01T13:52:58,080 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED 2022-06-01T13:52:58,080 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died. java.lang.InterruptedException: null at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?] at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?] at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?] at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?] at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] 2022-06-01T13:52:58,090 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died. 2022-06-01T13:52:58,091 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED 2022-06-01T13:52:58,091 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr 2022-06-01T13:52:58,091 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout 2022-06-01T13:52:58,092 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds. 2022-06-01T13:52:59,092 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000] 2022-06-01T13:52:59,812 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:52:59,813 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]107 2022-06-01T13:52:59,813 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started. 
2022-06-01T13:52:59,813 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STOPPED -> WORKER_STARTED 2022-06-01T13:52:59,813 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0 2022-06-01T13:52:59,813 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:52:59,816 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654091579815 2022-06-01T13:52:59,816 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. 2022-06-01T13:52:59,836 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1 2022-06-01T13:52:59,912 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of ā€˜std::bad_alloc’ 2022-06-01T13:52:59,913 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc 2022-06-01T13:53:03,968 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr 2022-06-01T13:53:03,969 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout 2022-06-01T13:53:03,969 [INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED 2022-06-01T13:53:03,972 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED 2022-06-01T13:53:03,973 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died. java.lang.InterruptedException: null at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?] at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?] at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?] at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] 2022-06-01T13:53:03,973 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died. 2022-06-01T13:53:03,973 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED 2022-06-01T13:53:03,974 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr 2022-06-01T13:53:03,974 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout 2022-06-01T13:53:03,974 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds. 
2022-06-01T13:53:04,974 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000] 2022-06-01T13:53:05,584 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:53:05,585 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]135 2022-06-01T13:53:05,585 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started. 2022-06-01T13:53:05,585 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STOPPED -> WORKER_STARTED 2022-06-01T13:53:05,585 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0 2022-06-01T13:53:05,586 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:53:05,588 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654091585588 2022-06-01T13:53:05,588 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. 2022-06-01T13:53:05,608 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1 2022-06-01T13:53:05,680 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of ā€˜std::bad_alloc’ 2022-06-01T13:53:05,681 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc 2022-06-01T13:53:09,625 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr 2022-06-01T13:53:09,625 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout 2022-06-01T13:53:09,625 [INFO ] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED 2022-06-01T13:53:09,632 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED 2022-06-01T13:53:09,632 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died. java.lang.InterruptedException: null at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?] at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?] at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?] at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] 2022-06-01T13:53:09,633 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died. 
2022-06-01T13:53:09,633 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED 2022-06-01T13:53:09,633 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr 2022-06-01T13:53:09,633 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout 2022-06-01T13:53:09,633 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 2 seconds. 2022-06-01T13:53:11,634 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000] 2022-06-01T13:53:12,255 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:53:12,256 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]163 2022-06-01T13:53:12,256 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started. 2022-06-01T13:53:12,256 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0 2022-06-01T13:53:12,256 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STOPPED -> WORKER_STARTED 2022-06-01T13:53:12,257 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:53:12,259 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654091592259 2022-06-01T13:53:12,259 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. 2022-06-01T13:53:12,284 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1 2022-06-01T13:53:12,361 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of ā€˜std::bad_alloc’ 2022-06-01T13:53:12,362 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc 2022-06-01T13:53:16,966 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr 2022-06-01T13:53:16,966 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout 2022-06-01T13:53:16,966 [INFO ] epollEventLoopGroup-5-4 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED 2022-06-01T13:53:16,973 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED 2022-06-01T13:53:16,973 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died. java.lang.InterruptedException: null at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?] at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?] at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?] at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] 
2022-06-01T13:53:16,974 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died. 2022-06-01T13:53:16,974 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED 2022-06-01T13:53:16,974 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr 2022-06-01T13:53:16,974 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout 2022-06-01T13:53:16,974 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 3 seconds. 2022-06-01T13:53:19,975 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000] 2022-06-01T13:53:20,544 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:53:20,545 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]191 2022-06-01T13:53:20,545 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started. 2022-06-01T13:53:20,545 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0 2022-06-01T13:53:20,545 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STOPPED -> WORKER_STARTED 2022-06-01T13:53:20,545 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:53:20,547 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654091600547 2022-06-01T13:53:20,547 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. 2022-06-01T13:53:20,569 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1 2022-06-01T13:53:20,644 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of ā€˜std::bad_alloc’ 2022-06-01T13:53:20,645 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc 2022-06-01T13:53:25,278 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr 2022-06-01T13:53:25,278 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout 2022-06-01T13:53:25,279 [INFO ] epollEventLoopGroup-5-5 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED 2022-06-01T13:53:25,286 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED 2022-06-01T13:53:25,286 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died. java.lang.InterruptedException: null at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?] at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?] at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?] at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] 2022-06-01T13:53:25,287 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died. 2022-06-01T13:53:25,287 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED 2022-06-01T13:53:25,287 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr 2022-06-01T13:53:25,287 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout 2022-06-01T13:53:25,288 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 5 seconds. 2022-06-01T13:53:30,288 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000] 2022-06-01T13:53:31,052 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:53:31,053 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]219 2022-06-01T13:53:31,053 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started. 2022-06-01T13:53:31,053 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0 2022-06-01T13:53:31,053 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STOPPED -> WORKER_STARTED 2022-06-01T13:53:31,053 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 2022-06-01T13:53:31,055 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654091611055 2022-06-01T13:53:31,055 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. 2022-06-01T13:53:31,056 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1 2022-06-01T13:53:31,130 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of ā€˜std::bad_alloc’ 2022-06-01T13:53:31,131 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc 2022-06-01T13:53:35,936 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout 2022-06-01T13:53:35,936 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr 2022-06-01T13:53:35,936 [INFO ] epollEventLoopGroup-5-6 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED 2022-06-01T13:53:35,941 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED 2022-06-01T13:53:35,941 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died. java.lang.InterruptedException: null at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?] at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?] at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?] 
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:829) [?:?] 2022-06-01T13:53:35,941 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died. 2022-06-01T13:53:35,941 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED 2022-06-01T13:53:35,942 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr 2022-06-01T13:53:35,942 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout 2022-06-01T13:53:35,942 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 8 seconds.

Installation instructions

No; TorchServe is running inside KServe.

Model Packaging

Step by step with torch-model-archiver (command shown above).

config.properties

install_py_dep_per_model=true
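Only the line above is quoted from config.properties; per the startup log, the full file is mounted at /mnt/models/config/config.properties. A sketch of what it likely contains, with values copied from that log (everything except install_py_dep_per_model=true is inferred, not confirmed by the reporter):

inference_address=http://0.0.0.0:8085
management_address=http://0.0.0.0:8085
metrics_address=http://0.0.0.0:8082
model_store=/mnt/models/model-store
install_py_dep_per_model=true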

Versions

Name: torch Version: 1.9.0+cu111

Name: torch-model-archiver Version: 0.6.0

Repro instructions

kubectl apply -f bodypart.yaml -n kserve-test

bodypart.yaml:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  namespace: "kserve-test"
  name: "my-model"
spec:
  predictor:
    serviceAccountName: kserve-sa
    pytorch:
      protocolVersion: "v2"
      storageUri: "s3://…/kserve/"
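To inspect the failure after applying the manifest, pull the logs of the predictor pod's kserve-container (the pod name here is copied from the metrics lines in the error logs above; it will differ per deployment):

kubectl logs my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps -n kserve-test -c kserve-container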

Possible Solution

No response

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 17 (4 by maintainers)

Most upvoted comments

Thank you! It works now 😃 It was my mistake: I was referring to kserve/ instead of pytorch/…
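If "kserve/ instead of pytorch/" refers to the storageUri prefix (s3://…/kserve/ in the manifest above), then the fix was pointing storageUri at the prefix that actually holds the layout the startup log reads from (config/config.properties and model-store/); a sketch of that expected structure, with bucket and prefix names illustrative:

s3://<bucket>/<correct-prefix>/
  config/
    config.properties
  model-store/
    bodypartrec.mar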

No, I couldn't reproduce it without KServe. I tried different Python environments (3.8-3.10) and everything worked fine; the error only happened with KServe.

So I found the trigger. I'm using KServe, and TorchServe is started inside the KServe environment. When I created the MAR file, I passed the monai==0.8.1 dependency via --requirements-file requirements.txt, and I also set install_py_dep_per_model=true in config.properties. A MAR file created this way eventually led to the error.

When I created the MAR file without any requirements file, it worked properly with KServe; otherwise, as explained, std::bad_alloc occurred with TorchServe 0.5.x and 0.6.0, and a different error with older TorchServe versions.
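For reference, the working packaging described above is simply the archiver invocation from the bug description without the requirements file; this is the reporter's workaround, not a confirmed fix of the underlying bug:

torch-model-archiver --model-name bodypartrec --version 1.0 --serialized-file model.pt --handler model_handler.py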

So I'm guessing this is due to dependency version conflicts, but I'm not sure.