serve: Java index out of bounds exception when running many requests through server
Context
Trying to load test a TorchServe model with a custom handler to gauge its performance.
- torchserve version: 0.2.0
- torch version: 1.6.0
- java version: openjdk 11.0.8
- Operating System and version: Debian, via the python:3.7-buster Docker image.
Your Environment
- Are you planning to deploy it using docker container? [yes/no]: yes
- Is it a CPU or GPU environment?: CPU
- Using a default/custom handler? custom
- What kind of model is it e.g. vision, text, audio?: feed forward for custom input.
- Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.? from model store
- Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs: number_of_netty_threads=32 (full config.properties is included further down)
Expected Behavior
I expected TorchServe not to throw this error, or at least to understand which properties of the environment I could change to address it. It only seems to happen under moderate load.
Current Behavior
With a load of ~5 rps, and while varying the batch size and the CPU and memory allocations, the server throws this error on roughly 4% or more of requests.
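For reference, the load generator is roughly the sketch below; the model name and payload are placeholders rather than my actual setup.

```python
# Rough sketch of the load test -- model name and payload are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/predictions/my_model"   # hypothetical model name
PAYLOAD = {"features": [0.0] * 16}                    # hypothetical input body

def send_one():
    return requests.post(URL, json=PAYLOAD, timeout=10).status_code

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = []
    for _ in range(300):              # ~60 seconds of traffic at ~5 requests/sec
        futures.append(pool.submit(send_one))
        time.sleep(0.2)
    codes = [f.result() for f in futures]
    print(sum(c != 200 for c in codes), "of", len(codes), "requests failed")
```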
Failure Logs [if any]
2020-10-17 00:16:41,887 [INFO ] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - 9002 Worker disconnected. WORKER_MODEL_LOADED
2020-10-17 00:16:41,887 [ERROR] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - Unknown exception
io.netty.handler.codec.DecoderException: java.lang.IndexOutOfBoundsException: readerIndex(1021) + length(4) exceeds writerIndex(1024): PooledUnsafeDirectByteBuf(ridx: 1021, widx: 1024, cap: 1024)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:471)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:404)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:371)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:354)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.IndexOutOfBoundsException: readerIndex(1021) + length(4) exceeds writerIndex(1024): PooledUnsafeDirectByteBuf(ridx: 1021, widx: 1024, cap: 1024)
    at io.netty.buffer.AbstractByteBuf.checkReadableBytes0(AbstractByteBuf.java:1477)
    at io.netty.buffer.AbstractByteBuf.readInt(AbstractByteBuf.java:810)
    at org.pytorch.serve.util.codec.ModelResponseDecoder.decode(ModelResponseDecoder.java:56)
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501)
Thank you in advance for any help you can provide!
About this issue
- State: closed
- Created 4 years ago
- Comments: 38 (16 by maintainers)
@harshbafna I was debugging this further and observed the following: the Python backend sends the complete response for all of the batched requests, but when the frontend server receives it, it is fragmented. For example, in the scenario below, for a total response size of 500777, the message decoder receives the following fragments:
I suspect the issue is caused by incorrect decoding of these fragments. What are your thoughts on this? Shouldn't the reassembly of these fragments be done at a lower level, with the decoding then happening at the application level?
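To illustrate the failure mode (this is a simplified sketch, not TorchServe's actual ModelResponseDecoder): a length-prefixed decoder that reads a 4-byte length as soon as any bytes arrive will fail exactly like the exception above when the length field or body straddles two fragments; it has to keep buffering until a complete frame is available.

```python
# Simplified sketch (not TorchServe's actual codec) of why a length-prefixed
# decoder has to buffer partial frames instead of reading eagerly.
import struct

class FrameReassembler:
    """Accumulates arbitrary fragments and yields only complete frames."""

    def __init__(self):
        self.buf = bytearray()

    def feed(self, fragment: bytes):
        """Append a fragment; return whatever complete frames are now available."""
        self.buf.extend(fragment)
        frames = []
        while True:
            if len(self.buf) < 4:
                break                               # length prefix incomplete: wait
            (length,) = struct.unpack_from(">i", self.buf, 0)
            if len(self.buf) < 4 + length:
                break                               # frame body incomplete: wait
            frames.append(bytes(self.buf[4:4 + length]))
            del self.buf[:4 + length]
        return frames

# A 500777-byte payload split into 1024-byte fragments decodes cleanly,
# because nothing is read until the whole frame has arrived.
payload = b"x" * 500777
wire = struct.pack(">i", len(payload)) + payload
reassembler, frames = FrameReassembler(), []
for i in range(0, len(wire), 1024):
    frames.extend(reassembler.feed(wire[i:i + 1024]))
assert frames == [payload]
```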
@harshbafna
Input to test:
This should return the following response:
config.properties:
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store
So I re-ran with a batch size > 1 and the logging enabled. I am seeing some non-alphanumeric characters in the batch output: “:”, “)”, “(”, “>”, and “-” are all present. However, these are also present in the batch_size=1 run, so I'm not sure whether this could cause the issue. Other than that, the shapes of the outputs all look correct batch-wise.
When I run just a single request through the model, the one-item batch looks the same, with the 0-th dimension having length 1 instead of batch_size.
I am doing the dimension handling myself. For example, when handling the input data, I run a for-loop over the items in the batch in my pre-processing (assuming each item is a request body) and then torch.cat them along dimension 0. The output tensors then have shape [batch_size, D], which I pass along. It would be great if you could help bring this to closure 😃.
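Roughly, the pattern looks like the sketch below; the "features" field, tensor shapes, and model are illustrative placeholders rather than my actual handler.

```python
# Illustrative sketch of the batching pattern described above -- not the actual
# custom handler from this issue. The "features" field and model are placeholders.
import json

import torch

def preprocess(requests):
    """One tensor per request body, concatenated along dim 0 -> [batch_size, D]."""
    rows = []
    for req in requests:                          # each item is one request body
        body = req.get("data") or req.get("body")
        if isinstance(body, (bytes, bytearray)):
            body = json.loads(body)
        rows.append(torch.tensor(body["features"], dtype=torch.float32).unsqueeze(0))
    return torch.cat(rows, dim=0)                 # shape: [batch_size, D]

def postprocess(outputs):
    """Split [batch_size, D_out] back into exactly one response per request."""
    return [row.tolist() for row in outputs]

def handle(requests, model):
    batch = preprocess(requests)                  # [batch_size, D]
    with torch.no_grad():
        outputs = model(batch)                    # [batch_size, D_out]
    return postprocess(outputs)                   # len(result) == len(requests)
```

The one invariant worth highlighting in this pattern is that the returned list must contain exactly one element per request in the batch, since the frontend pairs responses back to requests positionally.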
@harshbafna good call on the batch size check… batch_size=1 appears to completely fix the issue. The model is small enough to run at sub-50 ms latency without batching. I'll test with the logging statement and batch_size > 1 next.
Fine to close this issue if you'd like to stop investigating for now, as this solves my immediate need. However, I'm also more than happy to keep helping to uncover what is causing the netty thread issue for larger batch sizes. Let me know.
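For reference, the batch size discussed above is configured when the model is registered through the TorchServe management API; a minimal sketch, with an illustrative archive name and values:

```python
# Minimal sketch of registering the model with batching enabled via the
# TorchServe management API -- the archive name and values are illustrative.
import requests

params = {
    "url": "my_model.mar",      # hypothetical archive in the model store
    "initial_workers": 1,
    "batch_size": 8,            # setting this to 1 is what avoids the error above
    "max_batch_delay": 50,      # max milliseconds to wait while filling a batch
}
resp = requests.post("http://localhost:8081/models", params=params)
print(resp.status_code, resp.text)
```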