bookkeeper: Bookkeeper crashes under high load (TooLongFrameException)
BUG REPORT
Describe the bug
We have been consistently observing (last months) that, under high load situations, Bookies (v4.9.2) may permanently crash. At first glance, the main reason for the problem seems that at some point, messages from Bookkeeper client (v4.9.2) could be malformed/corrupted, leading the Bookie to crash due to unknown op codes (IllegalStateException
) and Netty frames being too long (TooLongFrameException
).
To Reproduce
Whenever we sustain a load close to the max Bookie capacity for a sufficient amount of time, we often find this problem. We have seen it in different environments and with diverse types of drives. Note that the problems is not deterministic, so it may take some time (minutes?) to appear (e.g., depends if the Bookie is also busy with other tasks, like garbage collection), but eventually it appears.
Expected behavior
We expect 3 main behaviors from Bookkeeper:
- If the problem is related to some kind of data malformation/corruption, then it needs to be fixed.
- A back-pressure mechanism to prevent crashes and adjust the client’s write throughput to what the server can absorb.
- In case of a malformed message (i.e.,
TooLongFrameException
), we would expect the server to just discard the message and continue working as usual.
Screenshots n/a
Additional context This problem normally is seen in our benchmark experiments when we induce loads to Bookkeeper close to its maximum capacity. In this scenario, instead of inducing enough back-pressure to the client, what we observe is a problem as follows:
Nov 24 12:18:30 ip-10-0-0-119.us-west-2.compute.internal bookkeeper[17271]: 2020-11-24 12:18:30,193 - ERROR - [bookie-io-1-2:BookieRequestHandler@70] - Unhandled exception occurred in I/O thread or handler
Nov 24 12:18:30 ip-10-0-0-119.us-west-2.compute.internal bookkeeper[17271]: java.lang.IllegalStateException: Received unknown request op code = 84
Nov 24 12:18:30 ip-10-0-0-119.us-west-2.compute.internal bookkeeper[17271]: at org.apache.bookkeeper.proto.BookieProtoEncoding$RequestEnDeCoderPreV3.decode(BookieProtoEncoding.java:194)
Nov 24 12:18:30 ip-10-0-0-119.us-west-2.compute.internal bookkeeper[17271]: at org.apache.bookkeeper.proto.BookieProtoEncoding$RequestDecoder.channelRead(BookieProtoEncoding.java:436)
...
Nov 24 12:18:30 ip-10-0-0-119.us-west-2.compute.internal bookkeeper[17271]: 2020-11-24 12:18:30,196 - ERROR - [bookie-io-1-2:BookieRequestHandler@70] - Unhandled exception occurred in I/O thread or handler
Nov 24 12:18:30 ip-10-0-0-119.us-west-2.compute.internal bookkeeper[17271]: io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 5242880: 1556335546 - discarded
Nov 24 12:18:30 ip-10-0-0-119.us-west-2.compute.internal bookkeeper[17271]: at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:522)
Nov 24 12:18:30 ip-10-0-0-119.us-west-2.compute.internal bookkeeper[17271]: at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:500)
From this error, we extract the following observations:
- The
TooLongFrameException
is indicating that we are sending a message that is too large compared to thenettyMaxFrameSize
configured in the server. However, there is no way in which we are sending a message of 1.5GB in size. The events we are writing to the Bookkeeper client are strictly smaller than 1MB. - The problem seems related to a corruption of the messaged already sent to the Bookkeeper client, given that we also see a problem like
java.lang.IllegalStateException: Received unknown request op code = 84
. This means that the header itself of the message has been corrupted, and I don’t know how a client application could modify that. - Another aspect that points to a problem in the Bookkeeper client (or maybe its interactions with Netty) is that the 3 Bookies used in this experiment got the same exception at the same time, so in my view this discards that the problem is server related.
I’m attaching the Bookkeeper logs of this reproduction of the problem. bookkeeper-10.0.0.119.log.gz bookkeeper-10.0.0.240.log.gz bookkeeper-10.0.0.251.log.gz
About this issue
- Original URL
- State: open
- Created 4 years ago
- Comments: 17 (16 by maintainers)
oh, IIRC @karanmehta93 (or someone from SF) experienced the same problem, at that time the problem was that the application reused the same byte[] (or ByteBuf) that was passed to the BK LedgerHandle. Basically once you pass your byte[] or ByteBuf to BK you cannot use it anymore.