accumulo: OutOfMemoryError During BinMutations causes stuck BatchWriter

Problem The first step in flushing mutations from the BatchWriter to the tablet servers is binning the mutations by destination tablet server. If a tablet server throws an OutOfMemoryError, or any other Throwable that is not an Exception, the error is rethrown on the client side. The client does not catch it, so the bin thread dies without reporting that an error occurred. This is unrecoverable: the batch writer is left stuck, waiting for mutations to be flushed that never will be.
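The failure mode can be sketched with plain JDK threads. This is a minimal illustration, not Accumulo code: the class name SilentBinThreadDemo, the latch, and the silenced uncaught-exception handler are all assumptions made to mimic a worker that dies without reporting. It shows that catch (Exception e) does not intercept an Error, so the waiting caller is never notified.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SilentBinThreadDemo {

    // Hypothetical stand-in for binning work that surfaces a server-side error.
    static void simulateBinning() {
        throw new OutOfMemoryError("simulated");
    }

    // Returns true only if the waiting caller was notified within the timeout.
    static boolean waiterNotified(long timeoutMillis) {
        CountDownLatch notified = new CountDownLatch(1);

        Thread binThread = new Thread(() -> {
            try {
                simulateBinning();
                notified.countDown();
            } catch (Exception e) {
                // Never reached: OutOfMemoryError is an Error, not an Exception.
                notified.countDown();
            }
        }, "bin-thread");

        // Assumption: silence the handler to mimic a thread that dies unreported.
        binThread.setUncaughtExceptionHandler((t, e) -> { });
        binThread.start();

        try {
            binThread.join();
            // The worker is gone, but nothing ever counted the latch down.
            return notified.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ie);
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter notified: " + waiterNotified(200));
    }
}
```

Without the timeout on await, the caller in this sketch would block forever, which is the analogue of the stuck batch writer described above.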

Affected versions 1.10.x, main

To Reproduce Not easily reproducible. One could write a rogue iterator that throws an OutOfMemoryError when the metadata table is being scanned.

Bug Location https://github.com/apache/accumulo/blob/5f20e38628abd6385a5ee652634f4d748ecd013f/core/src/main/java/org/apache/accumulo/core/clientImpl/TabletServerBatchWriter.java#L706

At that line, catch (Exception e) { should be changed to catch (Throwable e) { so that Errors such as OutOfMemoryError are caught as well.
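A rough sketch of what catching Throwable buys, again using only the JDK. The names CatchThrowableDemo and runAndReport are hypothetical, and how TabletServerBatchWriter actually records and propagates the failure is not shown here; the point is only that the worker records any Throwable and always wakes the waiter.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class CatchThrowableDemo {

    // Runs the work on a separate thread and returns the Throwable it failed
    // with, or null if it completed normally.
    static Throwable runAndReport(Runnable binningWork, long timeoutMillis) {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        CountDownLatch done = new CountDownLatch(1);

        Thread binThread = new Thread(() -> {
            try {
                binningWork.run();
            } catch (Throwable e) {
                // Catches Errors (e.g. OutOfMemoryError) as well as Exceptions.
                failure.set(e);
            } finally {
                // Always wake the waiter, whether the work succeeded or failed.
                done.countDown();
            }
        }, "bin-thread");
        binThread.start();

        try {
            if (!done.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("worker did not finish in time");
            }
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ie);
        }
        return failure.get();
    }

    public static void main(String[] args) {
        Throwable result = runAndReport(() -> {
            throw new OutOfMemoryError("simulated");
        }, 1000);
        System.out.println("worker failure: " + result);
    }
}
```

The caller can then decide what to do with the recorded failure, for example rethrowing it to the code that is waiting on the flush.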

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 27 (27 by maintainers)

Most upvoted comments

#2554 includes a change in 2.1.0 that provides a default UncaughtExceptionHandler in the client. It only logs exceptions and errors; unlike the AccumuloUncaughtExceptionHandler used on the server side, it does not terminate the VM. #2554 also lets users supply their own UncaughtExceptionHandler implementation.
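For readers unfamiliar with the mechanism, here is a minimal sketch of a log-only handler using the standard java.lang.Thread.UncaughtExceptionHandler interface. It is not the handler from #2554, and it does not show how Accumulo lets a user register one; the class name LoggingHandlerDemo and the in-memory LOG list are assumptions made for illustration.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class LoggingHandlerDemo {

    // Hypothetical in-memory log so the effect of the handler can be inspected.
    static final List<String> LOG = new CopyOnWriteArrayList<>();

    // Log-only handler: records the failure and does NOT terminate the VM.
    static final Thread.UncaughtExceptionHandler LOG_ONLY = (thread, error) -> {
        String message = thread.getName() + " died: " + error;
        LOG.add(message);
        System.err.println(message);
    };

    // Starts a thread that fails with an Error and waits for it to terminate.
    static void runFailingThread() {
        Thread worker = new Thread(() -> {
            throw new OutOfMemoryError("simulated");
        }, "client-worker");
        worker.setUncaughtExceptionHandler(LOG_ONLY);
        worker.start();
        try {
            // The handler runs on the dying thread before join() returns.
            worker.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ie);
        }
    }

    public static void main(String[] args) {
        runFailingThread();
        System.out.println("logged entries: " + LOG.size());
    }
}
```

A handler like this makes the failure visible, but on its own it does not unblock a caller waiting on the dead thread; that still needs the failure to be recorded and signalled where the work is awaited.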