google-cloud-java: LoggingHandler hangs the thread

Our application runs on Google Kubernetes Engine using the gcr.io/google-appengine/jetty image and uses com.google.cloud.logging.LoggingHandler to publish logs to Stackdriver. We noticed some worker threads becoming unresponsive over time. When the pod shuts down, we see the following exception for each of them:

java.lang.RuntimeException: java.lang.InterruptedException
	at com.google.cloud.logging.LoggingImpl.flush(LoggingImpl.java:545)
	at com.google.cloud.logging.LoggingImpl.write(LoggingImpl.java:525)
	at com.google.cloud.logging.LoggingHandler.publish(LoggingHandler.java:273)
	at java.util.logging.Logger.log(Logger.java:738)
	at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:582)
	at org.slf4j.impl.JDK14LoggerAdapter.error(JDK14LoggerAdapter.java:500)
        ...
Caused by: java.lang.InterruptedException
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:449)
	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:79)
	at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:63)
	at com.google.cloud.logging.LoggingImpl.flush(LoggingImpl.java:543)
	... 30 more

We’ll try to extract a thread dump to see why the future never completes, but the issue seems dangerous by itself: LoggingImpl.java:543 uses the untimed version of Future.get(), which can cause any logger call to block the current thread forever unless it is interrupted. Would it be possible to use the timed version with a reasonably large timeout, e.g. 60 seconds?
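
For illustration, here is a minimal sketch of the suggested change (hypothetical helper names, not the actual LoggingImpl source): replace the untimed get() with a timed get() so a stuck RPC surfaces as a TimeoutException instead of parking the logging thread forever.

import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, not the actual LoggingImpl code: bound the wait on the
// pending writes instead of calling the untimed Future.get().
final class BoundedFlush {

    // Waits at most timeoutSeconds for the pending writes to complete.
    static void flush(Future<?> pendingWrites, long timeoutSeconds) {
        try {
            pendingWrites.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // Give up on this batch instead of blocking the caller indefinitely.
            throw new RuntimeException("Timed out flushing log entries", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        }
    }
}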

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 21 (7 by maintainers)

Most upvoted comments

Hi,

Having enabled Stackdriver logging (via spring-cloud-gcp-starter-logging and gcloud logback appender) about a week ago, we are facing recurring issues there.

First of all, it appears that whenever LoggingImpl has to flush, the flush is a blocking, synchronous operation from the caller’s perspective. I initially assumed that writeSynchronicity (which is set to ASYNC by default for logback) would ensure that all underlying API calls are offloaded to a background processing thread. That does not seem to be the case.
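
To make that concrete, here is a simplified sketch of the pattern I read out of the stack traces (an assumption about the behaviour, not the real LoggingImpl source): the individual writes return futures, but a flush joins all of them on whatever thread triggered it, with no timeout.

import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import java.util.List;

// Simplified sketch, not the actual library code: even with ASYNC writes,
// flush() blocks the calling thread until every pending write RPC finishes.
final class FlushJoinsPendingWrites {

    static void flush(List<ListenableFuture<Void>> pendingWrites) throws Exception {
        // No timeout here: the thread that crossed the flush level (e.g. a WARN event)
        // waits until every outstanding write completes or fails.
        Futures.allAsList(pendingWrites).get();
    }
}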

It would be great if all asynchronous and blocking operations (and configuration options, when available/applicable) were properly documented. Otherwise it involves a lot of guesswork on the part of library users.

So in order to mitigate sporadic delays caused by the gcloud logging appender, we decided to wrap it inside an AsyncAppender. Unfortunately, that made matters worse. At some point the cloud backend failed with UNAVAILABLE: Authentication backend unavailable, which completely locked up the application’s processing thread. I’ve taken a thread dump (at the bottom of this message), and it looks like we effectively have a deadlock:

  • The application thread is waiting on the async appender, whose buffer is full, so its put blocks
  • The async appender’s worker thread is waiting for the gcloud logger to finish flushing
  • The gRPC thread failed with an exception and tried to log that exception, which left it blocked on the same async appender call as well!

So the application ends up in a state from which recovery is impossible.

My next step in trying to mitigate this will be to increase the queue size for the AsyncAppender and enable its neverBlock option. That means some messages will be lost when the gcloud backend chokes, but at least the application won’t be deadlocked by it:

<configuration>

    <springProperty name="APP_NAME" source="spring.application.name" default="spring"/>

    <appender name="CLOUD" class="com.google.cloud.logging.logback.LoggingAppender">
        <flushLevel>WARN</flushLevel>
        <log>${APP_NAME}</log>
    </appender>
    <appender name="ASYNC_CLOUD" class="ch.qos.logback.classic.AsyncAppender">
        <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
            <level>INFO</level>
        </filter>
        <!-- May result in message loss when cloud backend is unavailable, but won't block the application -->
        <neverBlock>true</neverBlock>
        <queueSize>2048</queueSize>
        <appender-ref ref="CLOUD"/>
    </appender>

    <root level="DEBUG">
        <appender-ref ref="ASYNC_CLOUD"/>
        <!-- Other appenders here... -->
    </root>

</configuration>
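
For context on why neverBlock helps: logback’s async appender normally enqueues events with a blocking put on a bounded queue (that is the putUninterruptibly frame in the dump below), while neverBlock switches to a non-blocking offer and drops the event when the queue is full. A standalone illustration using plain java.util.concurrent, not logback’s own code:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Standalone illustration, not logback internals: put() on a full bounded queue parks
// the producer thread (the state the application and gRPC threads are stuck in below),
// while offer() returns immediately so the caller can drop the event instead.
public class NeverBlockSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        queue.put("first event"); // the queue is now full

        // neverBlock=true behaviour: try to enqueue, drop the event on failure.
        boolean accepted = queue.offer("second event");
        if (!accepted) {
            System.out.println("queue full, event dropped instead of blocking the caller");
        }
    }
}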

Again, maybe it’s worth updating the documentation (both the project’s readme file and the Stackdriver docs) to include the recommended AsyncAppender settings.

And here is the thread dump:

"http-nio-8080-exec-1" #58 daemon prio=5 os_prio=0 tid=0x00007f388ff7e000 nid=0x754d waiting on condition [0x00007f3834
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000733d7e4b0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionO
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:
        at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:353)
        at ch.qos.logback.core.AsyncAppenderBase.putUninterruptibly(AsyncAppenderBase.java:181)
        at ch.qos.logback.core.AsyncAppenderBase.put(AsyncAppenderBase.java:172)
        at ch.qos.logback.core.AsyncAppenderBase.append(AsyncAppenderBase.java:161)
        at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:84)
        at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51)
        at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:270)
        at ch.qos.logback.classic.Logger.callAppenders(Logger.java:257)
        at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:421)
        at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:383)
        at ch.qos.logback.classic.Logger.warn(Logger.java:688)
        at <application code>

"AsyncAppender-Worker-ASYNC_CLOUD" #15 daemon prio=5 os_prio=0 tid=0x00007f388cd28000 nid=0x7512 waiting on condition [
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000079bba6b18> (a com.google.common.util.concurrent.CollectionFuture$ListFuture)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:471)
        at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:78)
        at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:62)
        at com.google.cloud.logging.LoggingImpl.flush(LoggingImpl.java:544)
        at com.google.cloud.logging.LoggingImpl.write(LoggingImpl.java:526)
        at com.google.cloud.logging.logback.LoggingAppender.append(LoggingAppender.java:201)
        at com.google.cloud.logging.logback.LoggingAppender.append(LoggingAppender.java:63)
        at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:84)
        at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51)
        at ch.qos.logback.core.AsyncAppenderBase$Worker.run(AsyncAppenderBase.java:290)
   Locked ownable synchronizers:
        - None

"Gax-2" #16 daemon prio=5 os_prio=0 tid=0x00007f3830410000 nid=0x7522 waiting on condition [0x00007f3870f53000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000733d7e4b0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionO
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:
        at java.util.concurrent.ArrayBlockingQueue.put(ArrayBlockingQueue.java:353)
        at ch.qos.logback.core.AsyncAppenderBase.putUninterruptibly(AsyncAppenderBase.java:181)
        at ch.qos.logback.core.AsyncAppenderBase.put(AsyncAppenderBase.java:172)
        at ch.qos.logback.core.AsyncAppenderBase.append(AsyncAppenderBase.java:161)
        at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:84)
        at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51)
        at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:270)
        at ch.qos.logback.classic.Logger.callAppenders(Logger.java:257)
        at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:421)
        at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:383)
        at ch.qos.logback.classic.Logger.log(Logger.java:765)
        at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
        at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
        at java.util.logging.Logger.log(Logger.java:738)
        at java.util.logging.Logger.doLog(Logger.java:765)
        at java.util.logging.Logger.log(Logger.java:876)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:907)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:813)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:677)
        at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:112)
        at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:399)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:902)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:813)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:677)
        at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:134)
        at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:399)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:902)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:813)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:677)
        at com.google.api.core.AbstractApiFuture$InternalSettableFuture.setException(AbstractApiFuture.java:95)
        at com.google.api.core.AbstractApiFuture.setException(AbstractApiFuture.java:77)
        at com.google.api.gax.rpc.BatchedFuture.setException(BatchedFuture.java:55)
        at com.google.api.gax.rpc.BatchedRequestIssuer.sendResult(BatchedRequestIssuer.java:84)
        at com.google.api.gax.rpc.BatchExecutor$1.onFailure(BatchExecutor.java:96)
        at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:61)
        at com.google.common.util.concurrent.Futures$4.run(Futures.java:1126)
        at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:399)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:902)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:813)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:677)
        at com.google.api.core.AbstractApiFuture$InternalSettableFuture.setException(AbstractApiFuture.java:95)
        at com.google.api.core.AbstractApiFuture.setException(AbstractApiFuture.java:77)
        at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.ja
        at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:61)
        at com.google.common.util.concurrent.Futures$4.run(Futures.java:1126)
        at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:399)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:902)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:813)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:677)
        at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:493)
        at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:468)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener
        at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:684)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener
        at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:403)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:29
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   Locked ownable synchronizers:
        - <0x00000007343591f8> (a java.util.concurrent.ThreadPoolExecutor$Worker)

This seems related to #1795. The underlying client provides no way to initiate a flush.

@sparhomenko Do you have a sense of how long we’re waiting? The above bug might cause us to wait a few seconds, but it shouldn’t cause a prolonged hang. I agree that the wait with no timeout is a problem, but I want to understand the problem better before sending fixes.