google-cloud-java: LoggingHandler hangs the thread
Our application runs on Google Kubernetes Engine using the gcr.io/google-appengine/jetty image and uses com.google.cloud.logging.LoggingHandler to publish logs to Stackdriver. We noticed some worker threads becoming unresponsive over time. When the pod shuts down, we can see the following exception for each of them:
java.lang.RuntimeException: java.lang.InterruptedException
at com.google.cloud.logging.LoggingImpl.flush(LoggingImpl.java:545)
at com.google.cloud.logging.LoggingImpl.write(LoggingImpl.java:525)
at com.google.cloud.logging.LoggingHandler.publish(LoggingHandler.java:273)
at java.util.logging.Logger.log(Logger.java:738)
at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:582)
at org.slf4j.impl.JDK14LoggerAdapter.error(JDK14LoggerAdapter.java:500)
...
Caused by: java.lang.InterruptedException
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:449)
at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:79)
at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:63)
at com.google.cloud.logging.LoggingImpl.flush(LoggingImpl.java:543)
... 30 more
We’ll try to extract a thread dump to see why the future never completes, but the issue seems dangerous by itself: LoggingImpl.java:543 uses the non-timeout version of Future.get(), which can cause any logger call to block the current thread forever unless interrupted. Would it be possible to use the timeout version with a reasonably big timeout, e.g. 60 seconds?
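A minimal sketch of the suggested change, written outside the actual LoggingImpl code (the class name, method name, and the 60-second value are illustrative only):

import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch, not the library's LoggingImpl code: bound the wait on a
// pending write instead of blocking until the RPC completes or the thread is interrupted.
final class FlushSketch {
  static void awaitPendingWrite(Future<?> pendingWrite) {
    try {
      pendingWrite.get(60, TimeUnit.SECONDS); // instead of pendingWrite.get()
    } catch (TimeoutException e) {
      // Fail this flush rather than pinning the logging thread forever.
      throw new RuntimeException("Log flush timed out after 60 seconds", e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    } catch (ExecutionException e) {
      throw new RuntimeException(e.getCause());
    }
  }
}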
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 21 (7 by maintainers)
Commits related to this issue
- fix #3880 #2796 disable automatic flush on write to avoid hanging and leaking threads as grpc also uses this appender which goes in to deadlock over time. — committed to ajaaym/google-cloud-java by ajaaym 6 years ago
- Logging: disable severity based flush on write by default (#4254) * fix #3880 #2796 disable automatic flush on write to avoid hanging and leaking threads as grpc also uses this appender which goes in... — committed to googleapis/google-cloud-java by ajaaym 5 years ago
- Merge request (#4436) * Fix: writeAtLeastOnce should use a ReadWriteSession (#3882) * Fix search folder for BQDT synth output (#3878) * Bigtable: Add integration test to test nonexistent row ha... — committed to googleapis/google-cloud-java by snehashah16 5 years ago
Hi,
Having enabled Stackdriver logging (via spring-cloud-gcp-starter-logging and the gcloud logback appender) about a week ago, we are facing recurring issues there.
First of all, it appears that whenever LoggingImpl has to flush, it is a blocking, synchronous operation from the caller’s perspective. I initially had an assumption that writeSynchronicity (which is set to ASYNC by default for logback) would ensure that all underlying API calls are offloaded to some background processing thread. It does not seem to be the case.
It would be great if all asynchronous and blocking operations (and configuration options, when available/applicable) were properly documented. Otherwise it involves a lot of guesswork on the part of library users.
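To make the observed behavior concrete, here is a hedged sketch of the pattern we seem to be hitting. It is not the library's actual code, just an illustration of how an ASYNC write can still block the caller when a flush triggered on the same call path waits on pending futures without a timeout:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Illustrative sketch only, not LoggingImpl's actual implementation.
class AsyncWriteThenBlockingFlush {
  private final List<Future<Void>> pendingWrites = new CopyOnWriteArrayList<>();

  void write(String entry, boolean flushOnWrite) {
    // The write itself is asynchronous: it only queues an RPC.
    pendingWrites.add(sendToBackend(entry));
    if (flushOnWrite) {
      // But a severity-triggered flush runs on the caller's (application) thread...
      flush();
    }
  }

  void flush() {
    for (Future<Void> pending : pendingWrites) {
      try {
        // ...and waits on every pending RPC with no timeout, so a stuck backend
        // call hangs whichever application thread happened to log the message.
        pending.get();
      } catch (InterruptedException | ExecutionException e) {
        throw new RuntimeException(e);
      }
    }
    pendingWrites.clear();
  }

  private Future<Void> sendToBackend(String entry) {
    // Stand-in for the real Stackdriver RPC.
    return CompletableFuture.completedFuture(null);
  }
}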
So in order to mitigate sporadic delays caused by the gcloud logging appender, we decided to wrap it inside an AsyncAppender. Unfortunately, it made matters worse. At some point the cloud backend failed with UNAVAILABLE: Authentication backend unavailable, which completely locked up the application’s processing thread. I’ve done a thread dump (at the bottom of this message), and it looks like we effectively have a deadlock there, so the application is in a state where recovery is impossible.
My next step in trying to mitigate this will be increasing the queue size for AsyncAppender and enabling the neverBlock option on it. That means some messages will be lost when the gcloud backend chokes up, but at least the application won’t be deadlocked by it.
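Here is a hedged sketch of what those settings could look like, using logback's programmatic configuration (a logback.xml equivalent works the same way); the appender name "CLOUD", the queue size, and the class name are assumptions for illustration, not our exact setup:

import ch.qos.logback.classic.AsyncAppender;
import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.LoggerContext;
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.Appender;
import org.slf4j.LoggerFactory;

// Illustrative sketch: wrap an already-attached cloud logging appender in an
// AsyncAppender with a larger queue and neverBlock enabled.
class AsyncCloudLoggingConfig {
  static void wrapCloudAppender() {
    LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();
    Logger root = context.getLogger(Logger.ROOT_LOGGER_NAME);

    // Assumes the Stackdriver appender is attached to the root logger as "CLOUD".
    Appender<ILoggingEvent> cloud = root.getAppender("CLOUD");

    AsyncAppender async = new AsyncAppender();
    async.setContext(context);
    async.setQueueSize(8192);   // larger queue to absorb backend slowness
    async.setNeverBlock(true);  // drop events instead of blocking the app thread
    async.addAppender(cloud);
    async.start();

    root.detachAppender(cloud);
    root.addAppender(async);
  }
}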
Again, maybe it’s worth updating the documentation (both the project’s readme file and the Stackdriver docs) to include the recommended AsyncAppender settings.
And here comes a thread dump:
This seems related to #1795. The underlying client provides no way to initiate a flush.
@sparhomenko Do you have a sense of how long we’re waiting? The above bug might cause us to wait a few seconds, but it shouldn’t cause a prolonged hang. I agree that the wait with no timeout is a problem, but I want to understand the problem better before sending fixes.