jetcd: Unable to catch etcd cluster failure

I’m using jetcd 0.3.0 to watch an etcd cluster of 3 nodes running in Docker. After setting up the Watch object, I stopped all 3 nodes to see whether I could handle the failure within my microservice. At this point two things happened:

  1. The Watch object simply did not react to the outage for a long time: no exception was raised. How does jetcd check the health of the etcd cluster? Is there a heartbeat-like mechanism under the hood that I can configure to detect the cluster’s unavailability much faster?
  2. After a while I received a NullPointerException thrown from the onError method at line 250 of WatchImpl (stream.onCompleted()):
2019-03-16 18:33:18.168 ERROR 1 --- [ault-executor-1] io.grpc.internal.SerializingExecutor     : Exception while executing runnable io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed@3e14b715
java.lang.NullPointerException: null
      at io.etcd.jetcd.WatchImpl$WatcherImpl.onError(WatchImpl.java:250) ~[jetcd-core-0.3.0.jar:na]
      at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:434) ~[grpc-stub-1.17.1.jar:1.17.1]
      at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.etcd.jetcd.ClientConnectionManager$AuthTokenInterceptor$1$1.onClose(ClientConnectionManager.java:302) ~[jetcd-core-0.3.0.jar:na]
      at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:694) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.17.1.jar:1.17.1]
      at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) ~[grpc-core-1.17.1.jar:1.17.1]
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_191]
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_191]
      at java.lang.Thread.run(Thread.java:748) [na:1.8.0_191]

It is rare for a system distributed across different geographical areas to become completely unavailable, but even in this scenario I still want to be able to handle the errors cleanly within my application. Thank you.
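Since jetcd does not seem to surface the outage promptly on its own, one workaround is an application-level heartbeat: periodically issue a read with a short deadline and treat a timeout as cluster unavailability. This is a minimal sketch, assuming three hypothetical endpoints (`node1..node3`) and an arbitrary probe key; it requires a running etcd cluster and the jetcd 0.3.0 jar, so it is illustrative rather than a tested program:

```java
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.KV;

import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;

public class EtcdHealthProbe {
    public static void main(String[] args) {
        // Endpoints are placeholders for this example.
        Client client = Client.builder()
                .endpoints("http://node1:2379", "http://node2:2379", "http://node3:2379")
                .build();
        KV kv = client.getKVClient();
        ByteSequence probeKey = ByteSequence.from("health-probe", StandardCharsets.UTF_8);
        try {
            // A short-deadline read acts as a heartbeat: if the whole cluster
            // is down, this times out in ~2s instead of waiting for the watch
            // stream to notice on its own.
            kv.get(probeKey).get(2, TimeUnit.SECONDS);
            System.out.println("cluster reachable");
        } catch (Exception e) {
            System.out.println("cluster unreachable: " + e);
        } finally {
            client.close();
        }
    }
}
```

Running this on a schedule (e.g. every few seconds) gives the application its own, configurable detection latency, independent of whatever resume logic the watch stream uses internally.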

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 22 (4 by maintainers)

Most upvoted comments

I am having a similar issue. It is not clear to me how a user of jetcd should handle connection failures. I believe the onError in WatchImpl checks for a halt error or a no-leader error, neither of which matches when the cluster is entirely unavailable. It would be useful to propagate this to the caller, because the caller may need to detect the loss of connectivity and act accordingly.

I’m not sure why stream.onCompleted() ends in a NullPointerException, but it appears to stop the reconnection attempts, meaning the client will never reconnect after this error. Is there a way for a caller to check whether the connection is still being established? It may be useful for the caller to defer other work until the connection is up again.
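Until the library propagates this cleanly, one way to keep control on the caller's side is to re-create the watch yourself from the listener's onError, with exponential backoff. This is a sketch against jetcd's listener-based watch API, assuming a `Watch.Listener` with `onNext`/`onError`/`onCompleted` callbacks as in 0.3.0; the backoff values and log messages are arbitrary, and it needs a live cluster to exercise:

```java
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.Watch;
import io.etcd.jetcd.watch.WatchResponse;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ResilientWatch {
    private final Client client;
    private final ByteSequence key;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private int attempt = 0;

    public ResilientWatch(Client client, ByteSequence key) {
        this.client = client;
        this.key = key;
    }

    public void start() {
        client.getWatchClient().watch(key, new Watch.Listener() {
            @Override
            public void onNext(WatchResponse response) {
                attempt = 0; // a successful event resets the backoff
                response.getEvents().forEach(e ->
                        System.out.println("event: " + e.getEventType()));
            }

            @Override
            public void onError(Throwable t) {
                // Surface the failure to the application instead of relying
                // on jetcd's internal resume, then retry with capped
                // exponential backoff: 1, 2, 4, ... up to 30 seconds.
                long delay = Math.min(30L, 1L << Math.min(attempt++, 5));
                System.err.println("watch failed (" + t + "), retrying in " + delay + "s");
                scheduler.schedule(ResilientWatch.this::start, delay, TimeUnit.SECONDS);
            }

            @Override
            public void onCompleted() {
                System.err.println("watch completed; re-establishing");
                scheduler.schedule(ResilientWatch.this::start, 1, TimeUnit.SECONDS);
            }
        });
    }
}
```

The key design point is that the application, not the library, owns the retry loop: even if the internal resume path dies on the NullPointerException described above, each onError schedules a fresh watch call, so the client keeps trying once the cluster comes back.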