accumulo: Accumulo master hangs after TLS on ZK

After enabling TLS for ZK, I noticed that starting Accumulo master hangs in an intermediate process as shown below and require to kill the process in-order for Accumulo master to start.

[user1@host1 ~]$ jps -m
84118 Main org.apache.accumulo.master.state.SetGoalState NORMAL

[user1@host1 ~]$ kill -9 84118

[user1@host1 ~]$ jps -m
89847 Main master

Below is the jstack output collected during the hang

2020-04-01 12:16:50
Full thread dump OpenJDK 64-Bit Server VM (11.0.6+10-LTS mixed mode, sharing):

Threads class SMR info:
_java_thread_list=0x00000000017945d0, length=18, elements={
0x0000000000715800, 0x0000000000718000, 0x0000000000726800, 0x0000000000728800,
0x000000000072b800, 0x0000000000735800, 0x0000000000858000, 0x0000000000869800,
0x000000000066a800, 0x000000000180f800, 0x00000000018f4000, 0x0000000001d4e000,
0x0000000001d4f800, 0x0000000001fdf800, 0x0000000001ed8800, 0x0000000003281000,
0x0000000003282800, 0x00000000019ed000
}

"Reference Handler" #2 daemon prio=10 os_prio=0 cpu=0.28ms elapsed=179.54s tid=0x0000000000715800 nid=0x5f95 waiting on condition  [0x00007fa1630b4000]
   java.lang.Thread.State: RUNNABLE
	at java.lang.ref.Reference.waitForReferencePendingList(java.base@11.0.6/Native Method)
	at java.lang.ref.Reference.processPendingReferences(java.base@11.0.6/Reference.java:241)
	at java.lang.ref.Reference$ReferenceHandler.run(java.base@11.0.6/Reference.java:213)

   Locked ownable synchronizers:
	- None

"Finalizer" #3 daemon prio=8 os_prio=0 cpu=0.56ms elapsed=179.54s tid=0x0000000000718000 nid=0x5f96 in Object.wait()  [0x00007fa162fb3000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(java.base@11.0.6/Native Method)
	- waiting on <0x000000070000a2b0> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(java.base@11.0.6/ReferenceQueue.java:155)
	- waiting to re-lock in wait() <0x000000070000a2b0> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(java.base@11.0.6/ReferenceQueue.java:176)
	at java.lang.ref.Finalizer$FinalizerThread.run(java.base@11.0.6/Finalizer.java:170)

   Locked ownable synchronizers:
	- None

"Signal Dispatcher" #4 daemon prio=9 os_prio=0 cpu=0.46ms elapsed=179.53s tid=0x0000000000726800 nid=0x5f97 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
	- None

"C2 CompilerThread0" #5 daemon prio=9 os_prio=0 cpu=1496.64ms elapsed=179.53s tid=0x0000000000728800 nid=0x5f98 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   No compile task

   Locked ownable synchronizers:
	- None

"C1 CompilerThread0" #8 daemon prio=9 os_prio=0 cpu=1079.64ms elapsed=179.53s tid=0x000000000072b800 nid=0x5f99 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   No compile task

   Locked ownable synchronizers:
	- None

"Sweeper thread" #9 daemon prio=9 os_prio=0 cpu=72.66ms elapsed=179.53s tid=0x0000000000735800 nid=0x5f9a runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
	- None

"Service Thread" #10 daemon prio=9 os_prio=0 cpu=0.13ms elapsed=179.50s tid=0x0000000000858000 nid=0x5f9b runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
	- None

"Common-Cleaner" #11 daemon prio=8 os_prio=0 cpu=0.43ms elapsed=179.49s tid=0x0000000000869800 nid=0x5f9d in Object.wait()  [0x00007fa15c381000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(java.base@11.0.6/Native Method)
	- waiting on <0x00000007000a7668> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(java.base@11.0.6/ReferenceQueue.java:155)
	- waiting to re-lock in wait() <0x00000007000a7668> (a java.lang.ref.ReferenceQueue$Lock)
	at jdk.internal.ref.CleanerImpl.run(java.base@11.0.6/CleanerImpl.java:148)
	at java.lang.Thread.run(java.base@11.0.6/Thread.java:834)
	at jdk.internal.misc.InnocuousThread.run(java.base@11.0.6/InnocuousThread.java:134)

   Locked ownable synchronizers:
	- None

"DestroyJavaVM" #15 prio=5 os_prio=0 cpu=824.32ms elapsed=178.73s tid=0x000000000066a800 nid=0x5f90 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
	- None

"org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner" #18 daemon prio=5 os_prio=0 cpu=0.16ms elapsed=178.29s tid=0x000000000180f800 nid=0x5fa5 in Object.wait()  [0x00007fa15b4f3000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(java.base@11.0.6/Native Method)
	- waiting on <0x00000007077c4ab8> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(java.base@11.0.6/ReferenceQueue.java:155)
	- waiting to re-lock in wait() <0x00000007077c4ab8> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(java.base@11.0.6/ReferenceQueue.java:176)
	at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:3762)
	at java.lang.Thread.run(java.base@11.0.6/Thread.java:834)

   Locked ownable synchronizers:
	- None

"client DomainSocketWatcher" #19 daemon prio=5 os_prio=0 cpu=1.85ms elapsed=177.83s tid=0x00000000018f4000 nid=0x5fa6 runnable  [0x00007fa15a7cc000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method)
	at org.apache.hadoop.net.unix.DomainSocketWatcher.access$900(DomainSocketWatcher.java:52)
	at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:503)
	at java.lang.Thread.run(java.base@11.0.6/Thread.java:834)

   Locked ownable synchronizers:
	- None

"org.apache.accumulo.master.state.SetGoalState-SendThread(kn-fix-0:2281)" #25 daemon prio=5 os_prio=0 cpu=142.24ms elapsed=177.47s tid=0x0000000001d4e000 nid=0x5fab waiting on condition  [0x00007fa159ec3000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.6/Native Method)
	- parking to wait for  <0x000000070eddc9d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.6/LockSupport.java:234)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.6/AbstractQueuedSynchronizer.java:2123)
	at java.util.concurrent.LinkedBlockingDeque.pollFirst(java.base@11.0.6/LinkedBlockingDeque.java:513)
	at java.util.concurrent.LinkedBlockingDeque.poll(java.base@11.0.6/LinkedBlockingDeque.java:675)
	at org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:278)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1223)

   Locked ownable synchronizers:
	- None

"org.apache.accumulo.master.state.SetGoalState-EventThread" #26 daemon prio=5 os_prio=0 cpu=1.06ms elapsed=177.47s tid=0x0000000001d4f800 nid=0x5fac waiting on condition  [0x00007fa159dc2000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.6/Native Method)
	- parking to wait for  <0x000000070ee25888> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.6/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.6/AbstractQueuedSynchronizer.java:2081)
	at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.6/LinkedBlockingQueue.java:433)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)

   Locked ownable synchronizers:
	- None

"nioEventLoopGroup-2-1" #27 prio=10 os_prio=0 cpu=776.37ms elapsed=177.24s tid=0x0000000001fdf800 nid=0x5fae runnable  [0x00007fa15b917000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPoll.wait(java.base@11.0.6/Native Method)
	at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@11.0.6/EPollSelectorImpl.java:120)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@11.0.6/SelectorImpl.java:124)
	- locked <0x000000070edab1a0> (a io.netty.channel.nio.SelectedSelectionKeySet)
	- locked <0x000000070ed9edb8> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(java.base@11.0.6/SelectorImpl.java:141)
	at io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:68)
	at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:803)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:457)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(java.base@11.0.6/Thread.java:834)

   Locked ownable synchronizers:
	- None

"org.apache.accumulo.master.state.SetGoalState-SendThread(kn-fix-4:2281)" #28 daemon prio=5 os_prio=0 cpu=12.24ms elapsed=176.19s tid=0x0000000001ed8800 nid=0x5fb7 waiting on condition  [0x00007fa159357000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.6/Native Method)
	- parking to wait for  <0x000000070efd03b0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.6/LockSupport.java:234)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.6/AbstractQueuedSynchronizer.java:2123)
	at java.util.concurrent.LinkedBlockingDeque.pollFirst(java.base@11.0.6/LinkedBlockingDeque.java:513)
	at java.util.concurrent.LinkedBlockingDeque.poll(java.base@11.0.6/LinkedBlockingDeque.java:675)
	at org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:278)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1223)

   Locked ownable synchronizers:
	- None

"org.apache.accumulo.master.state.SetGoalState-EventThread" #29 daemon prio=5 os_prio=0 cpu=0.34ms elapsed=176.19s tid=0x0000000003281000 nid=0x5fb8 waiting on condition  [0x00007fa159256000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.6/Native Method)
	- parking to wait for  <0x000000070efd17e8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.6/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.6/AbstractQueuedSynchronizer.java:2081)
	at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.6/LinkedBlockingQueue.java:433)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)

   Locked ownable synchronizers:
	- None

"nioEventLoopGroup-3-1" #30 prio=10 os_prio=0 cpu=342.16ms elapsed=176.18s tid=0x0000000003282800 nid=0x5fb9 runnable  [0x00007fa159155000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPoll.wait(java.base@11.0.6/Native Method)
	at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@11.0.6/EPollSelectorImpl.java:120)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@11.0.6/SelectorImpl.java:124)
	- locked <0x000000070efcf058> (a io.netty.channel.nio.SelectedSelectionKeySet)
	- locked <0x000000070efcee30> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(java.base@11.0.6/SelectorImpl.java:141)
	at io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:68)
	at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:803)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:457)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(java.base@11.0.6/Thread.java:834)

   Locked ownable synchronizers:
	- None

"Attach Listener" #31 daemon prio=9 os_prio=0 cpu=0.74ms elapsed=0.14s tid=0x00000000019ed000 nid=0x6153 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
	- None

"VM Thread" os_prio=0 cpu=154.07ms elapsed=179.54s tid=0x0000000000712800 nid=0x5f94 runnable  

"GC Thread#0" os_prio=0 cpu=28.68ms elapsed=179.56s tid=0x000000000067e800 nid=0x5f91 runnable  

"CMS Main Thread" os_prio=0 cpu=1010.50ms elapsed=179.55s tid=0x00000000006ee800 nid=0x5f93 runnable  

"CMS Thread#0" os_prio=0 cpu=1.06ms elapsed=179.55s tid=0x00000000006ec000 nid=0x5f92 runnable  

"CMS Thread#1" os_prio=0 cpu=1.07ms elapsed=177.50s tid=0x0000000001d34000 nid=0x5faa runnable  

"VM Periodic Task Thread" os_prio=0 cpu=103.05ms elapsed=179.49s tid=0x0000000000873800 nid=0x5f9c waiting on condition  

JNI global refs: 19, weak refs: 0

I & @keith-turner discussed this issue & looked at the jstack output, noticed there are 2 non-daemon threads (no. 27 & 30). As per the description on this link - JVM exits when the only thread running are all daemon threads. Based on which we wanted to check with ZK dev team if those 2 threads are expected to be non-daemon threads.

Let me know if anyone else noticed this issue? I’m using ZK 3.5.7, Accumulo 2.0, Hadoop 3.2.1.

Thanks Karthick

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 17 (17 by maintainers)

Commits related to this issue

Most upvoted comments

I & @keith-turner made changes to SingletonManager & SetGoalState Java class as described here & did a quick test with the new jars. When we started Accumulo master it now starts successfully without hang. I’ll put these new jars cluster wide and perform some more tests and keep you all posted. Thanks

@karthick-rn BTW, thanks for your very excellent, and detailed guide for reproducing! I haven’t had a chance to try it, but your guide left me with zero questions as to how to proceed. 😺 Once this is all figured out and working, that would make an excellent blog post (and possibly an integration test for accumulo-minicluster).

Following the changes done on SingletonManager & SetGoalState Java class, the patch was tested cluster wide & we are able to successfully start all Accumulo services including the master service without any hiccups. However, stopping the services exhibit similar hang and @keith-turner helped me with the code change on ZooZap & Admin Java class to fix those. With all these changes in-place, we’re now able to start/stop Accumulo successfully. The next step would be to land these changes to the master branch. On a side note, noticed the below warning message appear on the console during start & stop of services. May be I’ll raise a separate issue to track what is causing this. Thanks! 2020-04-09 16:09:02,562 [zookeeper.ZooCache] WARN : Unhandled: WatchedEvent state:Closed type:None path:null