accumulo: Tserver wait timeout is not forcing startup to continue

Describe the bug

If Accumulo is started with the following properties, startup blocks for longer than 5 minutes because max.wait never forces the startup process to continue.
master.startup.tserver.avail.min.count=<# of tservers>
master.startup.tserver.avail.max.wait=5m

Versions (OS, Maven, Java, and others, as appropriate):

  • Affected version(s) of this project: 1.10.x, 2.1.x

To Reproduce

I used Fluo-uno to replicate this bug.

Note: The minimum number of tservers available should be set higher than your expected total tserver count.

Accumulo 1.10.2

  1. Add the following properties to conf/accumulo/1/accumulo-site.xml
<property>
   <name>master.startup.tserver.avail.max.wait</name>
   <value>5m</value>
</property>
<property>
    <name>master.startup.tserver.avail.min.count</name>
    <value>3</value>
</property>
  2. Modify the ACCUMULO_VERSION in conf/uno.conf to match 1.10.2.
     export ACCUMULO_VERSION=${ACCUMULO_VERSION:-1.10.2}

  3. Fetch and start Accumulo.
     source <(./bin/uno env)
     uno fetch accumulo
     uno start accumulo

  4. View the logs to see startup still blocked after 300 seconds.
     tail -f install/logs/accumulo/master_<hostname>.log

2023-01-10 03:53:22,451 [master.Master] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 15 sec.
2023-01-10 03:53:52,501 [master.Master] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 45 sec.
2023-01-10 03:54:37,552 [master.Master] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 90 sec.
2023-01-10 03:55:37,602 [master.Master] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 150 sec.
2023-01-10 03:56:52,652 [master.Master] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 225 sec.
2023-01-10 03:58:22,703 [master.Master] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 315 sec.
2023-01-10 04:00:07,753 [master.Master] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 420 sec.

Accumulo 2.1

  1. Add the following properties to fluo-uno/conf/accumulo/2/accumulo.properties
manager.startup.tserver.avail.min.count=3
manager.startup.tserver.avail.max.wait=5m
  2. Fetch and start Accumulo.
     source <(./bin/uno env)
     uno fetch accumulo
     uno setup accumulo

  3. View the logs to see startup still blocked after 300 seconds.
     cat install/logs/accumulo/manager_<hostname>.log | grep Blocking

2023-01-11T13:46:08,511 [manager.Manager] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 0 sec.
2023-01-11T13:46:23,562 [manager.Manager] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 15 sec.
2023-01-11T13:46:53,612 [manager.Manager] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 45 sec.
2023-01-11T13:47:38,662 [manager.Manager] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 90 sec.
2023-01-11T13:48:38,713 [manager.Manager] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 150 sec.
2023-01-11T13:49:53,763 [manager.Manager] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 225 sec.
2023-01-11T13:51:23,813 [manager.Manager] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 315 sec.
2023-01-11T13:53:08,864 [manager.Manager] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 420 sec.

Expected behavior

Given manager.startup.tserver.avail.min.count and manager.startup.tserver.avail.max.wait are set,
when tserver.avail.min.count is not reached by the time specified in tserver.avail.max.wait,
then the startup process continues.
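
The expected behavior can be read as one decision that the startup check makes on each pass. The Java sketch below is only an illustration: the class StartupGate and the method shouldProceed are hypothetical names, not Accumulo code, and treating a zero or negative max wait as "block indefinitely" is an assumption.

```java
import java.time.Duration;

// Hypothetical illustration only; not taken from the Accumulo source.
final class StartupGate {

    /**
     * Returns true when startup should stop blocking.
     * Assumption: maxWait <= 0 means "wait indefinitely for minCount".
     */
    public static boolean shouldProceed(int available, int minCount,
                                        Duration elapsed, Duration maxWait) {
        if (available >= minCount) {
            return true; // enough tservers are up
        }
        if (maxWait.isZero() || maxWait.isNegative()) {
            return false; // no deadline configured, keep blocking
        }
        // Deadline configured: continue once the elapsed time reaches it.
        return elapsed.compareTo(maxWait) >= 0;
    }

    public static void main(String[] args) {
        Duration maxWait = Duration.ofMinutes(5);
        // Values taken from the log lines in this report: 1 of 3 tservers.
        System.out.println(shouldProceed(1, 3, Duration.ofSeconds(225), maxWait)); // false
        System.out.println(shouldProceed(1, 3, Duration.ofSeconds(315), maxWait)); // true
    }
}
```

Under this reading, the log line reporting 315 sec of blocking would be the last one before startup continues, since 315 sec is past the 5m limit.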

Additional context

This is related to issue #3157 but focuses only on the expected behavior of the tserver.avail.max.wait property.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

I agree with @EdColeman that it’s really up for interpretation as to what the property means. My comment was mostly pointing out that, as someone who isn’t too familiar with those properties, it looked to me at first glance like it might make sense for the system to fail and halt if the minimum number of servers is not met in the time frame. But if that’s not the intent, then that is fine as long as things are well documented.

If the intent is to not stop and to continue, then maybe we should add another property as @dlmarion mentioned. I think the halting case is still valid (someone may want to shut down and not continue starting the cluster if the minimum number of servers doesn’t come online, as there could be a larger issue), so I think it is worthwhile to add, but it could be done as another Issue/PR of course.

Maybe a loop here would be better to handle multiple conditions. I think that the loop would need to support:

  • allow for infinite retries
  • be able to ignore or disable the tserver-count barrier
  • print progress, wait conditions periodically, independent of the wait time.
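
One way such a loop might look is sketched below in Java. This is an assumption-laden illustration rather than the fix that was merged: TserverBarrier, await, and all parameter names are hypothetical. A minCount of 0 or less disables the barrier, and a maxWaitMs of 0 or less means retry forever. Progress is logged on its own interval, and each sleep is capped at the time remaining so the deadline is not overshot.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of the loop described above; not Accumulo code.
final class TserverBarrier {

    /**
     * Blocks until at least minCount servers are reported or the deadline passes.
     * Returns true if the count was reached (or the barrier is disabled),
     * false if the wait timed out or the thread was interrupted.
     */
    public static boolean await(IntSupplier available, int minCount,
                                long maxWaitMs, long pollMs, long logEveryMs) {
        if (minCount <= 0) {
            return true; // barrier disabled
        }
        long start = System.nanoTime();
        long lastLogMs = -logEveryMs; // forces a progress line on the first pass
        while (true) {
            int have = available.getAsInt();
            if (have >= minCount) {
                return true;
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
            // maxWaitMs <= 0 means no deadline: retry indefinitely.
            if (maxWaitMs > 0 && elapsedMs >= maxWaitMs) {
                return false;
            }
            // Progress is reported on its own schedule, independent of pollMs.
            if (elapsedMs - lastLogMs >= logEveryMs) {
                System.out.printf("Blocking for tserver availability: have %d of %d after %d ms%n",
                        have, minCount, elapsedMs);
                lastLogMs = elapsedMs;
            }
            // Never sleep past the deadline.
            long sleepMs = pollMs;
            if (maxWaitMs > 0) {
                sleepMs = Math.min(pollMs, maxWaitMs - elapsedMs);
            }
            try {
                Thread.sleep(Math.max(1L, sleepMs));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A caller could then decide what a false return means, for example continuing startup anyway or halting, which is the choice discussed in the comments above.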