accumulo: Tserver wait timeout is not forcing startup to continue
Describe the bug
If Accumulo is started with the following properties, startup blocks for longer than 5 minutes because max.wait never forces the startup process to continue.
master.startup.tserver.avail.min.count=<# of tservers>
master.startup.tserver.avail.max.wait=5min
Versions (OS, Maven, Java, and others, as appropriate):
- Affected version(s) of this project: 1.10.x, 2.1.x
To Reproduce
I used Fluo-uno to replicate this bug.
Note: The minimum number of tservers available should be set higher than your expected total tserver count.
Accumulo 1.10.2
- Add the following properties to conf/accumulo/1/accumulo-site.xml
<property>
<name>master.startup.tserver.avail.max.wait</name>
<value>5m</value>
</property>
<property>
<name>master.startup.tserver.avail.min.count</name>
<value>3</value>
</property>
- Modify the ACCUMULO_VERSION in conf/uno.conf to match 1.10.2.
export ACCUMULO_VERSION=${ACCUMULO_VERSION:-1.10.2}
- Fetch & start accumulo
source <(./bin/uno env)
uno fetch accumulo
uno start accumulo
- View logs to see startup continuing to be blocked after 300 seconds
tail -f install/logs/accumulo/master_<hostname>.log
2023-01-10 03:53:22,451 [master.Master] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 15 sec.
2023-01-10 03:53:52,501 [master.Master] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 45 sec.
2023-01-10 03:54:37,552 [master.Master] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 90 sec.
2023-01-10 03:55:37,602 [master.Master] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 150 sec.
2023-01-10 03:56:52,652 [master.Master] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 225 sec.
2023-01-10 03:58:22,703 [master.Master] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 315 sec.
2023-01-10 04:00:07,753 [master.Master] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 420 sec.
Accumulo 2.1
- Add the following properties to fluo-uno/conf/accumulo/2/accumulo.properties
manager.startup.tserver.avail.min.count=3
manager.startup.tserver.avail.max.wait=5m
- Fetch & start accumulo
source <(./bin/uno env)
uno fetch accumulo
uno setup accumulo
- View logs to see startup continuing to be blocked after 300 seconds
cat install/logs/accumulo/manager_<hostname>.log | grep Blocking
2023-01-11T13:46:08,511 [manager.Manager] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 0 sec.
2023-01-11T13:46:23,562 [manager.Manager] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 15 sec.
2023-01-11T13:46:53,612 [manager.Manager] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 45 sec.
2023-01-11T13:47:38,662 [manager.Manager] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 90 sec.
2023-01-11T13:48:38,713 [manager.Manager] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 150 sec.
2023-01-11T13:49:53,763 [manager.Manager] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 225 sec.
2023-01-11T13:51:23,813 [manager.Manager] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 315 sec.
2023-01-11T13:53:08,864 [manager.Manager] INFO : Blocking for tserver availability - need to reach 3 servers. Have 1 Time spent blocking 420 sec.
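The "Time spent blocking" values in both logs grow by 15, 30, 45, ... seconds between messages. This suggests that each sleep is 15 seconds longer than the previous one, which is an inference from the logs and has not been checked against the Accumulo source. Under that assumption, the following Java sketch reproduces the logged values and shows that the configured 300-second limit falls between the checks at 225 and 315 seconds. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for reasoning about the log output; not Accumulo code.
final class BlockingLogIntervals {

  // Returns the cumulative "time spent blocking" value (in seconds) at each
  // check, assuming every sleep is stepSec longer than the one before it.
  static List<Long> checkpoints(long stepSec, int count) {
    List<Long> out = new ArrayList<>();
    long elapsed = 0;
    for (int i = 0; i < count; i++) {
      out.add(elapsed);
      elapsed += stepSec * (i + 1); // sleeps of 15, 30, 45, ... seconds
    }
    return out;
  }

  public static void main(String[] args) {
    // With a 15-second step, the first eight checks match the manager log.
    System.out.println(checkpoints(15, 8));
  }
}
```

If the timeout were only evaluated at these checkpoints, the earliest it could take effect with a 5-minute limit would be the 315-second check, unless the final sleep is shortened to the time remaining.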
Expected behavior
Given manager.startup.tserver.avail.min.count and manager.startup.tserver.avail.max.wait are set.
When tserver.avail.min.count is not reached by the time specified in tserver.avail.max.wait,
Then the startup process continues on.
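A minimal Java sketch of this expected behavior, not the Accumulo implementation: wait until the minimum count is reached or the maximum wait elapses, whichever comes first, and cap each sleep at the time remaining so the deadline is not overshot. The class, method, and parameter names are hypothetical, and the growing sleep interval is an assumption taken from the spacing of the log messages above.

```java
import java.util.function.IntSupplier;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

// Hypothetical sketch of a deadline-bounded wait; not Accumulo code.
final class TserverWaitSketch {

  // Returns true if minCount servers became available, or false if
  // maxWaitMillis elapsed first. In both cases the caller continues startup.
  static boolean waitForTservers(int minCount, long maxWaitMillis, long sleepStepMillis,
      IntSupplier availableCount, LongSupplier clockMillis, LongConsumer sleeper) {
    long start = clockMillis.getAsLong();
    long sleep = sleepStepMillis;
    while (availableCount.getAsInt() < minCount) {
      long remaining = maxWaitMillis - (clockMillis.getAsLong() - start);
      if (remaining <= 0) {
        return false; // max wait reached: stop blocking
      }
      // Never sleep past the deadline.
      sleeper.accept(Math.min(sleep, remaining));
      sleep += sleepStepMillis; // assumed backoff: 15 s, 30 s, 45 s, ...
    }
    return true;
  }

  public static void main(String[] args) {
    // Simulated clock so the example finishes immediately.
    long[] now = {0};
    boolean reached = waitForTservers(3, 300_000L, 15_000L,
        () -> 1, () -> now[0], ms -> now[0] += ms);
    System.out.println("reached=" + reached + " elapsedMillis=" + now[0]);
  }
}
```

In real use the clock could be System::currentTimeMillis and the sleeper a wrapper around Thread.sleep; they are parameters here only so the loop can be exercised without waiting.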
Additional context
This is related to issue #3157 but is only focused on the expected behavior of the tserver.avail.max.wait property.
About this issue
- State: closed
- Created a year ago
- Comments: 15 (15 by maintainers)
I agree with @EdColeman that it’s really up for interpretation what the property means. My comment was mostly pointing out that, as someone who isn’t too familiar with those properties, it looked to me at first glance like it might make sense for the system to fail and halt if the minimum number of servers is not met in the time frame. If that’s not the intent, that is fine as long as things are well documented.
If the intent is to continue rather than stop, then maybe we should add another property as @dlmarion mentioned. I think the halting case is still valid (someone may want to shut down and not continue starting the cluster if the minimum number of servers doesn’t come online, as there could be a larger issue), so I think it is worthwhile to add, but it could be done as another issue/PR of course.
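To make the two interpretations concrete, here is a hypothetical Java sketch in which the behavior on timeout is selectable. The OnTimeout setting stands in for the additional property suggested above; no such property is known to exist in Accumulo, and all names here are illustrative.

```java
// Hypothetical sketch of a selectable timeout policy; not Accumulo code.
final class StartupTimeoutPolicySketch {

  // Stand-in for a possible new property value.
  enum OnTimeout { CONTINUE, HALT }

  enum Decision { KEEP_WAITING, PROCEED, ABORT }

  // Decides what startup should do at a single check of the tserver count.
  static Decision decide(int available, int minCount, boolean maxWaitElapsed,
      OnTimeout onTimeout) {
    if (available >= minCount) {
      return Decision.PROCEED; // enough servers: no need to wait further
    }
    if (!maxWaitElapsed) {
      return Decision.KEEP_WAITING; // still within the configured max wait
    }
    // Timed out with too few servers: behavior depends on the chosen policy.
    return onTimeout == OnTimeout.HALT ? Decision.ABORT : Decision.PROCEED;
  }

  public static void main(String[] args) {
    System.out.println(decide(1, 3, true, OnTimeout.CONTINUE));
    System.out.println(decide(1, 3, true, OnTimeout.HALT));
  }
}
```

Keeping the decision separate from the sleeping makes it straightforward to document both outcomes and to default to whichever interpretation the maintainers settle on.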
Maybe a loop here would be better to handle multiple conditions. I think that the loop would need to support: