quartz: Job recovery on scheduler startup leaves a simple trigger BLOCKED after failover if its job was executing when the JVM crashed
http://www.quartz-scheduler.org/documentation/faq.html
What is Quartz? … Quartz is fault-tolerant …
However, fault tolerance breaks down in the case described below.
Reproduced with Quartz 2.2.1 and 2.2.3 (other versions were not checked).
Prerequisites
- The job has a simple trigger that repeats execution at a fixed interval (e.g. every minute).
- Concurrent execution is not allowed.
- Recovery is not requested.
- The JVM crashes (or is stopped for maintenance) while the job is executing.
- The downtime is much longer than the trigger’s interval (e.g. more than 2 minutes).
The relevant Quartz tables are TRIGGERS and FIRED_TRIGGERS. Their states directly after the JVM crash are:
TRIGGERS table
TRIGGER_NAME | TRIGGER_GROUP | JOB_NAME | JOB_GROUP | NEXT_FIRE_TIME | PREV_FIRE_TIME | TRIGGER_STATE | TRIGGER_TYPE | START_TIME | MISFIRE_INSTR | SCHED_NAME |
---|---|---|---|---|---|---|---|---|---|---|
test | TestJob | test | TestJob | 1481618100000 | 1481618040000 | BLOCKED | SIMPLE | 1481555640000 | 0 | scheduler |
FIRED_TRIGGERS table
ENTRY_ID | TRIGGER_NAME | TRIGGER_GROUP | INSTANCE_NAME | FIRED_TIME | STATE | JOB_NAME | JOB_GROUP | REQUESTS_RECOVERY | SCHED_TIME | IS_NONCONCURRENT | SCHED_NAME |
---|---|---|---|---|---|---|---|---|---|---|---|
NON_CLUSTERED1481617513489 | test | TestJob | NON_CLUSTERED | 1481618049269 | EXECUTING | test | TestJob | 0 | 1481618040000 | 1 | scheduler |
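The time columns in both tables are epoch milliseconds, which makes the rows hard to read at a glance. A small helper, sketched here with only the JDK (the class and method names are illustrative and not part of Quartz), converts the values from the rows above to UTC:

```java
import java.time.Instant;

// Illustrative helper, not part of Quartz: decodes the epoch-millisecond
// values stored in the TRIGGERS / FIRED_TRIGGERS time columns.
class QuartzTimestamps {

    /** Converts an epoch-millisecond column value to an ISO-8601 UTC string. */
    static String toUtc(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    public static void main(String[] args) {
        // Values copied from the table rows above.
        System.out.println("PREV_FIRE_TIME / SCHED_TIME: " + toUtc(1481618040000L));
        System.out.println("FIRED_TIME:                  " + toUtc(1481618049269L));
        System.out.println("NEXT_FIRE_TIME:              " + toUtc(1481618100000L));
    }
}
```

Read this way, the job was scheduled for 08:34:00 UTC, actually fired about 9 seconds later, and the next run was due at 08:35:00 UTC when the JVM went down.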
The 1st scheduler start after the system crashed (or was stopped for maintenance)
Below are the SQL statements issued by the recoverJobs procedure on scheduler.start(), followed by the resulting states of the TRIGGERS and FIRED_TRIGGERS tables
UPDATE TRIGGERS SET TRIGGER_STATE = 'WAITING'
WHERE SCHED_NAME = 'scheduler' AND (TRIGGER_STATE = 'ACQUIRED' OR TRIGGER_STATE = 'BLOCKED')
-> the trigger has been updated from BLOCKED to WAITING
UPDATE TRIGGERS SET TRIGGER_STATE = 'PAUSED' WHERE SCHED_NAME = 'scheduler' AND (TRIGGER_STATE = 'PAUSED_BLOCKED' OR TRIGGER_STATE = 'PAUSED_BLOCKED')
// not important
SELECT TRIGGER_NAME, TRIGGER_GROUP FROM TRIGGERS
WHERE SCHED_NAME = 'scheduler' AND NOT (MISFIRE_INSTR = -1) AND NEXT_FIRE_TIME < 1481618263782 AND TRIGGER_STATE = 'WAITING'
ORDER BY NEXT_FIRE_TIME ASC, PRIORITY DESC
-> trigger has been selected as misfired in WAITING state
SELECT * FROM TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT * FROM SIMPLE_TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_NAME FROM TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_GROUP FROM PAUSED_TRIGGER_GRPS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_GROUP FROM PAUSED_TRIGGER_GRPS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_GROUP = '_$_ALL_GROUPS_PAUSED_$_'
SELECT * FROM JOB_DETAILS WHERE SCHED_NAME = 'scheduler' AND JOB_NAME = 'test' AND JOB_GROUP = 'TestJob'
// not very important, presumably collects information about the misfired trigger and its job
SELECT * FROM FIRED_TRIGGERS
WHERE SCHED_NAME = 'scheduler' AND JOB_NAME = 'test' AND JOB_GROUP = 'TestJob'
-> fired trigger has been selected for misfired trigger
UPDATE TRIGGERS SET JOB_NAME = 'test', JOB_GROUP = 'TestJob', DESCRIPTION = NULL, NEXT_FIRE_TIME = 1481618340000, PREV_FIRE_TIME = 1481618040000,
TRIGGER_STATE = 'BLOCKED', -- !!!! IMPORTANT !!!!
TRIGGER_TYPE = 'SIMPLE', START_TIME = 1481555640000, END_TIME = 0, CALENDAR_NAME = NULL, MISFIRE_INSTR = 0, PRIORITY = 5, JOB_DATA = '<byte[]>'
WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
-> (IMPORTANT) the trigger is both misfired and fired (its job was executing when the JVM crashed/stopped),
so it has been put back into the BLOCKED state on scheduler start
UPDATE SIMPLE_TRIGGERS SET REPEAT_COUNT = -1, REPEAT_INTERVAL = 60000, TIMES_TRIGGERED = 1045 WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
// not important
SELECT * FROM FIRED_TRIGGERS WHERE SCHED_NAME = 'scheduler' AND INSTANCE_NAME = 'NON_CLUSTERED' AND REQUESTS_RECOVERY = 1
// not important, presumably handles triggers that requested recovery
SELECT TRIGGER_NAME, TRIGGER_GROUP FROM TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_STATE = 'COMPLETE'
// not important, presumably selects stale triggers for removal
DELETE FROM FIRED_TRIGGERS WHERE SCHED_NAME = 'scheduler'
-> fired triggers are removed
TRIGGERS table
TRIGGER_NAME | TRIGGER_GROUP | JOB_NAME | JOB_GROUP | NEXT_FIRE_TIME | PREV_FIRE_TIME | TRIGGER_STATE | TRIGGER_TYPE | START_TIME | MISFIRE_INSTR | SCHED_NAME |
---|---|---|---|---|---|---|---|---|---|---|
test | TestJob | test | TestJob | 1481618340000 | 1481618040000 | BLOCKED | SIMPLE | 1481555640000 | 0 | scheduler |
FIRED_TRIGGERS table
No rows
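The NEXT_FIRE_TIME written by the UPDATE in the trace above (1481618340000) can be reproduced with simple arithmetic. The sketch below rests on two assumptions that are not stated in the trace itself: the cutoff in the misfire SELECT is "now minus the misfire threshold", with the threshold at Quartz's default of 60000 ms (`org.quartz.jobStore.misfireThreshold`), and the trigger is rescheduled onto its 60-second grid anchored at START_TIME. The class and method names are hypothetical.

```java
// Illustrative arithmetic only, not Quartz code.
class MisfireArithmetic {

    /** Index of the first repeat slot strictly after nowMillis on the trigger's fixed grid. */
    static long nextSlotIndex(long startMillis, long intervalMillis, long nowMillis) {
        return Math.floorDiv(nowMillis - startMillis, intervalMillis) + 1;
    }

    /** Fire time of that slot. */
    static long nextFireTime(long startMillis, long intervalMillis, long nowMillis) {
        return startMillis + nextSlotIndex(startMillis, intervalMillis, nowMillis) * intervalMillis;
    }

    public static void main(String[] args) {
        long start = 1481555640000L;       // START_TIME from the TRIGGERS row
        long interval = 60000L;            // REPEAT_INTERVAL from SIMPLE_TRIGGERS
        long misfireCutoff = 1481618263782L; // literal from the misfire SELECT
        long assumedThreshold = 60000L;    // ASSUMPTION: default misfire threshold
        long now = misfireCutoff + assumedThreshold;

        System.out.println("slot index:     " + nextSlotIndex(start, interval, now));
        System.out.println("next fire time: " + nextFireTime(start, interval, now));
    }
}
```

Under these assumptions the slot index comes out as 1045 and the fire time as 1481618340000, matching the TIMES_TRIGGERED and NEXT_FIRE_TIME values in the trace. So the fire time itself is rescheduled sensibly; only the TRIGGER_STATE is wrong.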
Problem
The job has a repeating trigger, but the trigger is in the BLOCKED state: it will not fire and the job will not be executed until the JVM is restarted again
The 2nd scheduler start (just for test purposes)
Below are the SQL statements issued by the recoverJobs procedure on scheduler.start(), followed by the resulting states of the TRIGGERS and FIRED_TRIGGERS tables
UPDATE TRIGGERS SET TRIGGER_STATE = 'WAITING'
WHERE SCHED_NAME = 'scheduler' AND (TRIGGER_STATE = 'ACQUIRED' OR TRIGGER_STATE = 'BLOCKED')
-> the trigger has been updated from BLOCKED to WAITING
UPDATE TRIGGERS SET TRIGGER_STATE = 'PAUSED' WHERE SCHED_NAME = 'scheduler' AND (TRIGGER_STATE = 'PAUSED_BLOCKED' OR TRIGGER_STATE = 'PAUSED_BLOCKED')
// not important
SELECT TRIGGER_NAME, TRIGGER_GROUP FROM TRIGGERS
WHERE SCHED_NAME = 'scheduler' AND NOT (MISFIRE_INSTR = -1) AND NEXT_FIRE_TIME < 1481621819212 AND TRIGGER_STATE = 'WAITING'
ORDER BY NEXT_FIRE_TIME ASC, PRIORITY DESC
-> trigger has been selected as misfired in WAITING state
SELECT * FROM TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT * FROM SIMPLE_TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_NAME FROM TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_GROUP FROM PAUSED_TRIGGER_GRPS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_GROUP FROM PAUSED_TRIGGER_GRPS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_GROUP = '_$_ALL_GROUPS_PAUSED_$_'
SELECT * FROM JOB_DETAILS WHERE SCHED_NAME = 'scheduler' AND JOB_NAME = 'test' AND JOB_GROUP = 'TestJob'
// not very important, presumably collects information about the misfired trigger and its job
SELECT * FROM FIRED_TRIGGERS
WHERE SCHED_NAME = 'scheduler' AND JOB_NAME = 'test' AND JOB_GROUP = 'TestJob'
-> this time no fired trigger is found for the misfired trigger (FIRED_TRIGGERS was emptied by the previous start)
UPDATE TRIGGERS SET JOB_NAME = 'test', JOB_GROUP = 'TestJob', DESCRIPTION = NULL, NEXT_FIRE_TIME = 1481621880000, PREV_FIRE_TIME = 1481618040000,
TRIGGER_STATE = 'WAITING', -- !!! OK without fired trigger !!!
TRIGGER_TYPE = 'SIMPLE', START_TIME = 1481555640000, END_TIME = 0, CALENDAR_NAME = NULL, MISFIRE_INSTR = 0, PRIORITY = 5, JOB_DATA = '<byte[]>'
WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
-> now it is OK because there is no fired trigger
UPDATE SIMPLE_TRIGGERS SET REPEAT_COUNT = -1, REPEAT_INTERVAL = 60000, TIMES_TRIGGERED = 1104 WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT * FROM FIRED_TRIGGERS WHERE SCHED_NAME = 'scheduler' AND INSTANCE_NAME = 'NON_CLUSTERED' AND REQUESTS_RECOVERY = 1
SELECT TRIGGER_NAME, TRIGGER_GROUP FROM TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_STATE = 'COMPLETE'
DELETE FROM FIRED_TRIGGERS WHERE SCHED_NAME = 'scheduler'
// not important
Result
The job with the repeating trigger is executed according to the trigger definition.
However, restarting the JVM twice is NOT an acceptable workaround: other jobs/triggers can end up in the same situation during the second restart.
Setting "requests recovery" to true is not a workaround either: in our case we definitely do not need recovery, but the job must keep executing at the trigger interval after a JVM restart.
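The two startup traces can be condensed into a toy state model. This is only a sketch of the statement order observed in the SQL logs, not Quartz's actual implementation. The class, the enum, and the `misfired` flag are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the observed startup trace; NOT Quartz code.
class RecoveryTraceModel {
    enum State { WAITING, BLOCKED }

    State triggerState;
    // Stands in for the FIRED_TRIGGERS rows of the (non-concurrent) job.
    final List<String> firedTriggers = new ArrayList<>();

    RecoveryTraceModel(State initial, String... firedRows) {
        this.triggerState = initial;
        this.firedTriggers.addAll(Arrays.asList(firedRows));
    }

    /** Replays the three relevant steps of the trace for one scheduler start. */
    State recoverOnStart(boolean misfired) {
        // Step 1: UPDATE TRIGGERS SET TRIGGER_STATE = 'WAITING' WHERE ... 'ACQUIRED' OR 'BLOCKED'
        if (triggerState == State.BLOCKED) {
            triggerState = State.WAITING;
        }
        // Step 2: a misfired trigger is stored again; as observed in the trace,
        // a leftover fired row for its job makes the stored state BLOCKED.
        if (misfired) {
            triggerState = firedTriggers.isEmpty() ? State.WAITING : State.BLOCKED;
        }
        // Step 3: DELETE FROM FIRED_TRIGGERS (too late for this start)
        firedTriggers.clear();
        return triggerState;
    }

    public static void main(String[] args) {
        RecoveryTraceModel crashed =
                new RecoveryTraceModel(State.BLOCKED, "NON_CLUSTERED1481617513489");
        System.out.println("1st start: " + crashed.recoverOnStart(true));
        System.out.println("2nd start: " + crashed.recoverOnStart(true));

        // Same crash state, but the leftover row is removed before startup.
        RecoveryTraceModel cleaned =
                new RecoveryTraceModel(State.BLOCKED, "NON_CLUSTERED1481617513489");
        cleaned.firedTriggers.clear();
        System.out.println("cleaned first: " + cleaned.recoverOnStart(true));
    }
}
```

The model makes the ordering problem visible: the leftover FIRED_TRIGGERS row is consulted in step 2 but only deleted in step 3, so the first start re-blocks the trigger and only a second start (or an empty FIRED_TRIGGERS table) lets it reach WAITING.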
A possible workaround (until this is fixed in Quartz) is to run the following on JVM startup, before the scheduler is started:
if (!scheduler.getMetaData().isJobStoreClustered()) {
    // delete all rows from FIRED_TRIGGERS
    // that do not request recovery
    DELETE FROM FIRED_TRIGGERS
    WHERE SCHED_NAME = scheduler.name AND REQUESTS_RECOVERY = 0
}
scheduler.start()
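The pseudocode above mixes Java and SQL. A minimal JDBC sketch of the cleanup step might look like the following. The class and method names are hypothetical, and the table prefix is a parameter because the tables in this report appear unprefixed while many installations use a prefix such as `QRTZ_`. How REQUESTS_RECOVERY is stored depends on the database schema, so binding it as a boolean is an assumption that may need adjusting.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper for the workaround; not part of Quartz.
class FiredTriggerCleanup {

    /** Builds the DELETE statement; tablePrefix must come from trusted configuration. */
    static String buildDeleteSql(String tablePrefix) {
        return "DELETE FROM " + tablePrefix
                + "FIRED_TRIGGERS WHERE SCHED_NAME = ? AND REQUESTS_RECOVERY = ?";
    }

    /**
     * Deletes leftover fired-trigger rows that do not request recovery.
     * Intended to run before scheduler.start(), and only for a non-clustered job store.
     * Returns the number of rows removed.
     */
    static int deleteNonRecoverableFiredTriggers(Connection conn, String tablePrefix,
                                                 String schedName) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(buildDeleteSql(tablePrefix))) {
            ps.setString(1, schedName);
            // ASSUMPTION: the column accepts a boolean bind; some schemas store it differently.
            ps.setBoolean(2, false);
            return ps.executeUpdate();
        }
    }
}
```

As in the pseudocode, the call belongs inside the `!scheduler.getMetaData().isJobStoreClustered()` check and before `scheduler.start()`; in a clustered setup the rows may belong to other live instances.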
See pull request #94
About this issue
- State: closed
- Created 8 years ago
- Reactions: 11
- Comments: 25 (2 by maintainers)
Commits related to this issue
- issue #93 - Fix the jobs recovering (on scheduler startup) blocks simple trigger after failover situation for job which was executing during JVM crash — committed to EugeneGoroschenyaOld/quartz by deleted user 8 years ago
- issue #93 - Fix the jobs recovering (on scheduler startup) blocks simple trigger after failover situation for job which was executing during JVM crash (cherry picked from commit 1afb695) — committed to OpusCapita/quartz by deleted user 8 years ago
I have the same problem
Dear @jhouserizer and @zemian, I kindly ask for feedback to the question from @egoroschenya-sc regarding an official quartz-2.2.4 bugfix version containing the fix. It’s a little more than 1 year since quartz 2.2.3 release.
The issue causes problems with reliability in our software in production environments. Please provide feedback at least, whether and when a release can be expected. We need to decide short term to either wait for it, or deal with the issue on our own in a different way.
I have applied the PR to both quartz-2.2.x and master now. Thanks for everyone in helping out here!
that bug is really annoying, please apply PR as soon as you can