quartz: Job recovery on scheduler startup leaves a simple trigger BLOCKED after failover if its job was executing during the JVM crash

http://www.quartz-scheduler.org/documentation/faq.html

What is Quartz? … Quartz is fault-tolerant …

However, this fault tolerance breaks down in some cases.

Reproduced with Quartz 2.2.1 and 2.2.3 (other versions were not checked).

Prerequisites

  • The job has a simple trigger that repeats execution at a fixed interval (e.g. every minute).
  • Concurrent execution is not allowed.
  • Recovery is not requested.

The JVM crashes during a job execution (or is stopped for maintenance during a job execution).

The downtime is much longer than the trigger’s interval (e.g. > 2 minutes).

The important Quartz tables are TRIGGERS and FIRED_TRIGGERS. Their states directly after the JVM crash are:

TRIGGERS table

TRIGGER_NAME  TRIGGER_GROUP  JOB_NAME  JOB_GROUP  NEXT_FIRE_TIME  PREV_FIRE_TIME  TRIGGER_STATE  TRIGGER_TYPE  START_TIME     MISFIRE_INSTR  SCHED_NAME
test          TestJob        test      TestJob    1481618100000   1481618040000   BLOCKED        SIMPLE        1481555640000  0              scheduler

FIRED_TRIGGERS table

ENTRY_ID                    TRIGGER_NAME  TRIGGER_GROUP  INSTANCE_NAME  FIRED_TIME     STATE      JOB_NAME  JOB_GROUP  REQUESTS_RECOVERY  SCHED_TIME     IS_NONCONCURRENT  SCHED_NAME
NON_CLUSTERED1481617513489  test          TestJob        NON_CLUSTERED  1481618049269  EXECUTING  test      TestJob    0                  1481618040000  1                 scheduler
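
The time columns in both tables are epoch milliseconds, which makes the crash timeline hard to read at a glance. The following is a minimal sketch (plain JDK, no Quartz dependency) that decodes the values shown above to UTC; the class and method names are made up for illustration.

```java
import java.time.Instant;

final class CrashTimeline {
    /** Renders an epoch-millisecond value as an ISO-8601 UTC instant. */
    static String utc(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    public static void main(String[] args) {
        long startTime = 1481555640000L;    // TRIGGERS.START_TIME
        long prevFireTime = 1481618040000L; // TRIGGERS.PREV_FIRE_TIME, same as FIRED_TRIGGERS.SCHED_TIME
        long firedTime = 1481618049269L;    // FIRED_TRIGGERS.FIRED_TIME
        long nextFireTime = 1481618100000L; // TRIGGERS.NEXT_FIRE_TIME

        System.out.println("START_TIME     " + utc(startTime));
        System.out.println("PREV_FIRE_TIME " + utc(prevFireTime));
        System.out.println("FIRED_TIME     " + utc(firedTime));
        System.out.println("NEXT_FIRE_TIME " + utc(nextFireTime));
        // Difference between two consecutive scheduled fire times, in milliseconds.
        System.out.println("interval ms    " + (nextFireTime - prevFireTime));
    }
}
```

Decoded this way, the job was scheduled for 08:34:00 UTC on 2016-12-13, actually fired about nine seconds later, and was still marked EXECUTING when the JVM went down; the next fire would have been due at 08:35:00.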

The 1st scheduler start after the system crashed (or was stopped for maintenance)

Below are the SQL statements issued by the recoverJobs procedure on scheduler.start(), followed by the resulting states of the TRIGGERS and FIRED_TRIGGERS tables.

UPDATE TRIGGERS SET TRIGGER_STATE = 'WAITING'
  WHERE SCHED_NAME = 'scheduler' AND (TRIGGER_STATE = 'ACQUIRED' OR TRIGGER_STATE = 'BLOCKED')
-> the trigger has been updated from BLOCKED to WAITING

UPDATE TRIGGERS SET TRIGGER_STATE = 'PAUSED' WHERE SCHED_NAME = 'scheduler' AND (TRIGGER_STATE = 'PAUSED_BLOCKED' OR TRIGGER_STATE = 'PAUSED_BLOCKED')
// not important

SELECT TRIGGER_NAME, TRIGGER_GROUP FROM TRIGGERS
  WHERE SCHED_NAME = 'scheduler' AND NOT (MISFIRE_INSTR = -1) AND NEXT_FIRE_TIME < 1481618263782 AND TRIGGER_STATE = 'WAITING'
  ORDER BY NEXT_FIRE_TIME ASC, PRIORITY DESC
-> the trigger, now in WAITING state, has been selected as misfired

SELECT * FROM TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT * FROM SIMPLE_TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_NAME FROM TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_GROUP FROM PAUSED_TRIGGER_GRPS  WHERE SCHED_NAME = 'scheduler' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_GROUP FROM PAUSED_TRIGGER_GRPS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_GROUP = '_$_ALL_GROUPS_PAUSED_$_'
SELECT * FROM JOB_DETAILS WHERE SCHED_NAME = 'scheduler' AND JOB_NAME = 'test' AND JOB_GROUP = 'TestJob'
// not very important; presumably this collects information about the misfired trigger and its job

SELECT * FROM FIRED_TRIGGERS
  WHERE SCHED_NAME = 'scheduler' AND JOB_NAME = 'test' AND JOB_GROUP = 'TestJob'
-> the fired trigger left over from the crash has been selected for the misfired trigger

UPDATE TRIGGERS SET JOB_NAME = 'test', JOB_GROUP = 'TestJob', DESCRIPTION = NULL, NEXT_FIRE_TIME = 1481618340000, PREV_FIRE_TIME = 1481618040000,
  TRIGGER_STATE = 'BLOCKED', -- !!!! IMPORTANT !!!!
  TRIGGER_TYPE = 'SIMPLE', START_TIME = 1481555640000, END_TIME = 0, CALENDAR_NAME = NULL, MISFIRE_INSTR = 0, PRIORITY = 5, JOB_DATA = '<byte[]>'
  WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
-> (IMPORTANT) the trigger, which is both misfired and fired (because its job was executing when the JVM crashed/stopped),
    has been set back to the BLOCKED state on scheduler start

UPDATE SIMPLE_TRIGGERS SET REPEAT_COUNT = -1, REPEAT_INTERVAL = 60000, TIMES_TRIGGERED = 1045  WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
// not important

SELECT * FROM FIRED_TRIGGERS WHERE SCHED_NAME = 'scheduler' AND INSTANCE_NAME = 'NON_CLUSTERED' AND REQUESTS_RECOVERY = 1
// not important; presumably this handles triggers that requested recovery

SELECT TRIGGER_NAME, TRIGGER_GROUP FROM TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_STATE = 'COMPLETE'
// not important; presumably this selects stale triggers for removal

DELETE FROM FIRED_TRIGGERS WHERE SCHED_NAME = 'scheduler'
-> all fired triggers are removed

TRIGGERS table

TRIGGER_NAME  TRIGGER_GROUP  JOB_NAME  JOB_GROUP  NEXT_FIRE_TIME  PREV_FIRE_TIME  TRIGGER_STATE  TRIGGER_TYPE  START_TIME     MISFIRE_INSTR  SCHED_NAME
test          TestJob        test      TestJob    1481618100000   1481618040000   BLOCKED        SIMPLE        1481555640000  0              scheduler

FIRED_TRIGGERS table

No rows

Problem

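The job has a repeating trigger, but the trigger is in the BLOCKED state while FIRED_TRIGGERS is empty, so no running execution remains whose completion would unblock it. The trigger will not fire and the job will not be executed, at least until the JVM is restarted again.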
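
To make the sequence easier to follow, here is a toy model of the statements in the trace above. It is not Quartz code: the class, its fields, and the single-trigger simplification are assumptions made for illustration, and it only mirrors what the SQL log shows (reset to WAITING, misfire update that depends on a leftover FIRED_TRIGGERS row, then deletion of all fired rows). It also assumes the trigger counts as misfired on every start, as in the scenario described.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy replay of the startup trace for one trigger of a non-concurrent job. */
final class RecoveryTraceModel {
    enum State { WAITING, ACQUIRED, BLOCKED }

    State triggerState;
    // Entry ids of FIRED_TRIGGERS rows for the trigger's job; only their presence matters.
    final List<String> firedTriggers = new ArrayList<>();

    RecoveryTraceModel(State stateAfterCrash, String leftoverFiredEntryId) {
        this.triggerState = stateAfterCrash;
        if (leftoverFiredEntryId != null) {
            firedTriggers.add(leftoverFiredEntryId);
        }
    }

    /** Replays the statements observed on scheduler.start(). */
    void startScheduler() {
        // UPDATE TRIGGERS SET TRIGGER_STATE = 'WAITING' WHERE ... 'ACQUIRED' OR ... 'BLOCKED'
        if (triggerState == State.ACQUIRED || triggerState == State.BLOCKED) {
            triggerState = State.WAITING;
        }
        // Misfire handling: the trigger row is rewritten, and the stored state
        // depends on whether a fired row for the job still exists.
        if (triggerState == State.WAITING) {
            triggerState = firedTriggers.isEmpty() ? State.WAITING : State.BLOCKED;
        }
        // DELETE FROM FIRED_TRIGGERS WHERE SCHED_NAME = 'scheduler'
        firedTriggers.clear();
    }

    public static void main(String[] args) {
        RecoveryTraceModel m =
                new RecoveryTraceModel(State.BLOCKED, "NON_CLUSTERED1481617513489");
        m.startScheduler();
        System.out.println("after 1st start: " + m.triggerState);
        m.startScheduler();
        System.out.println("after 2nd start: " + m.triggerState);
    }
}
```

In this model the first start ends with the trigger BLOCKED and no fired rows left, and only a second start returns it to WAITING, which matches the behaviour reported in this issue.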

The 2nd scheduler start (just for test purposes)

Below are the SQL statements issued by the recoverJobs procedure on scheduler.start(), followed by the result.

UPDATE TRIGGERS SET TRIGGER_STATE = 'WAITING'
  WHERE SCHED_NAME = 'scheduler' AND (TRIGGER_STATE = 'ACQUIRED' OR TRIGGER_STATE = 'BLOCKED')
-> the trigger has been updated from BLOCKED to WAITING

UPDATE TRIGGERS SET TRIGGER_STATE = 'PAUSED' WHERE SCHED_NAME = 'scheduler' AND (TRIGGER_STATE = 'PAUSED_BLOCKED' OR TRIGGER_STATE = 'PAUSED_BLOCKED')
// not important

SELECT TRIGGER_NAME, TRIGGER_GROUP FROM TRIGGERS
  WHERE SCHED_NAME = 'scheduler' AND NOT (MISFIRE_INSTR = -1) AND NEXT_FIRE_TIME < 1481621819212 AND TRIGGER_STATE = 'WAITING'
  ORDER BY NEXT_FIRE_TIME ASC, PRIORITY DESC
-> the trigger, now in WAITING state, has been selected as misfired

SELECT * FROM TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT * FROM SIMPLE_TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_NAME FROM TRIGGERS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_GROUP FROM PAUSED_TRIGGER_GRPS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_GROUP = 'TestJob'
SELECT TRIGGER_GROUP FROM PAUSED_TRIGGER_GRPS WHERE SCHED_NAME = 'scheduler' AND TRIGGER_GROUP = '_$_ALL_GROUPS_PAUSED_$_'
SELECT * FROM JOB_DETAILS WHERE SCHED_NAME = 'scheduler' AND JOB_NAME = 'test' AND JOB_GROUP = 'TestJob'
// not very important; presumably this collects information about the misfired trigger and its job

SELECT * FROM FIRED_TRIGGERS
  WHERE SCHED_NAME = 'scheduler' AND JOB_NAME = 'test' AND JOB_GROUP = 'TestJob'
-> no fired trigger is found this time (FIRED_TRIGGERS was emptied by the first start)

UPDATE TRIGGERS SET JOB_NAME = 'test', JOB_GROUP = 'TestJob', DESCRIPTION = NULL, NEXT_FIRE_TIME = 1481621880000, PREV_FIRE_TIME = 1481618040000,
  TRIGGER_STATE = 'WAITING', -- !!! OK without fired trigger !!!
  TRIGGER_TYPE = 'SIMPLE', START_TIME = 1481555640000, END_TIME = 0, CALENDAR_NAME = NULL, MISFIRE_INSTR = 0, PRIORITY = 5, JOB_DATA = '<byte[]>'
  WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
-> now it is OK because there is no fired trigger

UPDATE SIMPLE_TRIGGERS SET REPEAT_COUNT = -1, REPEAT_INTERVAL = 60000, TIMES_TRIGGERED = 1104 WHERE SCHED_NAME = 'scheduler' AND TRIGGER_NAME = 'test' AND TRIGGER_GROUP = 'TestJob'
SELECT * FROM FIRED_TRIGGERS WHERE SCHED_NAME = 'scheduler' AND INSTANCE_NAME = 'NON_CLUSTERED' AND REQUESTS_RECOVERY = 1
SELECT TRIGGER_NAME, TRIGGER_GROUP FROM TRIGGERS  WHERE SCHED_NAME = 'scheduler' AND TRIGGER_STATE = 'COMPLETE'
DELETE FROM FIRED_TRIGGERS WHERE SCHED_NAME = 'scheduler'
// not important
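
The NEXT_FIRE_TIME and TIMES_TRIGGERED values written in both startup traces can be cross-checked with a little arithmetic. The sketch below assumes that the cutoff in the misfire SELECT is the current time minus a misfire threshold of 60000 ms (Quartz's default for `org.quartz.jobStore.misfireThreshold`; the issue does not state the configured value), and that the new fire time is the first point after "now" on the grid START_TIME + k × REPEAT_INTERVAL. The helper names are hypothetical.

```java
final class MisfireArithmetic {
    /** First grid point startTime + k * intervalMs that lies strictly after now (now >= startTime). */
    static long nextFireAfter(long startTime, long intervalMs, long now) {
        long elapsedIntervals = (now - startTime) / intervalMs;
        return startTime + (elapsedIntervals + 1) * intervalMs;
    }

    /** Number of whole intervals between startTime and fireTime. */
    static long intervalsSinceStart(long startTime, long intervalMs, long fireTime) {
        return (fireTime - startTime) / intervalMs;
    }

    public static void main(String[] args) {
        long startTime = 1481555640000L;     // TRIGGERS.START_TIME
        long interval = 60000L;              // SIMPLE_TRIGGERS.REPEAT_INTERVAL
        long assumedThreshold = 60000L;      // ASSUMPTION: default misfire threshold
        long[] cutoffs = {1481618263782L, 1481621819212L}; // NEXT_FIRE_TIME < ... in the two SELECTs

        for (long cutoff : cutoffs) {
            long now = cutoff + assumedThreshold;
            long next = nextFireAfter(startTime, interval, now);
            System.out.println(next + " " + intervalsSinceStart(startTime, interval, next));
        }
    }
}
```

Under these assumptions the computed fire times are 1481618340000 and 1481621880000, and the interval counts are 1045 and 1104, which agree with the NEXT_FIRE_TIME and TIMES_TRIGGERED values in the two UPDATE statements. So the misfire handling itself reschedules the trigger sensibly on both starts; only the stored TRIGGER_STATE differs.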

Result

The job with the repeating trigger is executed according to the trigger definition.

BUT I believe restarting the JVM twice is NOT a workaround for this problem. Other jobs/triggers can end up in the same situation during the second restart.

Setting request recovery to true is not a workaround either. In our case we definitely do not need a recovery request, but the job must be executed according to the trigger interval after a JVM restart.

A possible workaround (until this is fixed in Quartz) is:

  // On JVM startup, before the scheduler is started:
  if (!scheduler.getMetaData().isJobStoreClustered()) {
    // Delete all rows from FIRED_TRIGGERS that do not request recovery
    // (pseudocode: run this statement against the Quartz database).
    DELETE FROM FIRED_TRIGGERS
      WHERE SCHED_NAME = scheduler.getSchedulerName() AND REQUESTS_RECOVERY = 0
  }
  scheduler.start();
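
As a rough illustration, the cleanup step of this workaround could be written with plain JDBC along the following lines. This is a sketch under assumptions, not a tested fix: the class and method names are hypothetical, the caller is expected to supply the connection, the scheduler name, and the clustered flag (in a real application these would presumably come from the application's DataSource, `scheduler.getSchedulerName()`, and `scheduler.getMetaData().isJobStoreClustered()`), and the table prefix parameter is an assumption because the trace above shows unprefixed table names. REQUESTS_RECOVERY is bound as a boolean rather than the literal 0, since the column type differs between database schemas.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class FiredTriggersCleanup {
    /** Builds the DELETE statement; tablePrefix is "" for the unprefixed tables in the trace. */
    static String deleteSql(String tablePrefix) {
        return "DELETE FROM " + tablePrefix + "FIRED_TRIGGERS"
                + " WHERE SCHED_NAME = ? AND REQUESTS_RECOVERY = ?";
    }

    /**
     * Deletes FIRED_TRIGGERS rows that do not request recovery.
     * Does nothing for a clustered job store, as in the workaround described above.
     * Returns the number of deleted rows.
     */
    static int deleteNonRecoveringFiredTriggers(Connection conn, String tablePrefix,
            String schedulerName, boolean jobStoreClustered) throws SQLException {
        if (jobStoreClustered) {
            return 0;
        }
        try (PreparedStatement ps = conn.prepareStatement(deleteSql(tablePrefix))) {
            ps.setString(1, schedulerName);
            ps.setBoolean(2, false);
            return ps.executeUpdate();
        }
    }
}
```

The call would have to happen before `scheduler.start()`, so that the recovery procedure no longer finds the leftover row for the job that was executing at crash time.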

See pull request #94

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 11
  • Comments: 25 (2 by maintainers)

Most upvoted comments

I have the same problem

Dear @jhouserizer and @zemian, I kindly ask for feedback on the question from @egoroschenya-sc regarding an official quartz-2.2.4 bugfix version containing the fix. It has been a little more than a year since the Quartz 2.2.3 release.

The issue causes reliability problems in our software in production environments. Please at least let us know whether and when a release can be expected. We need to decide in the short term whether to wait for it or to deal with the issue on our own in a different way.

I have applied the PR to both quartz-2.2.x and master now. Thanks to everyone for helping out here!

That bug is really annoying; please apply the PR as soon as you can.