st2: Death of st2actionrunner process causes action to remain running forever
SUMMARY
Using StackStorm 3.0.1, if something kills the st2actionrunner process supervising a python-script action, and that action execution is part of a workflow execution, the action execution remains in the running state forever, regardless of any timeout set in the task's parameters in the workflow.
What I'd like to see is the action being rescheduled to another st2actionrunner, or at the very least timed out so that a retry in the workflow can deal with the problem.
(It is also not clear how StackStorm deals with the death of an st2actionrunner supervising an orquesta action runner.)
This is not an HA setup, but nothing in the code or documentation suggests that the expected behavior is for a workflow execution to simply hang when the underlying action runner supervisor process is gone. Imagine a machine in an HA setup crashing while workflows are executing actions on it: every workflow whose actions were running there would hang, never even timing out.
We expect to be able to run StackStorm for weeks on end, with long-running workflows that survive the death or reboot of a machine that is part of the StackStorm cluster.
OS / ENVIRONMENT / INSTALL METHOD
Standard non-HA recommended setup in Ubuntu 16.04
STEPS TO REPRODUCE
1. Create a workflow with one Python action that runs `sleep 60` via subprocess (see the sketch below).
2. Start the workflow with `st2 run`.
3. Kill the st2actionrunner process supervising the Python action.
4. Wait forever.
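For reference, a minimal sketch of such a python-script action. The file name, class name, and `seconds` parameter are hypothetical; the only thing that matters for the reproduction is that the action blocks in a child process long enough for you to `kill -9` the supervising st2actionrunner.

```python
# sleep_action.py -- sketch of the python-script action used to reproduce the
# hang. Names here are hypothetical; any action that blocks in a subprocess
# will do.
import subprocess

from st2common.runners.base_action import Action


class SleepAction(Action):
    def run(self, seconds=60):
        # Block in a child process so the execution stays "running" while the
        # supervising st2actionrunner process is killed (e.g. kill -9 <pid>).
        subprocess.check_call(["sleep", str(seconds)])
        return True
```

Reference the action from a one-task Orquesta workflow (optionally with a timeout on the task, as mentioned in the summary) and start it with `st2 run`.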
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 22 (12 by maintainers)
Hi there - checking whether this is still on the roadmap for any upcoming release? This requirement is especially significant in the stackstorm-ha world, where nodes/pods get killed and restarted far more often in Kubernetes than in the traditional deployment model.
Looks like there are similar settings for the workflowengine. Again, you will also have to set terminationGracePeriodSeconds in your chart to a sane value.
Looks like you also have to increase terminationGracePeriodSeconds in your chart; the default is 30 seconds.
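For reference, terminationGracePeriodSeconds is a standard Kubernetes pod-spec field; exactly where the stackstorm-ha chart lets you override it depends on the chart version, so the snippet below sketches the rendered pod spec rather than the chart's values schema.

```yaml
# Sketch only: shows where terminationGracePeriodSeconds sits in the pod spec.
# Image name/tag and labels are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: st2actionrunner
spec:
  selector:
    matchLabels:
      app: st2actionrunner
  template:
    metadata:
      labels:
        app: st2actionrunner
    spec:
      # Allow more than the 30s default so the runner can drain in-flight
      # actions before Kubernetes follows SIGTERM with SIGKILL.
      terminationGracePeriodSeconds: 600
      containers:
        - name: st2actionrunner
          image: stackstorm/st2actionrunner
```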
If the kill signal is sent to an actionrunner, it should wait until the action finishes if you have graceful shutdown on. Do you have graceful shutdown enabled in the config? There is also an exit timeout setting and a sleep delay setting.
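As a sketch, those settings live in the actionrunner section of /etc/st2/st2.conf. The option names below are my recollection of recent st2 releases; verify them against your version's st2.conf.sample before relying on them.

```ini
# Sketch of the relevant /etc/st2/st2.conf settings (names assumed, not
# confirmed against every release).
[actionrunner]
# Finish in-flight executions on SIGTERM/SIGINT instead of abandoning them.
graceful_shutdown = True
# Hard limit (seconds) to wait for running executions before exiting anyway.
exit_still_active_check = 300
# Sleep (seconds) between checks for still-running executions.
still_active_check_interval = 2
```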
Also, I think the action runner and workflow engine need to support a "warm shutdown" on receiving a TERM signal. The idea is that they should finish their work before exiting, minimizing orphaned actions or lost workflow state.
For the workflow engine, this may mean initiating a pausing/paused transition before shutting down the process.
For the action runner, this may mean that it stops accepting any new work and completes its currently running work before exiting (with a hard timeout value); see the sketch below.
We use this type of behavior for Celery workers today. See: http://docs.celeryproject.org/en/master/userguide/workers.html#stopping-the-worker
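As a rough illustration of the drain-then-exit behaviour being proposed (not StackStorm's or Celery's actual implementation), a generic Python sketch:

```python
# Generic "warm shutdown" sketch: on SIGTERM stop accepting new work, let
# in-flight work finish, and enforce a hard timeout before exiting.
import queue
import signal
import threading
import time

HARD_TIMEOUT = 300        # max seconds to wait for in-flight work on shutdown
shutting_down = threading.Event()
work_queue = queue.Queue()


def handle_sigterm(signum, frame):
    shutting_down.set()   # stop pulling new work; running work continues


def worker():
    while not shutting_down.is_set():
        try:
            job = work_queue.get(timeout=1)
        except queue.Empty:
            continue
        job()             # run the current job to completion


signal.signal(signal.SIGTERM, handle_sigterm)
worker_thread = threading.Thread(target=worker)
worker_thread.start()

while not shutting_down.is_set():   # main loop; sleep is signal-interruptible
    time.sleep(1)

worker_thread.join(timeout=HARD_TIMEOUT)   # drain, but never wait forever
```

The hard timeout is what keeps a node drainable within a bounded window even if an action never completes.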
This would be adequate for our use cases. Otherwise, Orquesta basically makes it impossible to put the machines running the workflow engine into maintenance mode.
We would prefer the workflow's complete state (including published variables and current threads) to be captured persistently in the database, so that the workflow can restart if the workflow engine is moved to a different box. This is essentially what Jenkins does with pipelines when the master restarts: it persists the state of the pipelines, then, when it reconnects to its slaves, it catches up with what they were doing.