st2: Mistral: 2 bugs in one: workflow timeout/cancellation and wait-before

Hi,

Discovered two bugs while I was trying to reproduce only one 😃

Running latest st2 v2.2.0.

  1. Something is off with cancellation and timed-out tasks when the nested tasks are Mistral workflows. If you cancel the parent, the children are not affected. Also, cancelling them sometimes causes all sorts of unexpected results, like chatops notifications being triggered indefinitely. In my example the parent task times out, but the child keeps running over and over. If you remove the wait-before parameter, the child finishes once all retries are exhausted. Still, that doesn’t make it a valid workaround.

  2. Adding wait-before to a task causes it to re-initialize previously published variables (at least it feels that way).

Here are simple workflows and an alias to reproduce it (sorry for the names I gave them):

wf_cancelation_issue.meta.yaml:

---
name: wf_cancelation_issue
parameters:
  skip_notify:
    default:
      - task
      - error
      - success
    type: array
    description: List of tasks to skip notifications for.
  task:
    type: string
    description: The name of the task to run for reverse workflow.
  workflow:
    type: string
    description: The name of the workflow to run if the entry_point is a workbook
      of many workflows. The name should be in the format "<pack_name>.<action_name>.<workflow_name>".
      If entry point is a workflow or a workbook with a single workflow, the runner
      will identify the workflow automatically.
  context:
    default: {}
    type: object
    description: Additional workflow inputs.
tags: []
description: Reproducing a bug with mistral when task is cancelled or timedout
enabled: true
entry_point: workflows/wf_cancelation_issue.yaml
notify: {}
uid: action:c_int:wf_cancelation_issue
pack: c_int
ref: c_int.wf_cancelation_issue
runner_type: mistral-v2

workflows/wf_cancelation_issue.yaml:

---
version: '2.0'

c_int.wf_cancelation_issue:

  tasks:
    task:
      action: core.noop
      on-success:
        - success

    success:
      action: c_int.wf_cancelation_issue_inner
      timeout: 30

wf_cancelation_issue_inner.meta.yaml:

---
name: wf_cancelation_issue_inner
parameters:
  skip_notify:
    default:
      - task1
      - increase_attempt_number
      - task3
      - end
    type: array
    description: List of tasks to skip notifications for.
  task:
    type: string
    description: The name of the task to run for reverse workflow.
  workflow:
    type: string
    description: The name of the workflow to run if the entry_point is a workbook
      of many workflows. The name should be in the format "<pack_name>.<action_name>.<workflow_name>".
      If entry point is a workflow or a workbook with a single workflow, the runner
      will identify the workflow automatically.
  context:
    default: {}
    type: object
    description: Additional workflow inputs.
  retries:
    type: integer
    required: false
    default: 5

tags: []
description: Reproducing a bug with mistral when task is cancelled or timedout
enabled: true
entry_point: workflows/wf_cancelation_issue_inner.yaml
notify: {}
uid: action:c_int:wf_cancelation_issue_inner
pack: c_int
ref: c_int.wf_cancelation_issue_inner
runner_type: mistral-v2

workflows/wf_cancelation_issue_inner.yaml:

---
version: '2.0'

c_int.wf_cancelation_issue_inner:
  type: direct
  input:
    - retries

  tasks:
    task1:
      action: core.noop
      on-success:
        - increase_attempt_number

    increase_attempt_number:
      action: core.noop
      publish:
        attempt: <% ($.get('attempt') or 0) + 1 %>
      on-success:
        - task3

    task3:
      wait-before: 10
      action: core.local
      input:
        cmd: 'echo <% $.attempt %>; exit 1'
      on-success:
        - end
      on-error:
        - increase_attempt_number: <% $.attempt < $.retries %>

    end:
      action: core.noop
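
For reference, a rough back-of-the-envelope check of why the parent's `timeout: 30` fires while the inner workflow is still mid-loop. This is only a sketch of the intended loop semantics (task3 always fails, and the on-error guard re-enters the loop while `attempt < retries`); the function name and the assumption that task execution time is negligible are mine, not anything from Mistral itself.

```python
def min_inner_runtime(retries: int, wait_before: int) -> int:
    """Lower bound (seconds) on the inner workflow's runtime.

    Counts how often task3 runs under the intended semantics and
    multiplies by its wait-before delay. Task execution time is ignored.
    """
    attempt = 0
    runs = 0
    while True:
        attempt += 1               # increase_attempt_number publishes attempt + 1
        runs += 1                  # task3 runs once and always exits 1
        if not attempt < retries:  # on-error guard on task3
            break
    return runs * wait_before


PARENT_TIMEOUT = 30  # timeout on the parent's "success" task
lower_bound = min_inner_runtime(retries=5, wait_before=10)
print(lower_bound, lower_bound > PARENT_TIMEOUT)
```

So even when everything behaves as intended, the inner workflow needs at least 50 seconds, and the parent task gives up after 30. The bug is that the child does not stop at that point.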

aliases/wf_cancelation_issue.yaml:

---
name: alias_wf_cancelation_issue
enabled: true
action_ref: c_int.wf_cancelation_issue
description: Testing timeout and cancelation issue
formats:
  - display: "wf_cancel_test"
    representation:
      - "wf_cancel_test"
ack:
  enabled: true
  format: 'WF Cancelation and Timeout workflow started...'
  append_url: true
result:
  extra:
    slack:
      color: "{% if execution.result is defined and execution.result.extra is defined and execution.result.extra.state is defined and execution.result.extra.state == 'SUCCESS' %}#219939{% else %}#d80015{% endif %}"
  format: |
    WF cancelation and timeout task is complete. {~}
    ```
    {% if execution.result is defined and execution.result.extra is defined and execution.result.extra.state is defined and execution.result.extra.state == 'SUCCESS' %}
    All good.
    {% else %}
    No good.
    {% endif %}
    ```

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

The source of the wait-before bug has been identified at https://bugs.launchpad.net/mistral/+bug/1681562. Please follow the link to review the comments. We will need to wait for the rest of the Mistral core team to provide feedback on the use of the cache and how to work around this issue.

Again, please file separate issues in separate posts next time.

Improvement on 2nd issue w/ canceling subworkflows @ https://github.com/StackStorm/st2/pull/3375

@emptywee So for the workflow below, I did find that it goes into an infinite loop. The published var is not incremented in this case, as you’ve found. Not sure why, but we’ll track this as an issue. Thanks for your persistence.

version: '2.0'

sandbox.pub_wait:
    type: direct
    vars:
        var1: 0
        retries: 3
    output:
        vars:
            - <% $.var1 %>
            - <% $.var2 %>
    tasks:
        init:
            action: core.noop
            on-success:
                - task1
        task1:
            action: core.noop
            publish:
                var1: <% $.var1 + 1 %>
            on-success:
                - task2
        task2:
            wait-before: 3
            action: core.local
            input:
                cmd: 'echo "<% $.var1 %>"; exit 1'
            publish:
                var2: <% task(task2).result.stdout %>
            on-error:
                - task1: <% $.var1 <= $.retries %>
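
To make the symptom concrete, here is a toy Python model of the loop in `sandbox.pub_wait`. It is not how Mistral works internally; the `stale_context` flag is a made-up switch that simply mimics the observed behaviour (the increment published by task1 is lost on the next pass), and `max_runs` is a hypothetical cap so the looping case terminates.

```python
def simulate_pub_wait(retries=3, stale_context=False, max_runs=20):
    """Return the var1 values task2 would echo, one per run.

    stale_context=True models the observed symptom: the value published
    by task1 does not carry over to the next iteration.
    """
    var1 = 0                          # vars: var1: 0
    echoed = []
    while len(echoed) < max_runs:
        published = var1 + 1          # task1: publish var1 + 1
        echoed.append(published)      # task2: echo var1, then exit 1
        if not published <= retries:  # on-error guard: back to task1?
            break
        if not stale_context:
            var1 = published          # increment survives the loop
        # with stale_context=True, var1 stays at 0 and the guard never fails
    return echoed


print(simulate_pub_wait())                                # expected behaviour
print(simulate_pub_wait(stale_context=True, max_runs=6))  # capped looping case
```

With the increment carried over, the model echoes 1 through 4 and stops. With the increment lost, the guard `var1 <= retries` stays true forever, which would match the infinite loop seen above.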