runner: Bug in IF conditions on inputs for reusable workflows

Problem:

When a reusable (called) workflow contains job-level if conditions based on the workflow's inputs, the condition appears to be evaluated against some globally shared value, causing jobs to be skipped even though their own condition holds.

Example

Caller workflow: a simple workflow that calls a reusable workflow, passing it the GitHub event name (pull_request or push)

name: caller-workflow
on:
  pull_request:
  push:
    branches:
      - master
jobs:
  call-the-workflow:
    uses: action-foobar/action-testing/.github/workflows/called_workflow.yml@master
    with:
      TRIGGER_EVENT: ${{ github.event_name }}

Reusable workflow: a workflow with a sequence of jobs whose if conditions depend on the input (whether the trigger was a pull_request or a push)

name: tf-app-dns-reusable

on:
  workflow_call:
    inputs:
      TRIGGER_EVENT:
        description: trigger event for the workflow
        required: true
        type: string

jobs:
  lint:
    runs-on: ubuntu-latest
    name: Validate terraform configuration

    steps:
      - name: Checkout
        uses: actions/checkout@v2
      - name: debug
        run: |
          echo ${{ inputs.TRIGGER_EVENT }}
  plan-non-prod:
    if: ${{ inputs.TRIGGER_EVENT == 'pull_request' }}
    needs: lint
    runs-on: ubuntu-latest
    name: Plan Application DNS Non Prod
    steps:
      - name: debug
        run: |
          echo "triggered event"
          echo ${{ inputs.TRIGGER_EVENT }}
  plan-prod:
    if: ${{ inputs.TRIGGER_EVENT == 'pull_request' }}
    needs: plan-non-prod
    runs-on: ubuntu-latest
    name: Plan Application DNS Prod
    steps:
      - name: debug
        run: |
          echo "triggered event"
          echo ${{ inputs.TRIGGER_EVENT }}
  apply-non-prod:
    if: ${{ inputs.TRIGGER_EVENT == 'push' }}
    needs: lint
    runs-on: ubuntu-latest
    name: apply non prod
    steps:
      - name: debug
        run: |
          echo "triggered event"
          echo ${{ inputs.TRIGGER_EVENT }}
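
As an aside, because a called workflow runs in the caller's context, the github context (including github.event_name) should also be readable directly inside the reusable workflow, which would make the TRIGGER_EVENT input unnecessary. A minimal sketch (not verified against this bug):

```yaml
jobs:
  plan-non-prod:
    # the github context in a reusable workflow reflects the calling
    # workflow's run, so this compares against the caller's trigger event
    if: ${{ github.event_name == 'pull_request' }}
    runs-on: ubuntu-latest
    steps:
      - name: debug
        run: echo "triggered by ${{ github.event_name }}"
```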

Observation

(screenshot of the workflow run graph showing the skipped jobs)

  • Job 3 (plan-prod) gets its check skipped, even though its condition is valid and identical to Job 2's (plan-non-prod). Job 3 only depends on Job 2 succeeding and carries the same if condition as Job 2.

  • If the if condition is removed from Job 4 (apply-non-prod), Job 3 executes fine.

This implies some sort of shared "state" in the evaluation of the if conditions, since there is no relationship between Job 4 and Job 3.
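
One possible interim workaround, sketched here under the assumption that only inputs-based job-level if conditions are affected (untested against this bug), is to re-export the input as a job output from an unconditional job and gate the downstream jobs on the needs context instead of inputs:

```yaml
jobs:
  lint:
    runs-on: ubuntu-latest
    outputs:
      # re-export the input so downstream jobs can gate on needs.* rather than inputs.*
      event: ${{ inputs.TRIGGER_EVENT }}
    steps:
      - name: debug
        run: echo "event is ${{ inputs.TRIGGER_EVENT }}"
  plan-prod:
    # lint must be listed in needs for needs.lint.outputs to be visible here
    needs: [lint, plan-non-prod]
    if: ${{ needs.lint.outputs.event == 'pull_request' }}
    runs-on: ubuntu-latest
    steps:
      - name: debug
        run: echo "planning prod"
```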

The affected runs can be seen in this repository: https://github.com/action-foobar/action-testing/actions/runs/1682109402

PS:

This was working perfectly fine until 11 PM GMT on 10 January (we have a lot of jobs in our org using this setup).

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 24
  • Comments: 21

Most upvoted comments

The root cause was a server side change we cannot quickly revert at this time, we are working on a fix that should rollout to all users over the next day or two.

Thank you everyone for your patience as we address this issue. This issue will be updated when the bug is fixed and rolled out to everyone.

👋 Hi folks, I’m sorry about this disruption.

The fix should finish rolling out in the next ~20 minutes. After that, all new runs going forward should work correctly.

Again, I’m sorry about the disruption. We’ll review this, and we’ll do better going forward.

Until the rollout is finished the incident in GitHub Actions should be fired. Workflows are erroneously marked as successful despite not completing all jobs and people may not even be aware of it.

It is an extremely bad error that put all users of GHA at risk of downstream errors. It's a risk to every GHA user's product stability, and a fix taking days is simply unacceptable. As @jacek-jablonski said, it must be classed as an incident, since GHA is clearly broken against the published documentation.

I'm sure many people will have to evaluate whether GHA is ready to be used as a proper CI system, because quite frankly this is a very elementary error (conditional execution is such a prevalent feature) that somehow wasn't tested before release.

Furthermore, skipping the check and marking the run as successful are distinct errors; the former is forgivable, the latter is lamentable.

Retested, and the setup is back to working now @ericsciple - thanks.

A separate question though: is there any documentation on how to run previous versions of the private runner (to guard against regression bugs like these)? We depend on CI at production criticality level, so this would at least give us an option for fault tolerance on our side.

@dptcs

Until the rollout is finished the incident in GitHub Actions should be fired. Workflows are erroneously marked as successful despite not completing all jobs and people may not even be aware of it.

@thboop @ericsciple this was a really bad bug, I’ve already explained my concerns here: https://github.com/actions/runner/issues/1602#issuecomment-1010895916 is it really possible that GitHub thinks it can not report this as an outage and also not explain what happened? Failing builds passed, that’s literally the most fundamental failure a CI system can make.

If there’s someone else I should address, pls let me know.

Can we get clarification on what exactly happened? We had to refactor most of our pipelines just in case this issue persisted over time.

Also, I agree this should have been reported on the status page.

While I don’t agree with the demand for someone to be fired, it is an extremely bad error that put all users of GHA at risk of downstream errors. It’s a risk to all GHA users’ products stability and a fix taking days is simply unacceptable.

I didn’t interpret it as someone being fired. I interpreted it as “the incident should be submitted” e.g. maybe being recognised on https://www.githubstatus.com/

If I’m wrong in that assumption, then yeah, I completely agree.

Our team is looking into this and will follow up shortly.

Thank you for your patience!

@thboop @ericsciple I am still facing this issue, with steps containing an 'if' condition being skipped in reusable workflows. Is anybody else facing the same?

We’re facing this issue