runner: Bug in IF conditions on inputs for reusable workflows
Problem:
When a reusable workflow (called workflow), contains if conditions (based on the input of the job) at job level, there is a observation that the if condition seems to execute and use value globally (causing jobs which are being skipped because of a check).
Example
Caller workflow: Simple workflow calling a reusable workflow passing in the github event (PR or push) to the reusable workflow
name: caller-workflow
on:
pull_request:
push:
branches:
- master
# 2
jobs:
call-the-workflow:
uses: action-foobar/action-testing/.github/workflows/called_workflow.yml@master
with:
TRIGGER_EVENT : ${{ github.event_name }}
Reusable workflow: Workflow with a sequence of job which has conditions based on the input (whether it is a pull_request or a push)
name: tf-app-dns-reusable
on:
workflow_call:
inputs:
TRIGGER_EVENT:
description: trigger event for the workflow
required: true
type: string
jobs:
lint:
runs-on: ubuntu-latest
name: Validate terraform configuration
steps:
- name: Checkout
uses: actions/checkout@v2
- name: debug
run: |
echo ${{ inputs.TRIGGER_EVENT }}
plan-non-prod:
if: ${{ inputs.TRIGGER_EVENT == 'pull_request' }}
needs: lint
runs-on: ubuntu-latest
name: Plan Application DNS Non Prod
steps:
- name: debug
run: |
echo "triggered event"
echo ${{ inputs.TRIGGER_EVENT }}
plan-prod:
if: ${{ inputs.TRIGGER_EVENT == 'pull_request' }}
needs: "plan-non-prod"
runs-on: ubuntu-latest
name: Plan Application DNS Prod
steps:
- name: debug
run: |
echo "triggered event"
echo ${{ inputs.TRIGGER_EVENT }}
apply-non-prod:
if: ${{ inputs.TRIGGER_EVENT == 'push' }}
needs: lint
runs-on: ubuntu-latest
name: apply non prod
steps:
- name: debug
run: |
echo "triggered event"
echo ${{ inputs.TRIGGER_EVENT }}
Observation
-
Job 3 gets the check skipped , even though the condition is valid and matches as Job 2 (Job 3 only depends on success of job 2 and the same
IF
condition as job2. -
On removal of the
IF
condition on Job 4 , then Job 3 executes fine .
Which implies of some sort of “state” of the IF condition (since there is no relationship between Job 4 and Job 3).
The observation of runs can be found in this repository here -> https://github.com/action-foobar/action-testing/actions/runs/1682109402
PS:
This was working perfectly fine till 11PM GMT 10th Jan . (as we have a lot of jobs in our org on this setup)
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 24
- Comments: 21
The root cause was a server side change we cannot quickly revert at this time, we are working on a fix that should rollout to all users over the next day or two.
Thank you everyone for your patience as we address this issue. This issue will be updated when the bug is fixed and rolled out to everyone.
👋 Hi folks, I’m sorry about this disruption.
The fix should finish rolling out in the next ~20 minutes. After that, all new runs going forward should work correctly.
Again, I’m sorry about the disruption. We’ll review this, and we’ll do better going forward.
It is an extremely bad error that put all users of GHA at risk of downstream errors. It’s a risk to all GHA users’ products stability and a fix taking days is simply unacceptable. As @jacek-jablonski said It must be classed as an incident with GHA as it clearly broken against the published documentation.
I’m sure many people will have to evaluate whether GHA are ready to be used as a proper CI system, because quite frankly this is a very elementary error (it is such a prevalent feature to use) that somehow wasn’t tested before release.
Furthermore it skipping the check and it being marked as successful are distinct errors, the former is forgivable the latter is lamentable.
Retested and the setup is back to working now @ericsciple - thanks
Separate question though. Is there any documentation on how to run previous versions of the private runner (to counter regression bugs like these) , as we are dependent on CI being production criticality level. Atleast gives us a option of fault tolerance for ourselves
@dptcs
Until the rollout is finished the incident in GitHub Actions should be fired. Workflows are erroneously marked as successful despite not completing all jobs and people may not even be aware of it.
@thboop @ericsciple this was a really bad bug, I’ve already explained my concerns here: https://github.com/actions/runner/issues/1602#issuecomment-1010895916 is it really possible that GitHub thinks it can not report this as an outage and also not explain what happened? Failing builds passed, that’s literally the most fundamental failure a CI system can make.
If there’s someone else I should address, pls let me know.
Can we get clarification on what exactly happened? We had to refactor most of our pipes just in case this issue would be extended over time.
Also, I agree this should have been reported on the status page.
I didn’t interpret it as someone being fired. I interpreted it as “the incident should be submitted” e.g. maybe being recognised on https://www.githubstatus.com/
If I’m wrong in that assumption, then yeah, I completely agree.
Our team is looking into this and will follow up shortly.
Thank you for your patience!
@thboop @ericsciple I am still facing this issue with steps containing ‘if’ condition skipped in reusable workflows. Anybody else facing the same?
We’re facing this issue