github-action: GitHub Actions: "Re-run failed jobs" will run the entire test suite

We have the e2e tests configured to run in parallel on the Cypress Dashboard. I was following this thread and added the custom-build-id to the command to distinguish different runs based on build id. Everything worked fine until GitHub Actions rolled out the ability to "Re-run failed jobs". If I just set the custom-build-id to ${{ github.run_id }}, the second attempt always marks the tests as passing with ‘Run Finished’, but the tests are not triggered at all. So I set the custom-build-id to ${{ github.run_id }}-${{ github.run_attempt }}, and now it runs the entire test suite instead of the originally allocated subset of tests.

  E2E_tests:
    runs-on: ubuntu-latest
    name: E2E tests
    strategy:
      fail-fast: false
      matrix:
        ci_node_total: [6]
        ci_node_index: [0, 1, 2, 3, 4, 5]
    timeout-minutes: 45
    steps:
      - uses: actions/checkout@v2

      - name: Use Node.js
        uses: actions/setup-node@v2

      - name: Install Dependencies
        run: npm ci

      - name: Start app
        run: make start-app-for-e2e
        timeout-minutes: 5

      - name: Cypress Dashboard - Cypress run
        run: |
          npm run cypress
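
For context, the Cypress CLI flag that carries the build id is --ci-build-id, used together with --record --parallel. Below is a minimal sketch of what the npm run cypress script is assumed to expand to for the last step above; the record-key secret name is an assumption, not taken from the original setup.

      - name: Cypress Dashboard - Cypress run
        env:
          # Assumed secret name for the Dashboard record key
          CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }}
        run: |
          # run_id alone collides with the already-finished first attempt;
          # appending run_attempt starts a fresh parallel build, which is why
          # the single re-run job currently receives the whole suite.
          npx cypress run --record --parallel \
            --ci-build-id "${{ github.run_id }}-${{ github.run_attempt }}"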

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 28
  • Comments: 60 (13 by maintainers)

Most upvoted comments

The ability to re-run failed tests is becoming more and more necessary as we scale; it’s making us consider alternatives to Cypress Cloud.

Thank you for bumping this. This is currently under discussion as it relates to other work and Cypress Cloud features. There is a sense of urgency, but a solution isn’t quite as simple as it sounds. I’ll continue to update this thread as information is available.

I was able to get some more clarity on this from our Cloud team. Issue #574 also has some additional context.

Here is the current status:

  • Before, there was an issue where all re-runs got a PASS, regardless of actual status. This issue has been fixed.
  • Currently, if a re-run is initiated, all specs get run on the machines available. That is not optimal. The Cloud team is looking into the connection between GH Actions and Cypress in order to set up re-runs to be accurate and efficient.

I will be updating this issue as new information is available.

Hi! As @jaffrepaul mentioned, we’re exploring various ways to solve this on Cypress Cloud. These are a few of the potential directions towards which we are leaning:

  1. Add a “spec retries” feature, whereby specs with the failing tests will be automatically retried at the end of the test run. This could also include some scoping configurations, for instance, if you want to only retry tests which just failed in the latest run, but passed in previous runs (as opposed to “known issue” tests which failed over multiple consecutive runs).

  2. Add an “environment stability test and cooldown” feature, whereby a test custom to your testing environment would execute, perhaps when “new” failures occur (as described above) or simply periodically between batches of tests. If it fails, the execution of further tests would be paused until a specified wait period passes, giving your infrastructure enough time to cool down from memory leaks etc. and become stable again.

As you can tell, we’re leaning towards more automated, less CI-specific solutions in these cases, as opposed to a “retry button” integration. Requiring you to inspect test failures and click these buttons on a per-job basis seems like a good opportunity for automation. Also, and this is mostly a Cypress concern, we would need to add support not only for GitHub Actions’ “Re-run failed jobs”, but also for other CI providers. And even if we did use an automated GHA “Re-run failed jobs” (as opposed to the manual mode), that implementation adds challenges for us in terms of adding more cross-run test-result linking in Cloud.

I can’t yet give an estimate of when we would address this specific issue; however, we are currently actively working on improvements to Cypress’ failure retries, and there’s a strong chance this could be worked on in one of the next phases of this project.

That’s definitely critical given how the billing works (Cypress and GitHub included); it sounds like we’re getting billed for tests that already passed.

GitHub Actions “Re-run failed jobs” is useful to repeat workflow runs that have failed due to GitHub issues such as temporary network connectivity. It does not seem to be helpful to re-run Cypress tests in general, because they would need to run on a corrected version of the AUT.

Aside from GitHub issues, it’s useful to be able to re-run due to issues with any other external dependency.

Yes, I also tried to replicate this last night and saw this same behavior:

I clicked the “re-run failed jobs” button in GitHub and that kicked off the Cypress tests again in the same job containing the failed test. But, instead of running the same set of tests, it re-ran all of the tests in the single job. I have included a screenshot that should hopefully illustrate this a little better.

This by default will fail the job, because one single worker can’t possibly run all of the tests before the job timeout kicks in (which is why it is parallelized in the first place). We are on 10.7.0.

I believe what would need to happen is for Cypress to remember which tests get allocated to which workers so that if there is a failure on worker 3 of 5, and “re-run failed jobs” is selected on the GHA side, the same set of tests will get re-run on that worker.

There were recently some changes in our services repo that may have taken care of this issue. Can someone retest with 10.7.0 or later and post results? Thanks!

@admah I just tested this after upgrading to 10.8.0 and still saw all of the tests run in a single job when one of the parallelized containers had a failed test.

To give some more detail, the codebase I am working on uses the Cypress parallelization feature, attached to Cypress dashboard, to split our test suite into 5 different jobs. In this situation, one test failed in one of the parallelized jobs. To retry this test, I clicked the “re-run failed jobs” button in GitHub and that kicked off the Cypress tests again in the same job containing the failed test. But, instead of running the same set of tests, it re-ran all of the tests in the single job. I have included a screenshot that should hopefully illustrate this a little better.

Thanks for looking into this, it would be a huge improvement to our CI pipeline if this issue was resolved!

[Screenshot: Screen Shot 2022-09-14 at 6 44 52 AM]

Thank you for elevating this. Our teams have shuffled a bit of late so I’ll see if I can get worthwhile information on this. Stay tuned 💪

It feels like the Cypress Dashboard is already able to store the GHA run (which changes per commit). Is it too far a leap to also store the failures for a given run ID and, if the number is greater than 0, simply use the array of failed specs when it’s rerun? I doubt anyone’s really re-running their tests for the fun of it.
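
A purely client-side approximation of that idea, not an existing Cypress Cloud feature: if the previous attempt had written its failed specs to a file, a re-run attempt could narrow the run with the standard --spec flag. The step layout, file name, and spec paths below are hypothetical placeholders.

      - name: Cypress run (hypothetical re-run narrowing)
        if: ${{ github.run_attempt != '1' }}
        run: |
          # failed-specs.txt is a hypothetical file listing one spec per line,
          # e.g. cypress/e2e/login.cy.js
          SPECS=$(paste -sd, failed-specs.txt)
          npx cypress run --spec "$SPECS"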

Please prioritize this; it’s really driving up our cypress.io costs and certainly our frustrations with the paid service.

In the meantime, would it be possible to disable “rerun only failed tests” until this is resolved? It compounds the issue when you have to continually tell large teams not to use “rerun only failed tests”, since it’ll be a waste / timeout.

@admah Any news? Thanks

Any news on this thread? Seems like it went quiet for a couple of months. My team is experiencing this issue as well.

Thanks @piotrekkr. However, I found https://github.com/bahmutov/cypress-split to work better than passing the parallel flag. Currently it works just as well as the parallel function, and it does not require me to cache and use some run ID for the job to execute on reruns, etc.
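
For anyone comparing: cypress-split splits specs deterministically by machine index rather than through Dashboard orchestration, so re-running a failed matrix job should re-run the same slice. A minimal sketch of wiring it into the matrix from the workflow above; it also needs cypressSplit(on, config) registered in the Cypress config’s setupNodeEvents, and the step layout here is an assumption.

      - name: E2E tests (cypress-split)
        env:
          # cypress-split reads these to pick this machine's slice of the specs
          SPLIT: ${{ matrix.ci_node_total }}
          SPLIT_INDEX: ${{ matrix.ci_node_index }}
        run: npx cypress run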

Good stuff @piotrekkr.

I’ve forwarded the information (the original issue) to the Cloud team. There has been some preliminary work done to rectify running only failed tests. There’s more work required to get it over the hump. I’ll report back when I get a clearer sense of the team’s ability to prioritize this. There isn’t a way that I’m aware of to remove the option in the GH UI in the interim.

@Git2DaChoppaa

According to https://www.linkedin.com/in/amu/, Adam Murray (@admah) doesn’t work for Cypress.io anymore.

All those problems could be fixed if the Dashboard could work this way for the same dashboard run KEY.

7 tests, 3 workers

first run

  • run all tests and load balance them on all workers
  • 5 / 7 tests green, 2 workers failed

next runs with the same Cypress KEY

  • check dashboard result for given KEY and failed tests
  • run all failed tests and load balance them on the two available runners (two workers failed, so on rerun GitHub provides only those two)

At least this would work fine with GitHub imho.

Not sure how hard it is to implement, but it is the Dashboard side that orchestrates and sends tests to workers, so my guess would be that this should not be very hard. Unless already-completed test suite runs somehow cannot be updated…

Both are bad. Think about the billing: one test fails, which should cost 1-2 minutes, but you’re now billed roughly 25x more compute in my previous example.

And that is true regardless of whether the parallel matrix is respected or not, since all test / run minutes are accounted for.

I think not trimming the parallel split would be less critical if:

  • it only re-ran the failed tests
  • or there was a threshold of failures on rerun before splitting
  • or the rerun used the same container count but only ran what failed in each, so that “job 1” would still be “job 1”

For now it’s unpredictable: always a full rerun and full billing.

Is there any update on this? Getting charged for an entire test suite re-run when one test fails on one parallel job is really upsetting, given the size of our test suite.