cucumber: Replay failed tests

I was wondering if it’s something that we can add to the framework. We already have a function to repeat the output of failed tests, but we don’t have one to run failed tests again. It’s common to have flaky tests that can sometimes be fixed simply by running them again.

For example:

#[tokio::main]
async fn main() {
    AnimalWorld::cucumber()
        .replay_failed(2)
        .run_and_exit("tests/features/book/output/terminal_repeat_failed.feature")
        .await;
}

Here, 2 is the maximum number of times the tests should be run again in case of failures. Let’s say we have tests A, B, C and D:

  1. During the first run, only A passes
  2. Then B, C and D are run again (1 replay left if there are new failures)
  3. Only B passes, so C and D are run again (0 replays left)
  4. D fails again; this one is certainly more than just unstable

Regarding the output, we have two choices:

  • Print all test executions: it’s transparent, but can be repetitive when tests fail multiple times (like D in this example)
  • Or just print the result once the last replay is done (which can be after the maximum number of replays, here 2, or after an earlier run if all tests pass)

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 23 (23 by maintainers)

Most upvoted comments

Thank you very much for the hard work, @ilslv @tyranron. That’s a fantastic implementation!

I was a bit busy with other projects; I’ll start working on it 😃

Discussed with @tyranron:

  1. Remove .after(all) entirely; only .after(3s) is allowed.
  2. Output retries inside a single Feature/Rule branch.
  3. Add --retry, --retry-after and --retry-tag-filter CLI options.
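
For instance, a run combining these options could look like the following (the test binary name is just a placeholder for whatever your cucumber test target is called):

cargo test --test cucumber -- --retry=3 --retry-after=5s --retry-tag-filter='@flaky'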

Detailed explanation of interactions between CLI options and tags

Let’s explore how different sets of CLI options would interact with the following Feature:

Feature: Retries
    @retry
    Scenario: First
      Given a moral dilemma

    @flaky @retry(3)
    Scenario: Second
      Given a moral dilemma

    @retry(4).after(5s)
    Scenario: Third
      Given a moral dilemma

    Scenario: Fourth
      Given a moral dilemma

No CLI options at all

  • First Scenario retried once without a delay
  • Second Scenario retried 3 times without a delay
  • Third Scenario retried 4 times with 5 seconds delay in between
  • Fourth Scenario isn’t retried

--retry=5

  • First Scenario retried 5 times without a delay
  • Second Scenario retried 3 times without a delay
  • Third Scenario retried 4 times with 5 seconds delay in between
  • Fourth Scenario retried 5 times without a delay

--retry-tag-filter='@flaky'

  • First Scenario retried once without a delay
  • Second Scenario retried 3 times without a delay
  • Third Scenario retried 4 times with 5 seconds delay in between
  • Fourth Scenario isn’t retried

--retry-after=10s

  • First Scenario retried once with 10 seconds delay in between
  • Second Scenario retried 3 times with 10 seconds delay in between
  • Third Scenario retried 4 times with 5 seconds delay in between
  • Fourth Scenario isn’t retried

--retry=5 --retry-after=10s

  • First Scenario retried 5 times with 10 seconds delay in between
  • Second Scenario retried 3 times with 10 seconds delay in between
  • Third Scenario retried 4 times with 5 seconds delay in between
  • Fourth Scenario retried 5 times with 10 seconds delay in between

--retry=5 --retry-tag-filter='@flaky'

  • First Scenario retried once without a delay
  • Second Scenario retried 3 times without a delay
  • Third Scenario retried 4 times with 5 seconds delay in between
  • Fourth Scenario isn’t retried

--retry=5 --retry-after=10s --retry-tag-filter='@flaky'

  • First Scenario retried 5 times with 10 seconds delay in between
  • Second Scenario retried 3 times with 10 seconds delay in between
  • Third Scenario retried 4 times with 5 seconds delay in between
  • Fourth Scenario isn’t retried
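
If the same policy had to be configured in code rather than on the command line, one could imagine builder-level counterparts to these options. This is only a sketch: the method names below mirror the CLI flags and are assumptions, not a confirmed API, and the feature path is a placeholder.

#[tokio::main]
async fn main() {
    AnimalWorld::cucumber()
        // Hypothetical counterparts of --retry, --retry-after and
        // --retry-tag-filter; names are assumptions for illustration only.
        .retries(5)
        .retry_after(std::time::Duration::from_secs(10))
        .retry_filter("@flaky")
        // Placeholder path for the `.feature` files.
        .run_and_exit("tests/features")
        .await;
}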

@ilslv

By that I mean that if we choose to go this route, retries can be output pretty far apart from each other, so maybe we should give users a clear idea of what went wrong on the previous run(s).

I think that just printing the error “as is” and later having the | Retry #<num> label is more than enough. Like here, but with an error.


I think that we can ditch .after(all) altogether.

I’m not against this. I’ve thought about it too.

Actually, I’ve already implemented a more powerful feature, basically like Cucumber::which_scenario(): a closure decides how many retries should happen and when.

That’s OK, but the concern I’ve raised is not about power, but rather about ergonomics and the CLI. I can easily imagine a situation where someone wants to retry the test suite without sprinkling @retry tags here and there. Like --retry=3 --retry-tag-filter='@webrtc or @http' and then --retry=2 --retry-tag-filter='@webrtc or @http or @animal'. It’s OK if it’s built on top of Cucumber::which_scenario, but I’d vote to have this in the CLI, as the use cases and ergonomics benefits are quite clear.

@theredfish

My bad, it wasn’t clear enough, but my example was about Scenarios, not Steps.

Actually, Scenarios are generally run in parallel, so there is no need for the additional complexity you’ve described. We can just rerun failed Scenarios on their own.

I can try to implement this feature if you want? Are you available to guide me during the development?

I’ll be happy to help you with the development of this feature!

Thank you for the feedback! Indeed, the idea isn’t to encourage ignoring flaky tests, but to have a way to handle them while waiting for a fix.

The tag is a good idea, so we offer a different granularity and an explicit way to opt in.

I think that this may lead to unexpected problems: on panic, changes from step B may be partially applied

My bad, it wasn’t clear enough, but my example was about Scenarios, not Steps.

I can try to implement this feature if you want? Are you available to guide me during the development?

@ilslv yup.

It’s also worth marking the retried Scenarios explicitly in the output, like the following:

Feature: Animal feature
  Scenario: If we feed a hungry cat it will no longer be hungry
    ✔  Given a hungry cat
    ✔  When I feed the cat
    ✔  Then the cat is not hungry
Retry #1 | Feature: Animal feature
  Retry #1 | Scenario: If we feed a hungry cat it will no longer be hungry
    ✔  Given a hungry cat
    ✔  When I feed the cat
    ✔  Then the cat is not hungry

@theredfish I do think that adding retries for flaky tests is a great feature to have, but I have a couple of concerns about the proposed implementation.

.replay_failed(2)

In addition to specifying the number of times a test should be retried, I think we should retry only tests tagged as @flaky or something like that, as being explicit is better than implicit here. Maybe even allow overriding this value with something like @flaky(retries = 3). I want this library to be a tool that is hard to misuse, with defaults that follow best practices. So adding @flaky should be a good point of friction for the user to think twice about why this Scenario is flaky.
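
For illustration, a Scenario opted into retries via such a tag could look like this (the tag syntax follows the proposal above and is not a finalized API):

Feature: Animal feature
  @flaky(retries = 3)
  Scenario: If we feed a hungry cat it will no longer be hungry
    Given a hungry cat
    When I feed the cat
    Then the cat is not hungry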

  1. During the first run, only A passes
  2. Then B, C and D are run again (1 replay left if there are new failures)
  3. Only B passes, so C and D are run again (0 replays left)
  4. D fails again; this one is certainly more than just unstable

I think that this may lead to unexpected problems: on panic, changes from step B may be partially applied, and fully retrying it may cause some unexpected changes in the World state. Consider the following Step implementation:

#[given(...)]
fn apply_then_assert(w: &mut World) {
    // The mutation is applied before the assertion, so it persists even if the
    // assertion below panics.
    w.counter += 1;
    assert_eq!(w.counter, 3);
}

Retrying this Step after the assert_eq! fails will always increment w.counter again. So, as we don’t impose a Clone bound on our World (otherwise we would be able to .clone() the World before every Step and roll back to it if needed), the only option left is to retry the entire Scenario.
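
As a minimal sketch of why a Scenario-level retry is safe (assuming a recent cucumber version with the derive-based World setup; the step text and feature path are placeholders): the runner constructs a fresh World for every Scenario execution, so a retried Scenario starts from the Default state again instead of inheriting the partially applied increment above.

use cucumber::{given, World};

// A fresh `CounterWorld` is built via `Default` for every Scenario run, so a
// Scenario-level retry never observes the increment left over from a previous,
// panicked attempt.
#[derive(Debug, Default, World)]
struct CounterWorld {
    counter: usize,
}

#[given("a counter that is incremented and checked")]
fn apply_then_assert(w: &mut CounterWorld) {
    w.counter += 1;
    // Holds on every fresh Scenario run; retrying just this Step would see 2.
    assert_eq!(w.counter, 1);
}

#[tokio::main]
async fn main() {
    // Placeholder path for the corresponding `.feature` file.
    CounterWorld::cucumber().run_and_exit("tests/features").await;
}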

Or just print the result once the last replay is done (which can be after the maximum number of replays, here 2, or after an earlier run if all tests pass)

I’m not sure this can be achieved with streaming output like ours. And even if it could, I think we should be transparent about failed flaky tests and also include stats about them in the Summarized Writer.

@tyranron

Flaky tests should not be a common thing.

I saw a couple of conference talks where people from huge FAANG-like companies argued that at this scale flaky tests are inevitable. I’m not sure I agree with them, but such opinions are at least floating around. Also, other test runners provide this feature out of the box.