concourse: Ability to re-trigger failed build with the same input versions

When using the new version: every configuration for a get task, it is possible to arrive at a state where you have multiple builds of the same job running at the same time. If an earlier build fails, there is no way to re-trigger it with the same set of inputs. We haven’t been able to determine a useful workaround; setting serial: true doesn’t really help in this scenario, because the next build will start as soon as the first one fails.

It would be helpful if there were a way to re-trigger the job with the same inputs as a particular build (failed or otherwise).

Let us know if you need more details on this scenario or our desired fix. Thanks!

@davewalter and @rmasand

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 99
  • Comments: 56 (25 by maintainers)

Most upvoted comments

We’re thinking about splitting today’s trigger build button (+). It is primarily used for three things today:

  1. Impatience: I just pushed something or know that someone just published something, and I want the build to run now.
  2. Retrying a build (this issue): I want to re-run the current build, either to see if it’s flaky or to retry because something outside the build failed (e.g. github, a deployment, etc.).
  3. Triggering a job that only ever manually runs, e.g. shipping a product after you’ve written release notes.

The flaw with case 1 is that there’s a race condition. In the time between you loading the page and clicking the +, Concourse may have already found your stuff and queued a build. Now you have two, which is annoying.

The flaw with case 2 is you can only do it with the latest build, and also if you triggered a bunch, new versions may come in, potentially invalidating your flakiness trial. You could set version to a particular version in your pipeline, but that’s annoying.

Case 3 pretty much works, but you don’t know what versions it’ll use until you run it. See https://github.com/concourse/concourse/issues/269

So, I think we should split + into two buttons. One that lives on the job, “sync”, which will make sure everything’s up-to-date and then queue up a build if it should (i.e. one’s not queued already; same semantics as auto-triggering). The other button would be associated with a particular build of the job, and would re-trigger it with the same inputs. This covers cases 1 and 2.

The third case needs some more thinking since a “sync” button alone doesn’t intuitively seem like enough given that the build only manually triggers.
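To make the proposed split concrete, here is a rough Python sketch of the two actions. It is not Concourse code: the `Job` class and the `sync` and `rerun` functions are hypothetical, and treating "nothing new since the last build" as a reason not to queue is an assumption about what "same semantics as auto-triggering" means.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical in-memory stand-in for a job's build queue and history."""
    pending: list = field(default_factory=list)   # input sets of queued builds
    finished: list = field(default_factory=list)  # input sets of finished builds


def sync(job, latest_inputs):
    """Queue a build for the latest inputs only if one is actually needed."""
    if job.pending:
        # A build is already queued, so a second click must not add another.
        return False
    if job.finished and job.finished[-1] == latest_inputs:
        # Assumed: nothing new has arrived since the most recent build.
        return False
    job.pending.append(dict(latest_inputs))
    return True


def rerun(job, build_index):
    """Queue a new build with exactly the inputs of an existing build."""
    inputs = dict(job.finished[build_index])
    job.pending.append(inputs)
    return inputs
```

With this shape, clicking "sync" twice in a row queues at most one build, while `rerun` always queues a build regardless of what has arrived since.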

+1000000 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍👍 👍👍👍 👍👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍👍 👍👍👍 👍👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍👍 👍👍👍 👍👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍 👍👍 👍👍👍 👍👍

For those following along: this has been implemented and will be in v6.0! I don’t have an ETA yet since v6.0 includes very substantial internal changes that we’re doing due diligence to test out. We’re considering shipping a beta release first.

Closing this out!

As implemented in v6.0, re-running a build will create a new build named e.g. “123.1”, which will appear adjacent to the original build in the build history. This placement in the history reflects the scheduler’s iteration order for passed constraints - i.e. if you re-run a very old build it won’t suddenly propagate the older versions downstream if there are “newer” successful builds after the re-run.
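As an illustration of the naming and ordering described above, here is a small Python sketch. The helper names are made up, and the assumption that a second re-run of build 123 would be called "123.2" goes beyond what this comment states.

```python
def build_sort_key(name):
    """Order "123.1" directly after "123" and before "124"."""
    major, _, rerun = name.partition(".")
    return (int(major), int(rerun) if rerun else 0)


def next_rerun_name(original, existing):
    """Pick the next free re-run name for the build that `original` belongs to."""
    base = original.split(".")[0]
    # Re-run numbers already used for this build (0 stands for the original).
    taken = [build_sort_key(n)[1] for n in existing if n.split(".")[0] == base]
    return f"{base}.{max(taken, default=0) + 1}"
```

Sorting a history with `build_sort_key` keeps a re-run next to the build it came from instead of moving it to the top, which matches the point that a re-run of an old build should not count as the newest build.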

The new build will run with a newly constructed build plan based on the pipeline’s current configuration, using the versions of each input from the original build. This should work in the common case of re-triggering recent flakes, but it can fail if the configuration has changed such that new get steps have been added, or the old version is no longer available because the resource changed. Hope that’s good enough for MVP!
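The two failure modes mentioned above can be sketched in Python as follows. This is only an illustration under assumed data shapes (get step names as strings, versions as dicts); `RerunError` and `resolve_rerun_inputs` are hypothetical names, not part of Concourse.

```python
class RerunError(Exception):
    """Raised when the original inputs no longer fit the current config."""


def resolve_rerun_inputs(current_get_steps, original_versions, available_versions):
    """Map each get step in the current config to the original build's version."""
    inputs = {}
    for name in current_get_steps:
        if name not in original_versions:
            # A get step was added after the original build ran.
            raise RerunError(f"get step {name!r} did not exist in the original build")
        version = original_versions[name]
        if version not in available_versions.get(name, []):
            # The resource changed and the old version is gone.
            raise RerunError(f"version {version!r} of {name!r} is no longer available")
        inputs[name] = version
    return inputs
```

A recent flake, where neither the config nor the resource has changed, passes straight through and gets the same versions back.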

In the future we plan to fix this by having re-runs run with the exact build plan that the original ran with, rather than constructing a new plan based on the current configuration. This is going to be tracked in a separate “epic”, Build Lifecycle.

For those who haven’t seen @vito 's comment in #1172 , I’ll sort of reiterate what vito said. We created #1172 instead of just implementing the ability to re-trigger a single build because looking beyond that one retriggered job, things get pretty confusing. For example, re-triggering a build in the middle of the pipeline won’t result in the rest of the pipeline running if it’s already run with a more recent PR. Even if you have applied version: every, the current pipeline semantics will not let you go back in time and rerun older versions than what’s already been tested. @vito has a more detailed explanation here if you still have questions: https://github.com/concourse/concourse/issues/1172#issuecomment-307224634.

To that point, we are trying to draw up a better solution with #1172 that involves sort of “instances” of a pipeline forked off by every build of a job. This will happen, for example, if you specify forked: true on a job, which will create a new instance of a pipeline for every build that runs from that job and pin down the versions of resources that the build started with. That way, if the job fails in that pipeline instance, you can re-trigger the job and it will run with the pinned down set of inputs and run the rest of the jobs in the forked instance after it goes green. Only the version of the inputs to the “forked” job will be pinned down, and the rest of the jobs will determine their inputs as normal (i.e. latest available candidates). Therefore you would use passed constraints to pin anything beyond the first job, just like you would do in a regular pipeline. In addition, trigger would only apply to resources that came out of the “forked” job, to prevent old instances from constantly running.

This is currently only our initial draft idea and still needs a lot of adjusting, but we’d like to know whether it would satisfy the use cases for this issue.

@leshik for now I’m disabling versions of the resource in concourse then triggering the build again.

In this particular case I wanted to disable all current PRs except for one. However, you only need to disable the versions that are more recent than the one you want to keep (top down).

Now I can trigger my pipeline again with the + button and I get the expected input.

@clarafu I’m not using concourse yet, because of this particular issue.

I want to use concourse for continuous delivery. I would like to have a “button” to deploy to production. If I later need to do a rollback, I would like to re-trigger an old successful “build”, using all the versions it used, to redeploy those same versions to production.

This “feature” exists in many other CI/CD tools, and for someone (like me) who wants to migrate from them to concourse, it’s kind of a deal breaker.

+1

Same here, if you guys are busy can we help somehow?

This is top-of-the-backlog now as we’ll need it for https://github.com/concourse/concourse/issues/3602. The days of pinning and re-triggering and forgetting to un-pin are numbered!

Note: we might go for a quick-and-dirty version of this which directly replaces the build being re-triggered. In the future we’ll want to keep track of each run of the build, but for the sake of unblocking #3602 quickly I think we should just start with this minimum viable solution as it has little to no implications on the UI/build ordering/etc.

@clarafu Point number 2 mentioned in the comment above (retrying flaky or failed build steps) is a primary reason this functionality is needed independent of any other feature to address multiple branch handling.

Sometimes build steps just fail intermittently and you need to re-run them. This isn’t a defense of flaky builds; you may be actively working on addressing the root cause of the intermittency or flakiness, but can’t afford to have your pipeline grind to a halt in the meantime. This is especially true of very large and mature browser integration test suites, which can suck up limitless pair weeks trying to squash all flakiness, and in some cases it’s not even something you can fix (e.g. newly-introduced browser or webdriver bugs).

If you have pipelines with many large, long-running, highly-parallelized test suites, having to re-run the entire pipeline if one single spec flakes out is unacceptable - especially if it’s currently failing somewhat frequently. In this scenario, Concourse seems unusable when compared to other CI tools which have this support - where you can just retry the single step, and the pipeline continues on when it passes.

The original reason for this request was that we manage multiple “environments” (pool resource) that we use to deploy CF, and we have an automated pipeline that cleanses and prepares each environment for use in other pipelines. At any given time, we could have multiple environments going through the same pipeline and, depending on timing and/or build failures, environments can get “lost” in the pipeline as they have been overtaken by later builds. At that point, our only option is to manually un-claim the environment in the pool and let it start from the beginning again. Ideally, in the case of a failure, we would be able to re-trigger the failed job as it was originally run.

The bigger problem as I see it is how to solve the issue of builds of the same job running and the one that started second finishing before the one that started first. In this case, even with version: every turned on, the first build never triggers the next job in the pipeline. We would also get into this situation if the first job failed and we were able to re-trigger it to get it to go green.

We’d really really like this. Is there any progress on this?

is this issue currently being actively worked on internally? I need this asap, but also don’t want to duplicate work. similarly, are there any contributor guidelines I should read?

@clarafu, I’m with @davewalter. I think these are two separate issues.

The workflow of the feature branches does not necessarily solve the problem of retriggering a build. It would be a nice to have along with it.

@davewalter has a great use case where the idempotent explicitness of concourse doesn’t work for automation.

That being said, as the creator of the PR resource, this issue still has my vote. #1172 seems very use case specific.

👍

Retriggering jobs would be a lot more elegant than empty commits! 😝 👍

Pleaseeeeee 👍

👍 This would go a long way to making Concourse a more viable choice for us.

+1 As well, this is a pretty critical feature. We’ve sort of circumvented it with empty commits (since we’re using the PR resource), but it would be ideal to simply retry a failed job with the same inputs.

Updated the gem to support local user auth in concourse 4, the script should be working

This has evolved into the Build re-triggering track of work. Iterative designs have been moved to smaller sliced stories and can be found in the Build re-triggering project here: https://github.com/concourse/concourse/projects/24

@hstenzel Yep - to re-trigger a build you would do so from the build’s page. There’s no such thing as re-triggering a job - you can trigger a new build of a job, but the re- part of re-trigger means you’re running an already-existing build for a second time with the same inputs.

You will be able to re-trigger a build regardless of whether it succeeded or failed; they both have their use case: re-trigger a succeeded build to detect flakes, re-trigger failed build to allow artifacts to continue along the pipeline.

@simonjohansson realized that my existing job did this a different way, but given that half the work was done I made another script https://gist.github.com/arwineap/3ce8a4c4084b33cc5fd527c871d42c1a

I run on 3.14, but I think it should work on updated versions too. It depends on having basic auth enabled, and the following upstream api endpoints:

GetBuildPlan
GetJobBuild
PauseResource
ListResourceVersions
DisableResourceVersion
CreateJobBuild
GetBuild
EnableResourceVersion
UnpauseResource
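For readers who don't want to open the gist, here is a guess at the order those endpoints would be called in, written as a Python sketch. It is not the gist's code: `client` is a hypothetical wrapper with one method per endpoint name, the version and build dict shapes are assumptions, and treating any status other than "pending" as "inputs already resolved" is also an assumption. GetBuildPlan and GetJobBuild (presumably used to find the version to keep) are left out.

```python
import time


def rerun_with_version(client, resource, job, keep_id, poll_seconds=1.0, max_polls=60):
    """Trigger `job` so that it picks up the resource version `keep_id`."""
    client.pause_resource(resource)  # stop new versions from arriving
    disabled = []
    try:
        # Assumed: versions come back newest first, as dicts with "id" and "enabled".
        for version in client.list_resource_versions(resource):
            if version["id"] == keep_id:
                break
            if version["enabled"]:
                client.disable_resource_version(resource, version["id"])
                disabled.append(version["id"])
        build = client.create_job_build(job)
        # Wait until the build has left "pending" before undoing the changes.
        for _ in range(max_polls):
            if client.get_build(build["id"])["status"] != "pending":
                break
            time.sleep(poll_seconds)
        return build
    finally:
        # Always restore state, even if triggering failed part-way through.
        for version_id in disabled:
            client.enable_resource_version(resource, version_id)
        client.unpause_resource(resource)
```

The `try`/`finally` is the main design point: only versions this function disabled are re-enabled, and the resource is unpaused even on errors, so a failed run doesn't leave the pipeline in a half-disabled state.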