spinnaker: Pipeline executions are occasionally stuck

Issue Summary:

Aloha!

We occasionally see pipeline executions get stuck: the execution starts, but none of its stages ever run.

Cloud Provider(s):

GCP

Environment:

Spinnaker deployed into GKE cluster.

Feature Area:

Pipelines

Description:

Behavior that we’re seeing:

  • The pipeline’s status says that it is RUNNING, but the very first stage is NOT_STARTED. The very first stage has no preconditions.
  • It stays stuck indefinitely like this for hours.
  • We encounter this once in a while, but not all the time. Anecdotally, we would say ~1-2% of pipeline executions.
[Screenshot: pipeline_stuck]
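
For anyone who wants to spot this state automatically, the symptom above (execution RUNNING, every stage NOT_STARTED, for a long time) can be expressed as a small check. This is only a sketch: it assumes the execution has already been fetched as JSON and parsed into a dict, and that it exposes `status`, `stages`, and a millisecond `startTime` (or `buildTime`) field. The function name `is_stuck` and the 30-minute threshold are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_stuck(execution, now, threshold=timedelta(minutes=30)):
    """Return True if the execution matches the stuck symptom.

    Assumed (not confirmed) shape of `execution`:
      {"status": str, "startTime": epoch millis, "stages": [{"status": str}, ...]}
    """
    # The execution itself must claim to be running.
    if execution.get("status") != "RUNNING":
        return False

    # Every stage must still be waiting; an empty stage list is not treated as stuck.
    stages = execution.get("stages", [])
    if not stages or any(s.get("status") != "NOT_STARTED" for s in stages):
        return False

    # Fall back to buildTime if startTime is missing (assumption about the payload).
    start_ms = execution.get("startTime") or execution.get("buildTime")
    if start_ms is None:
        return False

    started = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
    return now - started > threshold
```

Running this periodically over recent executions would at least tell us when the problem occurs, which might help narrow down the timing relative to the trigger.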

How we’ve tried to remedy this:

  • Pressing the cancel button (the red X in the web UI) shows a confirmation dialog, but after we accept it, the dialog stays open in the browser window indefinitely.
  • Using the spin CLI to call spin pipeline execution cancel ... reports that the pipeline execution was successfully canceled, but nothing actually changes: the UI still shows the stuck state, and subsequent pipeline executions cannot proceed.
  • When we see this, the current remedy we have is to delete the entire application and re-install pipelines.
  • As suggested by @kskewes (thank you!), we could try deleting the associated row in SQL, though we would prefer not to have to do this. We’re interested in understanding and fixing the root cause! 🍍 🙂 🍍
  • We tried searching through logs for messages containing the associated execution ID, but did not see anything of note.
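
The log search in the last bullet can be sketched roughly as follows. It assumes the service logs have already been exported as plain text lines (for example, from the Orca and Echo pods); the function name `find_execution_lines` and the sample ID are hypothetical.

```python
def find_execution_lines(log_lines, execution_id):
    """Yield (line_number, line) for every log line mentioning the execution ID."""
    for number, line in enumerate(log_lines, start=1):
        if execution_id in line:
            yield number, line.rstrip("\n")


# Example with made-up log lines and a placeholder execution ID.
sample_logs = [
    "INFO starting execution 01EXAMPLEEXECUTIONID\n",
    "INFO unrelated message\n",
    "WARN no handler ran for 01EXAMPLEEXECUTIONID\n",
]
matches = list(find_execution_lines(sample_logs, "01EXAMPLEEXECUTIONID"))
```

A plain substring match like this only finds lines that print the ID verbatim, so messages that omit the ID would be missed.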

Steps to Reproduce:

This is hard to reproduce, but anecdotally, it happens in roughly 1-2% of our pipeline executions. Our pipeline configuration has not been changing: the same configuration sometimes gets stuck and at other times runs just fine.

Additional Details:

  • We’re currently running Spinnaker 1.20.4 on GKE. Our Front50 and Orca are using SQL. Our spin CLI is version 1.17.0.
  • We’re using the webhook API to trigger pipelines, but we are unsure whether this is a contributing factor.

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 1
  • Comments: 15 (2 by maintainers)

Most upvoted comments

@friendly-pineapple Hey, I came across a similar issue. I haven’t been able to reproduce it; however, by following this runbook I was able to at least cancel the execution.