spinnaker: Pipeline executions are occasionally stuck
Issue Summary:
Aloha!
We are occasionally seeing that our pipeline executions get stuck, where none of the pipeline stages actually run.
Cloud Provider(s):
GCP
Environment:
Spinnaker deployed into GKE cluster.
Feature Area:
Pipelines
Description:
Behavior that we’re seeing:
- The pipeline’s status says that it is RUNNING, but the very first stage is NOT_STARTED. The very first stage has no preconditions.
- It stays stuck indefinitely like this for hours.
- We encounter this once in a while, but not all the time. Anecdotally, we would say ~1-2% of pipeline executions.
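For anyone who wants to spot this state automatically, here is a minimal Python sketch of the signature described above: execution is RUNNING, every stage is still NOT_STARTED, and it has been that way for a while. It works on execution records that have already been fetched as dictionaries. The field names (`id`, `status`, `stages`, `startTime`, `buildTime`), the epoch-millisecond timestamps, and the 30-minute threshold are assumptions for illustration, not something confirmed in this issue, and `find_stuck_executions` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def find_stuck_executions(executions, now, max_age=timedelta(minutes=30)):
    """Return IDs of executions that are RUNNING with no stage ever started.

    Assumed record shape (hypothetical, adjust to your data):
      {"id": str, "status": str, "startTime": epoch millis,
       "stages": [{"status": str}, ...]}
    """
    stuck = []
    for execution in executions:
        if execution.get("status") != "RUNNING":
            continue

        stages = execution.get("stages", [])
        # Stuck signature: at least one stage exists and none has left NOT_STARTED.
        if not stages or any(s.get("status") != "NOT_STARTED" for s in stages):
            continue

        # Fall back to buildTime if startTime is absent; skip if neither exists.
        started_ms = execution.get("startTime") or execution.get("buildTime")
        if started_ms is None:
            continue

        started = datetime.fromtimestamp(started_ms / 1000, tz=timezone.utc)
        # Ignore executions that only just started and may still be picked up.
        if now - started >= max_age:
            stuck.append(execution["id"])
    return stuck
```

Calling it with `now=datetime.now(timezone.utc)` on a periodic schedule would at least surface stuck executions early, though it does nothing to explain the root cause.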
How we’ve tried to remedy this:
- Pressing the cancel button (the red X in the web UI) will show a confirmation dialog, but accepting the dialog causes the dialog to be stuck on the browser window forever.
- Using the spin CLI to call spin pipeline execution cancel ... will say that the pipeline execution is successfully canceled, but nothing actually changes. The UI is still in this state, and subsequent pipeline executions cannot proceed.
- When we see this, the current remedy we have is to delete the entire application and re-install pipelines.
- As suggested by @kskewes (thank you!), we could try attempting to delete the associated row in SQL – though would prefer not to have to do this. We’re interested in understanding and fixing the root cause! 🍍 🙂 🍍
- We tried searching through logs for messages containing the associated execution ID, but did not see anything of note.
Steps to Reproduce:
This is hard to reproduce, but anecdotally, it happens in roughly 1-2% of our pipeline executions. Our pipeline configuration has not been changing – the same configuration will get stuck sometimes, but run just fine at other times.
Additional Details:
- We’re currently running on Spinnaker 1.20.4 on GKE. Our Front50 and Orca are using SQL. Our spin CLI is version 1.17.0.
- We’re using the webhook API to trigger pipelines, but unsure if this is a contributing factor.
About this issue
- State: open
- Created 4 years ago
- Reactions: 1
- Comments: 15 (2 by maintainers)
@friendly-pineapple Hey, I came across a similar issue. I haven’t been able to reproduce it; however, following this runbook, I was able to at least cancel the execution.