spinnaker: Same GCB build can trigger many pipeline executions

Issue Summary:

The exact same GCB build can trigger many executions of the same pipeline.

Cloud Provider(s):

GCP

Environment:

Spinnaker 1.16.4. Spinnaker for GCP.

Feature Area:

GCB Triggers / PubSub / Echo

Description:

We have a pipeline configured with two GCB triggers (pointing to different Git repos), so that each successful build triggers a pipeline execution. However, each successful GCB build is triggering many executions: I’ve seen up to 10 or so queued executions for the same GCB build, at which point I had to disable the pipeline and manually cancel all queued executions.

Steps to Reproduce:

I haven’t found a way to reproduce this: the pipeline is generated with sponnet, just like two other pipelines that don’t seem to be affected by this issue so far. Each pipeline delivers the same software to a different Kubernetes cluster, although the artifacts come from different Git branches and therefore use different GCB triggers. All GCB triggers used by these pipelines share the same GCB account in Spinnaker.

Additional Details:

I confirmed that the executions were triggered by the same GCB build by looking at their trigger payloads in Gate (https://my-spinnaker.com/gate/pipelines/{pipeline-id}). For example, here is the relevant data for two of those executions:

$ jq '{id: .id, git_commit: .trigger.payload.sourceProvenance.resolvedRepoSource.commitSha, gcb_url: .trigger.payload.logUrl }' exec1.json 
{
  "id": "01DRH3KH6MAE2THEMGBMG08F5T",
  "git_commit": "1d8b769a26e44abdae5da2a58adc6af012496d7d",
  "gcb_url": "https://console.cloud.google.com/gcr/builds/c7d43830-362b-4467-96ad-543f1147bdda?project=12345"
}

$ jq '{id: .id, git_commit: .trigger.payload.sourceProvenance.resolvedRepoSource.commitSha, gcb_url: .trigger.payload.logUrl }' exec2.json 
{
  "id": "01DRH3JRQC204CWDBXRH1D3B2H",
  "git_commit": "1d8b769a26e44abdae5da2a58adc6af012496d7d",
  "gcb_url": "https://console.cloud.google.com/gcr/builds/c7d43830-362b-4467-96ad-543f1147bdda?project=12345"
}
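
(For what it’s worth, a quick way to see how many executions each GCB build produced is to slurp all the saved execution payloads and group them by build URL, a small sketch assuming the same exec*.json files fetched from Gate as above:)

$ jq -s 'group_by(.trigger.payload.logUrl) | map({gcb_url: .[0].trigger.payload.logUrl, executions: map(.id)})' exec*.json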

I suspect Spinnaker is failing to acknowledge the Pub/Sub messages in time:

Cloud Pub/Sub guarantees at-least-once message delivery, which means that occasional duplicates are to be expected. However, a high rate of duplicates may indicate that the client is not acknowledging messages within the configured ack_deadline_seconds, and Cloud Pub/Sub is retrying the message delivery. (…) Another possibility is that the subscriber is not acknowledging some messages because the code path processing those specific messages fails, and the Acknowledge call is never made; or the push endpoint never responds or responds with an error.

https://cloud.google.com/pubsub/docs/faq#duplicates
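
(If that is what’s happening, one quick check is the ack deadline on the subscription the GCB account listens on. SUBSCRIPTION_NAME below is a placeholder for whatever subscription is configured; raising the deadline is only a test, not a fix:)

$ gcloud pubsub subscriptions describe SUBSCRIPTION_NAME --format='value(ackDeadlineSeconds)'
$ # temporarily raise the deadline to see whether the duplicate deliveries go away
$ gcloud pubsub subscriptions update SUBSCRIPTION_NAME --ack-deadline=60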

I also had a look at the GCP PubSub metrics and this is what I found (data below is limited to the Spinnaker service account in the past 24 hours):

[Screenshot: Pub/Sub subscription metrics for the Spinnaker service account, 2019-10-31 16:02]

Most upvoted comments

@bsda: Thanks a lot for digging into this! It looks like the root cause here is that echo is calling the /history endpoint with limit=1 every time it needs the most recent pipeline definition. That endpoint is pretty inefficient as it grabs the entire history and sorts it.

I’ve added a new endpoint in spinnaker/front50#644 that will just get the most recent definition, and will migrate existing usages over to it. That should completely resolve your issue with extremely long calls to front50.
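
Until that’s released, you can confirm the history call is the slow part by timing it directly against front50. A rough sketch: PIPELINE_ID is a placeholder for the affected pipeline’s config ID, and spin-front50.default:8080 assumes the in-cluster service naming seen elsewhere in this thread, so run it from a pod in the cluster or via a port-forward:

$ time curl -s -o /dev/null "http://spin-front50.default:8080/pipelines/PIPELINE_ID/history?limit=1"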

Here is an update with our findings after upgrading Spinnaker: we no longer see this issue.

In summary:

  • we updated from 1.17.5 to 1.19.8
  • the last occurrence of the “Error triggering Pipeline” message appears just before the update started
  • the problem was occurring on pipelines that allow concurrent pipeline executions. After the upgrade we tested with this setting still enabled and saw no instances of a single webhook triggering multiple pipeline executions

As such, I don’t believe this warrants a separate issue, as it has likely been resolved by the fix for the orca bug that was addressed in the 1.18.x release.

Thanks @ezimanyi for the response. We are currently running Spinnaker 1.17.5 and plan to update to 1.19.8 next week, so we will be watching to see if this resolves our issue. There is definitely something up with orca, because we see the echo -> orca call being made and then echo registers an error in its log:

Error triggering Pipeline(apply, APPLICATIONNAME & ID) with retrofit.RetrofitError: timeout (attempt 1/5). Retrying...

We also notice that the second trigger attempt is made in under 20 seconds, so I suspect (though I can’t find a log entry confirming it) that orca returns a non-200 response to echo even though it does actually trigger a pipeline.
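
(For anyone else trying to confirm the same behaviour, grepping the echo and orca logs is a quick check; the namespace, deployment names and grep patterns below assume a standard Kubernetes install with the services in the default namespace, so adjust for your setup:)

$ kubectl -n default logs deploy/spin-echo | grep 'Error triggering Pipeline'
$ kubectl -n default logs deploy/spin-orca | grep -i front50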

I will post back here with findings after our update before deciding if a new issue is needed

From our point of view, the suggested actions fixed our issue and this ticket can be closed.

To answer the question above about whether there is functionality to delete history of a specific pipeline, I don’t believe that there is in Spinnaker. (There is, as you note, a way to delete the history of executions, but as far as I’m aware no way to delete the history of changes to the config itself.)

One solution that might work for you is to enable Object Lifecycle Management on the bucket and configure it to delete objects once there are more than N newer versions. Once the fix I note above has been released, that should also fix this performance issue for you.
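
A sketch of such a lifecycle rule with gsutil, where the bucket name and the N=50 threshold are placeholders (this only applies if Object Versioning is enabled on the bucket):

$ cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"numNewerVersions": 50}
    }
  ]
}
EOF
$ gsutil lifecycle set lifecycle.json gs://YOUR_FRONT50_BUCKET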

@ademariag are you using GCS for front50? I changed my bucket to be regional and deployed Spinnaker from scratch. The same pipeline now works fine, and a request that used to take more than 20s is now very quick: HTTP 200 http://spin-orca.default:8083/orchestrate (123ms). I don’t know if this might be related to having lots of entries in the pipeline history…

@gnarea: I think the part you extracted from your echo logs is indeed the interesting part. It looks like it’s taking orca longer than the 20s network timeout to process the call to /orchestrate which is causing echo to continually re-try it. (This makes me wonder if in fact we should not retry failed POST requests on network failures, as we’d have no way of knowing whether the original request completed or not.)

Looking into the orca logs for some more details, it seems like orca is getting a lot of failures on calling out to front50. I’m wondering if what’s happening is there are a lot of backed up network connections from orca -> front50 which is causing the echo -> orca connections to time out (as orca needs to fetch the pipeline from front50 to serve the /orchestrate endpoint).

I think the next step to debug is to figure out why these /orchestrate calls are taking > 20s to return. Per the above hypothesis, this might involve seeing whether front50 is overloaded or if there’s something else in that endpoint taking a long time.
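
(If you want to poke at this yourself, timing a couple of front50 calls directly is a reasonable first check; the service name, port, and APPLICATION_NAME below are assumptions based on the defaults mentioned earlier in this thread:)

$ curl -s -o /dev/null -w 'front50 /health: %{time_total}s\n' http://spin-front50.default:8080/health
$ curl -s -o /dev/null -w 'pipelines for app: %{time_total}s\n' "http://spin-front50.default:8080/pipelines/APPLICATION_NAME"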

As a workaround, you might also try setting this in your echo-local.yml:

okHttpClient:
    readTimeoutMs: 120000

That would increase the timeout on network calls to 2 min, hopefully reducing the number of retries on network failures. This is really just a workaround, though: there’s definitely something else going on here if your /orchestrate calls are routinely taking more than the default 20s.
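
If you’re managing the install with Halyard, the file goes in the custom profiles directory; a sketch assuming the default hal config location:

$ mkdir -p ~/.hal/default/profiles
$ cat > ~/.hal/default/profiles/echo-local.yml <<'EOF'
okHttpClient:
    readTimeoutMs: 120000
EOF
$ hal deploy apply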