spinnaker: v1.24.x Orca hits 404s attempting to communicate with Clouddriver endpoints
Issue Summary:
Orca hits 404s when attempting to hit Clouddriver endpoints
Cloud Provider(s):
AWS
Environment:
All
Feature Area:
Orca
Description:
After upgrading to 1.24.4, our Deploy stages for EC2 would fail roughly 80-90% of the time when attempting to communicate with clusters.
The most common failures occurred during the Disable Cluster or Scale Down Cluster steps. The error generated in those cases was no cluster details found for Cluster aws/$env/$fullClusterName/Moniker(app=$appName, cluster=$clusterName, detail=$detail, stack=$stack, sequence=null).
During a Destroy Cluster step, we encountered the error Unable to locate ancestor_asg_dynamic in $env/us-east-1/$clusterName. This seems related to Issue #6335.
Additionally, during a Rolling Red/Black deploy, we encountered an error on the newly created cluster when it was attempting to scale to 80% of expected capacity. That error was Server group 'us-east-1:$clusterName-v00x' does not exist.
At the root of all of these errors, c.n.s.orca.clouddriver.OortService receives a 404 when attempting a GET against the relevant Clouddriver endpoints. Specifically, these endpoints (see the probe sketch below the list):
- $clouddriverUrl/applications/$appName/cluster/$env/$clusterName/$cloudProvider (errors 1 and 3)
- $clouddriverUrl/serverGroups/$env/$region/$fullClusterName (error 2)
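To confirm where the 404s originate, it can help to hit the same Clouddriver endpoints directly, outside of Orca. Below is a minimal probe sketch; all values ($clouddriverUrl, $appName, $env, $region, $clusterName, $fullClusterName, $cloudProvider) are placeholders, and the paths simply mirror the ones listed above rather than being an authoritative description of the Clouddriver API.

```python
# Minimal probe of the Clouddriver endpoints that Orca's OortService is calling.
# All values below are placeholders (assumptions) -- substitute your own.
import urllib.error
import urllib.request

CLOUDDRIVER_URL = "http://clouddriver.internal:7002"  # $clouddriverUrl (placeholder)
APP_NAME = "myapp"                                    # $appName (placeholder)
ACCOUNT = "my-aws-account"                            # $env (placeholder)
REGION = "us-east-1"                                  # $region (placeholder)
CLUSTER_NAME = "myapp-stack"                          # $clusterName (placeholder)
FULL_CLUSTER_NAME = "myapp-stack-detail"              # $fullClusterName (placeholder)
CLOUD_PROVIDER = "aws"                                # $cloudProvider (placeholder)

# Same paths as reported above.
paths = [
    f"/applications/{APP_NAME}/cluster/{ACCOUNT}/{CLUSTER_NAME}/{CLOUD_PROVIDER}",
    f"/serverGroups/{ACCOUNT}/{REGION}/{FULL_CLUSTER_NAME}",
]

for path in paths:
    url = CLOUDDRIVER_URL + path
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(f"{resp.status} GET {path}")
    except urllib.error.HTTPError as e:
        # A 404 here means Clouddriver (or its cache) can't resolve the
        # cluster/server group itself, independent of anything Orca is doing.
        print(f"{e.code} GET {path}")
```

If these requests succeed directly but Orca still sees 404s, the problem is more likely in Orca's request path or timing than in Clouddriver's data.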
Steps to Reproduce:
- Be on 1.24.4
- Have a pipeline with a deploy step
- Be using Redis as the execution state backend (possibly a factor; unconfirmed)
- Attempt a Red/Black deploy, Destroy Server Group, or Rolling Red/Black deploy stage (see the sketch after this list for driving this repeatedly)
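Because the failure is intermittent, it may be worth triggering the repro pipeline repeatedly and tallying the outcomes. The sketch below assumes a Gate instance at a placeholder URL with no auth, the standard Gate trigger endpoint POST /pipelines/{application}/{pipelineName}, a response containing a "ref" to the execution, and GET /pipelines/{executionId} for status; verify these assumptions against your Gate version and add auth headers as needed.

```python
# Repeatedly trigger the repro pipeline through Gate and tally final statuses.
# The Gate URL, application and pipeline names are placeholders; the trigger
# response shape ({"ref": "/pipelines/<id>"}) is an assumption.
import json
import time
import urllib.request

GATE_URL = "http://gate.internal:8084"  # placeholder
APPLICATION = "myapp"                   # placeholder
PIPELINE = "deploy-ec2"                 # placeholder
RUNS = 10

def trigger() -> str:
    """Trigger one execution and return its id (assumed response shape)."""
    req = urllib.request.Request(
        f"{GATE_URL}/pipelines/{APPLICATION}/{PIPELINE}",
        data=json.dumps({"type": "manual"}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.loads(resp.read())
    return body["ref"].rsplit("/", 1)[-1]

def wait_for(execution_id: str) -> str:
    """Poll the execution until it reaches a terminal status."""
    while True:
        with urllib.request.urlopen(f"{GATE_URL}/pipelines/{execution_id}", timeout=30) as resp:
            status = json.loads(resp.read()).get("status")
        if status not in ("NOT_STARTED", "RUNNING", "PAUSED"):
            return status
        time.sleep(15)

results = {}
for _ in range(RUNS):
    status = wait_for(trigger())
    results[status] = results.get(status, 0) + 1
print(results)  # e.g. {'TERMINAL': 8, 'SUCCEEDED': 2} would match the ~80-90% failure rate
```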
Additional Details:
Note again that these endpoints did not fail 100% of the time, but they did fail the vast majority of the time.
Also note that rolling back to 1.24.2 appears to have resolved this issue.
I do have the full syslog from the server available, but it would take some time to redact identifying information. Please let me know if you’d like me to do so.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 10
- Comments: 32
sounds related to mine, so 👍 https://github.com/spinnaker/spinnaker/issues/6335
We've tested with 1.25.2 and most of our EC2 deployments hang: instances are not visible in the Clusters section and the ASG is not reported healthy, so the deploy stages run until they time out. We upgraded from 1.22.x, where we didn't have this issue.
@chris-bosman, @vide you don't see this issue with 1.23.x? Good to know; we might be safe to upgrade to that, then.
This seems to be an issue with 1.24, period, rather than specifically 1.24.4. After downgrading to 1.23.6, I’ve gone the entire week without a single 404 failure.