spinnaker: v1.24.x Orca hits 404s attempting to communicate with Clouddriver endpoints
Issue Summary:
Orca hits 404s when attempting to hit Clouddriver endpoints
Cloud Provider(s):
AWS
Environment:
All
Feature Area:
Orca
Description:
After upgrading to 1.24.4, our Deploy stages for EC2 would fail roughly 80-90% of the time when attempting to communicate with clusters.
The most common failures occurred during the Disable Cluster or Scale Down Cluster steps. The error generated in those cases was no cluster details found for Cluster aws/$env/$fullClusterName/Moniker(app=$appName, cluster=$clusterName, detail=$detail, stack=$stack, sequence=null).
During a Destroy Cluster step, we encountered the error Unable to locate ancestor_asg_dynamic in $env/us-east-1/$clusterName. This seems related to Issue #6335.
Additionally, during a Rolling Red/Black deploy, we encountered an error on the newly created cluster when it was attempting to scale to 80% of expected capacity. That error was Server group 'us-east-1:$clusterName-v00x' does not exist.
At the root of all of these errors, c.n.s.orca.clouddriver.OortService receives a 404 when attempting a GET against the relevant Clouddriver endpoints. Specifically, these endpoints (see the probe sketch below the list):
- $clouddriverUrl/applications/$appName/cluster/$env/$clusterName/$cloudProvider (errors 1 and 3)
- $clouddriverUrl/serverGroups/$env/$region/$fullClusterName (error 2)
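To confirm where the 404s originate, it can help to hit the same Clouddriver endpoints directly, outside of Orca. Below is a minimal probe sketch; all values ($clouddriverUrl, $appName, $env, $region, $clusterName, $fullClusterName, $cloudProvider) are placeholders, and the paths simply mirror the ones listed above rather than being an authoritative description of the Clouddriver API.

```python
# Minimal probe of the Clouddriver endpoints that Orca's OortService is calling.
# All values below are placeholders (assumptions) -- substitute your own.
import urllib.error
import urllib.request

CLOUDDRIVER_URL = "http://clouddriver.internal:7002"  # $clouddriverUrl (placeholder)
APP_NAME = "myapp"                                    # $appName (placeholder)
ACCOUNT = "my-aws-account"                            # $env (placeholder)
REGION = "us-east-1"                                  # $region (placeholder)
CLUSTER_NAME = "myapp-stack"                          # $clusterName (placeholder)
FULL_CLUSTER_NAME = "myapp-stack-detail"              # $fullClusterName (placeholder)
CLOUD_PROVIDER = "aws"                                # $cloudProvider (placeholder)

# Same paths as reported above.
paths = [
    f"/applications/{APP_NAME}/cluster/{ACCOUNT}/{CLUSTER_NAME}/{CLOUD_PROVIDER}",
    f"/serverGroups/{ACCOUNT}/{REGION}/{FULL_CLUSTER_NAME}",
]

for path in paths:
    url = CLOUDDRIVER_URL + path
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(f"{resp.status} GET {path}")
    except urllib.error.HTTPError as e:
        # A 404 here means Clouddriver (or its cache) can't resolve the
        # cluster/server group itself, independent of anything Orca is doing.
        print(f"{e.code} GET {path}")
```

If these requests succeed directly but Orca still sees 404s, the problem is more likely in Orca's request path or timing than in Clouddriver's data.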
Steps to Reproduce:
- Be on 1.24.4
- Have a pipeline with a deploy step
- Be using Redis as the execution state backend (possibly a factor; unconfirmed)
- Attempt a Red/Black deploy, Destroy Server Group, or Rolling Red/Black deploy stage (see the sketch after this list for driving this repeatedly)
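Because the failure is intermittent, it may be worth triggering the repro pipeline repeatedly and tallying the outcomes. The sketch below assumes a Gate instance at a placeholder URL with no auth, the standard Gate trigger endpoint POST /pipelines/{application}/{pipelineName}, a response containing a "ref" to the execution, and GET /pipelines/{executionId} for status; verify these assumptions against your Gate version and add auth headers as needed.

```python
# Repeatedly trigger the repro pipeline through Gate and tally final statuses.
# The Gate URL, application and pipeline names are placeholders; the trigger
# response shape ({"ref": "/pipelines/<id>"}) is an assumption.
import json
import time
import urllib.request

GATE_URL = "http://gate.internal:8084"  # placeholder
APPLICATION = "myapp"                   # placeholder
PIPELINE = "deploy-ec2"                 # placeholder
RUNS = 10

def trigger() -> str:
    """Trigger one execution and return its id (assumed response shape)."""
    req = urllib.request.Request(
        f"{GATE_URL}/pipelines/{APPLICATION}/{PIPELINE}",
        data=json.dumps({"type": "manual"}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.loads(resp.read())
    return body["ref"].rsplit("/", 1)[-1]

def wait_for(execution_id: str) -> str:
    """Poll the execution until it reaches a terminal status."""
    while True:
        with urllib.request.urlopen(f"{GATE_URL}/pipelines/{execution_id}", timeout=30) as resp:
            status = json.loads(resp.read()).get("status")
        if status not in ("NOT_STARTED", "RUNNING", "PAUSED"):
            return status
        time.sleep(15)

results = {}
for _ in range(RUNS):
    status = wait_for(trigger())
    results[status] = results.get(status, 0) + 1
print(results)  # e.g. {'TERMINAL': 8, 'SUCCEEDED': 2} would match the ~80-90% failure rate
```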
Additional Details:
Note again that these endpoints did not fail 100% of the time, but they did fail the vast majority of the time.
Also note that rolling back to 1.24.2 appears to have resolved this issue.
I do have the full syslog from the server available, but it would take some time to redact identifying information. Please let me know if you’d like me to do so.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 10
- Comments: 32
sounds related to mine, so 👍 https://github.com/spinnaker/spinnaker/issues/6335
We've tested with 1.25.2 and most of our EC2 deployments hang: instances are not visible in the Clusters section and the ASG is not reported healthy, so the deploy stages run until they time out. We upgraded from 1.22.x, where we didn't have this issue.
@chris-bosman, @vide you don't see this issue with 1.23.x? Good to know; we might be safe to upgrade to that, then.
This seems to be an issue with 1.24, period, rather than specifically 1.24.4. After downgrading to 1.23.6, I’ve gone the entire week without a single 404 failure.