spinnaker: EC2 Deploy stage does not disable and scale down previous servergroup

Issue Summary:

EC2 Deploy stage does not disable and scale down previous servergroup

Cloud Provider(s):

AWS EC2

Environment:

Spinnaker 1.25.1

Feature Area:

Pipelines

Description:

After upgrading from 1.23.0 to 1.25.1, we see that the EC2 Deploy stage ends successfully but leaves the previous server group up and running. As a result, both server groups (old and new versions) are registered at the ELB. We have set the following parameters on our Deploy stage:

"scaleDown": true,
"maxRemainingAsgs": 3,

We would expect that after each successful deploy only the newest server group is enabled and scaled according to the scaling config, and that the previous two server groups are disabled and scaled to zero.
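
To make the expectation concrete, here is a minimal Python sketch of the end state we would expect after one deploy. It is only a model of the behaviour described above, not Spinnaker code; the function name `expected_cluster_state`, the state labels and the `app-v00x` names are made up for illustration.

```python
def expected_cluster_state(existing, new_group, max_remaining_asgs=3):
    """Model the expected result of one Deploy run with scaleDown enabled.

    existing: server group names, newest first (hypothetical input shape).
    Returns a dict mapping each server group name to its expected state.
    """
    ordered = [new_group] + list(existing)
    state = {}
    for position, name in enumerate(ordered):
        if position == 0:
            # Only the freshly deployed server group should take traffic.
            state[name] = "active"
        elif position < max_remaining_asgs:
            # Older groups inside the retention window are kept but idle.
            state[name] = "disabled, scaled to zero"
        else:
            # Anything beyond maxRemainingAsgs is removed.
            state[name] = "destroyed"
    return state


if __name__ == "__main__":
    before = ["app-v003", "app-v002", "app-v001"]
    for name, status in expected_cluster_state(before, "app-v004").items():
        print(f"{name}: {status}")
```

With three existing server groups and `maxRemainingAsgs` set to 3, the model keeps the new group active, leaves the two previous ones disabled at zero capacity, and drops the oldest.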

Steps to Reproduce:

  • Create a Pipeline with a Bake and Deploy stage.
  • Set "scaleDown": true and "maxRemainingAsgs": 3 for the Deploy stage.
  • Let it run several times.

Additional Details:

Also discussed in slack: https://spinnakerteam.slack.com/archives/CB78MFQPP/p1615223274013400

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 4
  • Comments: 37 (1 by maintainers)

Most upvoted comments

A small update to the original post of my colleague @jgrumboe here: we have just swapped the Clouddriver to use MySQL instead of Redis (a side note, we’re in GCP with Spinnaker - so went from Cloud Memorystore to CloudSQL but not fully, see the migration instructions). We went at the end from 1.23.7 to 1.26.6.

I’ve now done maybe 20-ish EC2-based deployments, and there have been no “failures” like this at all. We will know more in a week when the teams (who still do EC2) continue their deployments, so we’ll bump and update then.

Thanks for the tips that this could be a “Clouddriver-Redis-only” issue! Really looks like this now…

I added this to the AWS SIG meeting agenda. We will definitely talk about this 😃

Is that really a solution though? “Just don’t use that persistence provider”, to me that is a workaround not a fix. Is Redis no longer supported for Clouddriver?

Version: 1.26.4

I configured a delay before disable and before scale down, and afterwards started seeing more consistent results. Just a wild guess: maybe it is related to some kind of race condition. I’ve had success with 5 out of 5 deployments. While it isn’t a permanent fix, it might provide a way to work around the issue.
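
For anyone wanting to try the same workaround: the comment does not name the settings, so the sketch below assumes it refers to the red/black strategy fields `delayBeforeDisableSec` and `delayBeforeScaleDownSec` in the Deploy stage's cluster definition. The 60-second values, the `"strategy": "redblack"` entry and the helper name `with_delays` are placeholders, not something taken from this thread.

```python
import json


def with_delays(cluster, disable_delay_sec=60, scale_down_delay_sec=60):
    """Return a copy of a Deploy stage cluster definition with delays added.

    The field names are an assumption about which settings the comment means;
    the default values are arbitrary placeholders.
    """
    patched = dict(cluster)  # shallow copy so the input stays untouched
    patched["delayBeforeDisableSec"] = disable_delay_sec
    patched["delayBeforeScaleDownSec"] = scale_down_delay_sec
    return patched


if __name__ == "__main__":
    cluster = {"strategy": "redblack", "scaleDown": True, "maxRemainingAsgs": 3}
    print(json.dumps(with_delays(cluster), indent=2))
```

The printed JSON can be compared against the stage JSON in the pipeline editor to see whether the delays are actually set.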

After downgrading to 1.23.x we are not seeing the issue, so hopefully that helps track down the problem. Another thing I noticed, we have several AWS accounts but this only occurred in one account. That account is by far the largest in terms of EC2 usage so I’m guessing that is a factor as well.

@xibz sorry could not join the SIG today but will join the next…

@matthack huge thanks for the update, interesting detail. We’ll do the same when possible and monitor & report back what happens.

I did an experiment where I cut over our Clouddriver from Redis to SQL and Spinnaker appears to again do red black as expected, sampling 12 runs of our deployment. Previously we were using Redis for Clouddriver’s backend cache and about half of our red black deployments would fail to scale down. We’re using 1.26.3.

If SQL is now expected to be a requirement, it should be called out clearly in the docs and not labeled as an improvement or a way to make your Spinnaker deployment more robust.

There’s an aws sig meeting this week. Not sure if you’ve already tried joining and bringing up this issue, but that might help get it on more people’s radar.

@matthack sounds weird, but we’re glad not to be the only ones having this issue. We’re running Orca on SQL and Clouddriver on Redis. Currently, we’re still on 1.23.7. We haven’t tested the latest versions so far.

@dreynaud @cfieber Do you have any new thoughts on this? What else could we try?

@dreynaud I ran the pipeline again with 1.25.2. These were the server groups that existed when the pipeline started:

  • exampleapp-design-v683 (active)
  • exampleapp-design-v682 (disabled, scaled to zero)
  • exampleapp-design-v681 (disabled, scaled to zero)

This was the result after the pipeline execution:

  • exampleapp-design-v684 (active)
  • exampleapp-design-v683 (active)
  • exampleapp-design-v682 (disabled, scaled to zero)

So again, shrinking worked but disabling did not. I have attached the (hopefully) complete Orca logs and the execution logs of this run: execution-1.25.2.json.gz, extract-2021-03-26_14-03-05.csv.gz
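
The before/after lists above can be checked mechanically. This is just a small Python sketch over the states I wrote down by hand; the function name `stray_active` and the dict layout are my own and have nothing to do with the Spinnaker API.

```python
def stray_active(observed, newest):
    """Return server groups that are still active although they are not the newest."""
    return sorted(
        name for name, status in observed.items()
        if status == "active" and name != newest
    )


if __name__ == "__main__":
    # States copied from the result list above.
    after_run = {
        "exampleapp-design-v684": "active",
        "exampleapp-design-v683": "active",
        "exampleapp-design-v682": "disabled, scaled to zero",
    }
    print(stray_active(after_run, "exampleapp-design-v684"))
```

For this run it reports exampleapp-design-v683, which is the server group that should have been disabled.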

And these are again the logs filtered by AbstractClusterWideClouddriverTask.

date | Status | logger_name | @stageType | message
-- | -- | -- | -- | --
2021-03-26T13:27:00.423Z | debug | com.netflix.spinnaker.orca.clouddriver.tasks.cluster.AbstractClusterWideClouddriverTask | shrinkCluster | Server groups fetched from cluster (exampleapp-design):   [exampleapp-design-v684, exampleapp-design-v683, exampleapp-design-v682,   exampleapp-design-v681]
2021-03-26T13:27:00.426Z | info | com.netflix.spinnaker.orca.clouddriver.tasks.cluster.AbstractClusterWideClouddriverTask | shrinkCluster | Preserving recently deployed server groups ()
2021-03-26T13:27:00.427Z | debug | com.netflix.spinnaker.orca.clouddriver.tasks.cluster.AbstractClusterWideClouddriverTask | shrinkCluster | Filtered cluster server groups (excluding parent deploys) in locations   [Location{type=REGION, value='eu-central-1'}]: [exampleapp-design-v681]
2021-03-26T13:27:00.427Z | debug | com.netflix.spinnaker.orca.clouddriver.tasks.cluster.AbstractClusterWideClouddriverTask | shrinkCluster | Kato ops for executionId (01F1QB9X7VG8GK5GWNKQ2E1CQ5):   [[destroyServerGroup:[credentials:org-example-design,   accountName:org-example-design, serverGroupName:exampleapp-design-v681,   asgName:exampleapp-design-v681, cloudProvider:aws, region:eu-central-1]]]
2021-03-26T13:27:07.016Z | debug | com.netflix.spinnaker.orca.clouddriver.tasks.cluster.AbstractClusterWideClouddriverTask | disableCluster | Server groups fetched from cluster (exampleapp-design):   [exampleapp-design-v683, exampleapp-design-v682]
2021-03-26T13:27:07.017Z | info | com.netflix.spinnaker.orca.clouddriver.tasks.cluster.AbstractClusterWideClouddriverTask | disableCluster | Preserving recently deployed server groups ()
2021-03-26T13:27:07.017Z | debug | com.netflix.spinnaker.orca.clouddriver.tasks.cluster.AbstractClusterWideClouddriverTask | disableCluster | Filtered cluster server groups (excluding parent deploys) in locations   [Location{type=REGION, value='eu-central-1'}]: []
2021-03-26T13:27:07.018Z | debug | com.netflix.spinnaker.orca.clouddriver.tasks.cluster.AbstractClusterWideClouddriverTask | disableCluster | Kato ops for executionId (01F1QB9X7VG8GK5GWNKQ2E1CQ5): []
2021-03-26T13:27:07.018Z | warn | com.netflix.spinnaker.orca.clouddriver.tasks.cluster.AbstractClusterWideClouddriverTask | disableCluster | 01F1QB9X7VG8GK5GWNKQ2E1CQ5: No server groups to operate on from   [Location{type=REGION, value='eu-central-1'}:[[buildInfo:[commit:1eb3db9,   package_name:exampleapp, version:2021.03.25],   instances:[[name:i-0b4f3b9f8e58d68ff, launchTime:1616764377000, health:[[healthClass:platform,   state:Unknown, type:Amazon], [application:exampleapp,   instanceId:i-0b4f3b9f8e58d68ff, loadBalancers:[[description:N/A,   healthState:Up, instanceId:i-0b4f3b9f8e58d68ff,   loadBalancerName:example-app-elb-spin, loadBalancerType:classic,   reasonCode:N/A, state:InService]], state:Up, type:LoadBalancer]],   healthState:Up, zone:eu-central-1b], [name:i-0262a301a949299ee,   launchTime:1616764377000, health:[[healthClass:platform, state:Unknown,   type:Amazon], [application:exampleapp, instanceId:i-0262a301a949299ee,   loadBalancers:[[description:N/A, healthState:Up,   instanceId:i-0262a301a949299ee, loadBalancerName:example-app-elb-spin,   loadBalancerType:classic, reasonCode:N/A, state:InService]], state:Up,   type:LoadBalancer]], healthState:Up, zone:eu-central-1a]],   instanceCounts:[down:0, outOfService:0, starting:0, total:2, unknown:0,   up:2], launchConfig:[application:exampleapp, associatePublicIpAddress:false,   blockDeviceMappings:[[deviceName:/dev/sda1, ebs:[deleteOnTermination:true,   volumeSize:30, volumeType:gp2]]], classicLinkVPCSecurityGroups:[],   createdTime:1616764373044, ebsOptimized:false,   iamInstanceProfile:arn:aws:iam::REDACTED:instance-profile/example-design-instanceprofiles-ref-commonprofile-UONB8HY9WO1W,   imageId:ami-REDACTED, instanceMonitoring:[enabled:true],   instanceType:t3a.small, kernelId:, keyName:aws-master.org.example-design,   launchConfigurationARN:arn:aws:autoscaling:eu-central-1:REDACTED:launchConfiguration:df790c56-33da-449e-be7f-975a6688ffae:launchConfigurationName/exampleapp-design-v683-03262021131252,  
 launchConfigurationName:exampleapp-design-v683-03262021131252, ramdiskId:,   securityGroups:[sg-0f001d66, sg-5cb1af31],   userData:IyEvYmluL2Jhc2gKY2F0ID4vZXRjL2RlZmF1bHQvc3Bpbi1tZXRhZGF0YSA8PEVPTAojIyBUaGVzZSBjb21lIHdpdGggZ2xvYmFsIGJvb3RzdHJhcC4KIyBNb3JlIGluZm9ybWF0aW9uOiBodHRwczovL3d3dy5zcGlubmFrZXIuaW8vc2V0dXAvZmVhdHVyZXMvdXNlci1kYXRhLyBhbmQgZnJvbSBJaXJvClNQSU5fQUNDT1VOVD0icmJtaC11aW0tZGVzaWduIgpTUElOX0FDQ09VTlRfVFlQRT0icmJtaC11aW0tZGVzaWduIgpTUElOX0VOVklST05NRU5UPSJyYm1oLXVpbS1kZXNpZ24iClNQSU5fQVBQTElDQVRJT049InVpbWFwcCIKU1BJTl9TRVJWRVJfR1JPVVA9InVpbWFwcC1kZXNpZ24tdjY4MyIKU1BJTl9DTFVTVEVSPSJ1aW1hcHAtZGVzaWduIgpTUElOX1NUQUNLPSJkZXNpZ24iClNQSU5fREVUQUlMPSIiClNQSU5fUkVHSU9OPSJldS1jZW50cmFsLTEiCkVPTAojIEFwcGxpY2F0aW9uLWJhc2VkIHVzZXItZGF0YSBzdGFydHMgZnJvbSBoZXJlLgoKIyEvYmluL2Jhc2ggLXhlCiMKIyBUaGUgYWN0dWFsIGJvb3RzdHJhcCBzY3JpcHQgaGFzIGJlZW4gbW92ZWQgdG8gcGFja2FnaW5nLiBJdCBpcyBhbHdheXMgZm91bmQgZnJvbToKIyAgIC91c3IvbG9jYWwvYmluL2Jvb3RzdHJhcC5zaAojCmVjaG8gIkV4ZWN1dGluZyB0aGUgYm9vdHN0cmFwLnNoIGZyb20gL3Vzci9sb2NhbC9iaW4uLi4iCi91c3IvbG9jYWwvYmluL2Jvb3RzdHJhcC5zaAo=],   type:aws, capacity:[desired:2, max:2, min:1, pinned:false],   scheduledActions:[], launchConfigName:exampleapp-design-v683-03262021131252,   cloudProvider:aws, vpcId:vpc-9e2d61f7, createdTime:1616764374236,   disabled:false, serverGroupManagers:[], image:[architecture:x86_64,   blockDeviceMappings:[[deviceName:/dev/sda1, ebs:[deleteOnTermination:true,   encrypted:false, snapshotId:snap-0844f442ba094afa1, volumeSize:8, volumeType:gp2]],   [deviceName:/dev/sdb, virtualName:ephemeral0], [deviceName:/dev/sdc,   virtualName:ephemeral1]], creationDate:2021-03-25T20:21:48.000Z,   enaSupport:true, hypervisor:xen, imageId:ami-REDACTED,   imageLocation:148880244575/exampleapp-all-20210325201802-bionic,   imageType:machine, name:exampleapp-all-20210325201802-bionic,   ownerId:148880244575, platformDetails:Linux/UNIX, productCodes:[],   public:false, rootDeviceName:/dev/sda1, rootDeviceType:ebs,   sriovNetSupport:simple, 
state:available, tags:[[key:build_host, value:],   [key:build_info_url, value:], [key:Name, value:example App],   [key:Description, value:Deployment AMI for example App.], [key:appversion,   value:exampleapp-2021.03.25-h15177.1eb3db9]], usageOperation:RunInstances,   virtualizationType:hvm], instanceType:t3a.small,   loadBalancers:[example-app-elb-spin], moniker:[app:exampleapp,   cluster:exampleapp-design, sequence:683, stack:design], zones:[eu-central-1a,   eu-central-1b], labels:[:], scalingPolicies:[],   asg:[autoScalingGroupARN:arn:aws:autoscaling:eu-central-1:REDACTED:autoScalingGroup:8406afbd-d41b-419f-b5fd-1f56600eef8f:autoScalingGroupName/exampleapp-design-v683,   autoScalingGroupName:exampleapp-design-v683,   availabilityZones:[eu-central-1a, eu-central-1b], createdTime:1616764374236,   defaultCooldown:10, desiredCapacity:2, enabledMetrics:[],   healthCheckGracePeriod:180, healthCheckType:ELB,   launchConfigurationName:exampleapp-design-v683-03262021131252,   loadBalancerNames:[example-app-elb-spin], maxSize:2, minSize:1,   newInstancesProtectedFromScaleIn:false,   serviceLinkedRoleARN:arn:aws:iam::REDACTED:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling,   suspendedProcesses:[], tags:[[key:Name, propagateAtLaunch:true,   resourceId:exampleapp-design-v683, resourceType:auto-scaling-group,   value:exampleapp], [key:buildVersion, propagateAtLaunch:true,   resourceId:exampleapp-design-v683, resourceType:auto-scaling-group,   value:2021.03.25-1eb3db9]], targetGroupARNs:[],   terminationPolicies:[default], vpczoneIdentifier:subnet-aaba9dc3,subnet-8d9be1f6],   application:exampleapp, targetGroups:[], name:exampleapp-design-v683,   securityGroups:[sg-0f001d66, sg-5cb1af31], region:eu-central-1],   [buildInfo:[commit:1eb3db9, package_name:exampleapp, version:2021.03.25],   instances:[], instanceCounts:[down:0, outOfService:0, starting:0, total:0,   unknown:0, up:0], launchConfig:[application:exampleapp,   
associatePublicIpAddress:false, blockDeviceMappings:[[deviceName:/dev/sda1,   ebs:[deleteOnTermination:true, volumeSize:30, volumeType:gp2]]], classicLinkVPCSecurityGroups:[],   createdTime:1616746821405, ebsOptimized:false,   iamInstanceProfile:arn:aws:iam::REDACTED:instance-profile/example-design-instanceprofiles-ref-commonprofile-UONB8HY9WO1W,   imageId:ami-REDACTED, instanceMonitoring:[enabled:true], instanceType:t3a.small,   kernelId:, keyName:aws-master.org.example-design,   launchConfigurationARN:arn:aws:autoscaling:eu-central-1:REDACTED:launchConfiguration:8e2c2bd8-070c-4da4-8b34-e072e1810dcd:launchConfigurationName/exampleapp-design-v682-03262021082020,   launchConfigurationName:exampleapp-design-v682-03262021082020, ramdiskId:,   securityGroups:[sg-0f001d66, sg-5cb1af31],   userData:IyEvYmluL2Jhc2gKY2F0ID4vZXRjL2RlZmF1bHQvc3Bpbi1tZXRhZGF0YSA8PEVPTAojIyBUaGVzZSBjb21lIHdpdGggZ2xvYmFsIGJvb3RzdHJhcC4KIyBNb3JlIGluZm9ybWF0aW9uOiBodHRwczovL3d3dy5zcGlubmFrZXIuaW8vc2V0dXAvZmVhdHVyZXMvdXNlci1kYXRhLyBhbmQgZnJvbSBJaXJvClNQSU5fQUNDT1VOVD0icmJtaC11aW0tZGVzaWduIgpTUElOX0FDQ09VTlRfVFlQRT0icmJtaC11aW0tZGVzaWduIgpTUElOX0VOVklST05NRU5UPSJyYm1oLXVpbS1kZXNpZ24iClNQSU5fQVBQTElDQVRJT049InVpbWFwcCIKU1BJTl9TRVJWRVJfR1JPVVA9InVpbWFwcC1kZXNpZ24tdjY4MiIKU1BJTl9DTFVTVEVSPSJ1aW1hcHAtZGVzaWduIgpTUElOX1NUQUNLPSJkZXNpZ24iClNQSU5fREVUQUlMPSIiClNQSU5fUkVHSU9OPSJldS1jZW50cmFsLTEiCkVPTAojIEFwcGxpY2F0aW9uLWJhc2VkIHVzZXItZGF0YSBzdGFydHMgZnJvbSBoZXJlLgoKIyEvYmluL2Jhc2ggLXhlCiMKIyBUaGUgYWN0dWFsIGJvb3RzdHJhcCBzY3JpcHQgaGFzIGJlZW4gbW92ZWQgdG8gcGFja2FnaW5nLiBJdCBpcyBhbHdheXMgZm91bmQgZnJvbToKIyAgIC91c3IvbG9jYWwvYmluL2Jvb3RzdHJhcC5zaAojCmVjaG8gIkV4ZWN1dGluZyB0aGUgYm9vdHN0cmFwLnNoIGZyb20gL3Vzci9sb2NhbC9iaW4uLi4iCi91c3IvbG9jYWwvYmluL2Jvb3RzdHJhcC5zaAo=],   type:aws, capacity:[desired:0, max:0, min:0, pinned:true],   scheduledActions:[], launchConfigName:exampleapp-design-v682-03262021082020,   cloudProvider:aws, vpcId:vpc-9e2d61f7, createdTime:1616746822594,   disabled:true, serverGroupManagers:[], 
image:[architecture:x86_64,   blockDeviceMappings:[[deviceName:/dev/sda1, ebs:[deleteOnTermination:true,   encrypted:false, snapshotId:snap-0844f442ba094afa1, volumeSize:8,   volumeType:gp2]], [deviceName:/dev/sdb, virtualName:ephemeral0],   [deviceName:/dev/sdc, virtualName:ephemeral1]],   creationDate:2021-03-25T20:21:48.000Z, enaSupport:true, hypervisor:xen,   imageId:ami-REDACTED,   imageLocation:148880244575/exampleapp-all-20210325201802-bionic,   imageType:machine, name:exampleapp-all-20210325201802-bionic,   ownerId:148880244575, platformDetails:Linux/UNIX, productCodes:[],   public:false, rootDeviceName:/dev/sda1, rootDeviceType:ebs,   sriovNetSupport:simple, state:available, tags:[[key:build_host, value:],   [key:build_info_url, value:], [key:Name, value:example App],   [key:Description, value:Deployment AMI for example App.], [key:appversion,   value:exampleapp-2021.03.25-h15177.1eb3db9]], usageOperation:RunInstances,   virtualizationType:hvm], instanceType:t3a.small, loadBalancers:[example-app-elb-spin],   moniker:[app:exampleapp, cluster:exampleapp-design, sequence:682,   stack:design], zones:[eu-central-1a, eu-central-1b], labels:[:],   scalingPolicies:[],   asg:[autoScalingGroupARN:arn:aws:autoscaling:eu-central-1:REDACTED:autoScalingGroup:f79c6ae5-3323-40eb-989c-e9dc6bad31cc:autoScalingGroupName/exampleapp-design-v682,   autoScalingGroupName:exampleapp-design-v682,   availabilityZones:[eu-central-1a, eu-central-1b], createdTime:1616746822594,   defaultCooldown:10, desiredCapacity:0, enabledMetrics:[],   healthCheckGracePeriod:180, healthCheckType:ELB,   launchConfigurationName:exampleapp-design-v682-03262021082020,   loadBalancerNames:[example-app-elb-spin], maxSize:0, minSize:0,   newInstancesProtectedFromScaleIn:false, serviceLinkedRoleARN:arn:aws:iam::REDACTED:role/aws-service-role/autoscaling.amazonaws.com/AWSServiceRoleForAutoScaling,   suspendedProcesses:[[processName:Terminate, suspensionReason:User suspended   at 2021-03-26T13:21:44Z], 
[processName:AddToLoadBalancer,   suspensionReason:User suspended at 2021-03-26T13:21:44Z],   [processName:Launch, suspensionReason:User suspended at   2021-03-26T13:21:44Z]], tags:[[key:Name, propagateAtLaunch:true,   resourceId:exampleapp-design-v682, resourceType:auto-scaling-group,   value:exampleapp], [key:buildVersion, propagateAtLaunch:true,   resourceId:exampleapp-design-v682, resourceType:auto-scaling-group,   value:2021.03.25-1eb3db9]], targetGroupARNs:[],   terminationPolicies:[default],   vpczoneIdentifier:subnet-aaba9dc3,subnet-8d9be1f6], application:exampleapp,   targetGroups:[], name:exampleapp-design-v682, securityGroups:[sg-0f001d66,   sg-5cb1af31], region:eu-central-1]]] in [Location{type=REGION,   value='eu-central-1'}]
2021-03-26T13:27:08.165Z | debug | com.netflix.spinnaker.orca.clouddriver.tasks.cluster.AbstractClusterWideClouddriverTask | scaleDownCluster | Server groups fetched from cluster (exampleapp-design):   [exampleapp-design-v683, exampleapp-design-v682]
2021-03-26T13:27:08.166Z | info | com.netflix.spinnaker.orca.clouddriver.tasks.cluster.AbstractClusterWideClouddriverTask | scaleDownCluster | Preserving recently deployed server groups ()
2021-03-26T13:27:08.170Z | debug | com.netflix.spinnaker.orca.clouddriver.tasks.cluster.AbstractClusterWideClouddriverTask | scaleDownCluster | Filtered cluster server groups (excluding parent deploys) in locations   [Location{type=REGION, value='eu-central-1'}]: [exampleapp-design-v682]
2021-03-26T13:27:08.170Z | debug | com.netflix.spinnaker.orca.clouddriver.tasks.cluster.AbstractClusterWideClouddriverTask | scaleDownCluster | Kato ops for executionId (01F1QB9X7VG8GK5GWNKQ2E1CQ5):   [[resumeAsgProcessesDescription:[credentials:org-example-design,   accountName:org-example-design, serverGroupName:exampleapp-design-v682,   asgName:exampleapp-design-v682, cloudProvider:aws, region:eu-central-1,   processes:[Terminate]]], [resizeServerGroup:[credentials:org-example-design,   accountName:org-example-design, serverGroupName:exampleapp-design-v682,   asgName:exampleapp-design-v682, cloudProvider:aws, region:eu-central-1,   capacity:[min:0, max:0, desired:0]]]]
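
To compare what each task saw without reading the whole table, the "Server groups fetched from cluster" rows can be pulled out with a few lines of Python. This is a rough sketch that assumes the pipe-separated layout of the export above; `fetched_by_stage` and `ROWS` are names I made up, and the three sample rows are copied from the table.

```python
import re

LOGGER = "com.netflix.spinnaker.orca.clouddriver.tasks.cluster.AbstractClusterWideClouddriverTask"
FETCHED = re.compile(r"Server groups fetched from cluster \([^)]+\):\s*\[([^\]]*)\]")


def fetched_by_stage(rows):
    """Map each @stageType to the server group list it fetched."""
    result = {}
    for row in rows:
        # Columns: date | Status | logger_name | @stageType | message
        columns = [column.strip() for column in row.split("|", 4)]
        if len(columns) != 5:
            continue  # not a data row
        stage_type, message = columns[3], columns[4]
        match = FETCHED.search(message)
        if match:
            names = match.group(1).split(",")
            result[stage_type] = [name.strip() for name in names if name.strip()]
    return result


ROWS = [
    f"2021-03-26T13:27:00.423Z | debug | {LOGGER} | shrinkCluster | "
    "Server groups fetched from cluster (exampleapp-design): "
    "[exampleapp-design-v684, exampleapp-design-v683, exampleapp-design-v682, exampleapp-design-v681]",
    f"2021-03-26T13:27:07.016Z | debug | {LOGGER} | disableCluster | "
    "Server groups fetched from cluster (exampleapp-design): "
    "[exampleapp-design-v683, exampleapp-design-v682]",
    f"2021-03-26T13:27:08.165Z | debug | {LOGGER} | scaleDownCluster | "
    "Server groups fetched from cluster (exampleapp-design): "
    "[exampleapp-design-v683, exampleapp-design-v682]",
]

if __name__ == "__main__":
    for stage_type, groups in fetched_by_stage(ROWS).items():
        print(stage_type, groups)
```

Only the shrinkCluster row lists exampleapp-design-v684; the two later tasks fetched a list without it.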

Seems your guess was right. The new server group (exampleapp-design-v684) is missing from the list fetched by disableCluster and by scaleDownCluster, but it is present in the list fetched by shrinkCluster.
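
One possible reading of these logs, sketched in Python: if disableCluster keeps the newest enabled server group and disables the rest, then a fetched list that lacks v684 makes v683 look like the newest one, so nothing gets disabled. This is a simplified model under that assumption, not Orca's actual implementation; `groups_to_disable` and `sequence` are made-up names.

```python
import re


def sequence(name):
    """Extract the numeric version suffix, e.g. 'exampleapp-design-v683' -> 683."""
    match = re.search(r"-v(\d+)$", name)
    if match is None:
        raise ValueError(f"no version suffix in {name!r}")
    return int(match.group(1))


def groups_to_disable(fetched, already_disabled, keep_enabled=1):
    """Return the groups a 'keep the newest N enabled' rule would disable."""
    enabled = sorted(
        (name for name in fetched if name not in already_disabled),
        key=sequence,
        reverse=True,
    )
    return enabled[keep_enabled:]


if __name__ == "__main__":
    already_disabled = {"exampleapp-design-v682"}
    complete = ["exampleapp-design-v684", "exampleapp-design-v683", "exampleapp-design-v682"]
    as_logged = ["exampleapp-design-v683", "exampleapp-design-v682"]
    print("complete list:", groups_to_disable(complete, already_disabled))
    print("list from the logs:", groups_to_disable(as_logged, already_disabled))
```

With the complete list the rule picks exampleapp-design-v683 for disabling; with the list from the logs it picks nothing, which matches the empty "Kato ops" entry for disableCluster.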

Is this something you can work with? Thanks for your support, Johannes

@dreynaud Update: Maybe this is what you’re asking for: I attached the JSON downloaded via the “source” link in the pipeline execution details in the UI. If that’s not the correct file, please tell me. execution-exampleapp.json.gz

And there are already a lot of orca logs in the CSV attachment of my first comment here. With Datadog I had no other option than exporting them to CSV. Are you still missing something?