upgrade-manager: Upgrade manager stuck in no InService instances.
Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT
What happened: We are using release v1.0.4 of upgrade manager. The manager is not able to complete the rollout, especially for clusters with more than roughly 20 nodes.
We’ve seen a couple of errors in the logs:
- failed to set instances to stand-by:
{"level":"info","ts":1649087216.0083492,"logger":"controllers.RollingUpgrade","msg":"failed to set instances to stand-by","instances":[{"AvailabilityZone":"us-west-2b","HealthStatus":"Healthy","InstanceId":"i-0c485e03bd870299e","InstanceType":"c6i.4xlarge","LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-02f0c454eec658bb0","LaunchTemplateName":"lt-k8s-1-020220120120625129100000006","Version":"2"},"LifecycleState":"InService","ProtectedFromScaleIn":true,"WeightedCapacity":null},{"AvailabilityZone":"us-west-2b","HealthStatus":"Healthy","InstanceId":"i-097374badf1782ccb","InstanceType":"c6i.4xlarge","LaunchConfigurationName":null,"LaunchTemplate":{"LaunchTemplateId":"lt-02f0c454eec658bb0","LaunchTemplateName":"lt-k8s-1-020220120120625129100000006","Version":"2"},"LifecycleState":"Standby","ProtectedFromScaleIn":true,"WeightedCapacity":null}],"message":"ValidationError: The instance i-097374badf1782ccb is not in InService.\n\tstatus code: 400, request id: a477d57c-af4e-44a6-8f3b-f89d711e1f35","name":"upgrade-manager/asg-k8s-1-02022012012062550510000000a"}
- no InService instances in the batch:
{"level":"info","ts":1649160753.7560294,"logger":"controllers.RollingUpgrade","msg":"selecting batch for rotation","batch size":1,"name":"upgrade-manager/asg-stage-k8s-1-420220201085050052200000045"}
{"level":"info","ts":1649160753.7560575,"logger":"controllers.RollingUpgrade","msg":"rotating batch","instances":["i-0738c1f7e01cf2ce7"],"name":"upgrade-manager/asg-stage-k8s-1-420220201085050052200000045"}
{"level":"info","ts":1649160753.7560735,"logger":"controllers.RollingUpgrade","msg":"no InService instances in the batch","batch":["i-0738c1f7e01cf2ce7"],"instances(InService)":[],"name":"upgrade-manager/asg-stage-k8s-1-420220201085050052200000045"}
In both cases, these log lines repeat until the manager fails, which takes around 1 hour. When we look at the ASG, the nodes named in the logs are in the Standby state. The timestamps of the upgrade manager’s stand-by logs match the times recorded in the ASG, so we can say the upgrade manager did set the nodes to stand-by. However, the upgrade manager then starts emitting the logs above and stays stuck in that error until it fails. The manager shows the same logs even if we manually drain and delete the node.
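In the first error, the batch sent to stand-by contains one instance that is InService and one that is already in Standby, and the whole request is rejected with a ValidationError. As a rough illustration only, here is a sketch of splitting a batch by lifecycle state before making such a call. The `instance` struct and `splitByState` helper are hypothetical names made up for this example and are not taken from the upgrade-manager code.

```go
package main

import "fmt"

// instance is a hypothetical, trimmed-down view of the fields seen in the log above.
type instance struct {
	InstanceID     string
	LifecycleState string
}

// splitByState returns the IDs that are InService and the IDs that are in any other state.
func splitByState(batch []instance) (inService, other []string) {
	for _, i := range batch {
		if i.LifecycleState == "InService" {
			inService = append(inService, i.InstanceID)
		} else {
			other = append(other, i.InstanceID)
		}
	}
	return inService, other
}

func main() {
	// The two instances from the "failed to set instances to stand-by" log entry.
	batch := []instance{
		{"i-0c485e03bd870299e", "InService"},
		{"i-097374badf1782ccb", "Standby"},
	}
	ready, skipped := splitByState(batch)
	fmt.Println("candidates for stand-by:", ready)
	fmt.Println("not InService:", skipped)
}
```

Whether the controller should skip, wait for, or continue processing the instances that are not InService is a separate design question that this sketch does not answer.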
What you expected to happen: Upgrade manager should roll out the nodes.
Environment:
- rolling-upgrade-controller version 1.0.4
- Kubernetes version : 1.21
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (8 by maintainers)
@shreyas-badiger Done.
@shreyas-badiger Thanks a lot for the detailed explanation. That was really helpful. I will make the change you suggested and test it on the cluster. If it works well, I’ll open a PR here.
About the batch: we have not specified any strategy type, so the if and else-if branches at https://github.com/keikoproj/upgrade-manager/blob/8e0f67db323ec5b7bea0fd4d9f96d23f499e7e66/controllers/upgrade.go#L419 do not come into play, and the result of CalculateMaxUnavailable is used.
We have 6 nodes in total in one ASG and maxUnavailable is 20%. intstr.GetValueFromIntOrPercent produces 2 rather than 1.2 (20% of 6) because it rounds up (“ceil”).
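To make the rounding concrete, here is a minimal standard-library sketch of the arithmetic. It assumes the percentage is scaled as percent × total / 100 and then rounded up or down. `scaledValue` is a hypothetical helper written for this example; it is not the real intstr.GetValueFromIntOrPercent from k8s.io/apimachinery.

```go
package main

import (
	"fmt"
	"math"
)

// scaledValue is a hypothetical stand-in that scales a percentage against a total
// and rounds the result up (ceil) or down (floor).
func scaledValue(percent, total int, roundUp bool) int {
	v := float64(percent) * float64(total) / 100
	if roundUp {
		return int(math.Ceil(v))
	}
	return int(math.Floor(v))
}

func main() {
	// 20% of 6 nodes is 1.2 before rounding.
	fmt.Println("round up:  ", scaledValue(20, 6, true))
	fmt.Println("round down:", scaledValue(20, 6, false))
}
```

With 6 nodes and 20%, rounding up yields a batch of 2, while rounding down would yield 1.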
Thanks @ameyajoshi99. We’ve addressed some issues around LaunchTemplate caching in https://github.com/keikoproj/upgrade-manager/pull/322, which is not in the latest release. Could you try out the :master tag and see if that works better? We can create a release with this fix if needed.
Also, you mention that all instances are in Standby. Are there no new instances InService? When an instance is set to Standby, a new one should automatically launch and become InService. Can you look at the ASG’s activity history to see if new instances failed to launch for some reason?
CC @shreyas-badiger