spinnaker: TaskHealthCachingAgent failing for ECS for large number of ECS tasks
Issue Summary:
Getting the following error when the ECS provider account has a large number of running tasks. In this case we have more than 7,500 tasks spread across 16 different ECS clusters.
2019-06-06 15:18:32.740 WARN 9929 --- [cutionAction-73] c.n.s.c.cache.LoggingInstrumentation : com.netflix.spinnaker.clouddriver.ecs.provider.EcsProvider:ecs-production/ap-south-1/TaskHealthCachingAgent completed with one or more failures
com.amazonaws.services.elasticloadbalancingv2.model.AmazonElasticLoadBalancingException: Rate exceeded (Service: AmazonElasticLoadBalancing; Status Code: 400; Error Code: Throttling; Request ID: 58aec762-886e-11e9-8c59-95465c457f75)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712) ~[aws-java-sdk-core-1.11.534.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367) ~[aws-java-sdk-core-1.11.534.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113) ~[aws-java-sdk-core-1.11.534.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770) ~[aws-java-sdk-core-1.11.534.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744) ~[aws-java-sdk-core-1.11.534.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726) ~[aws-java-sdk-core-1.11.534.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686) ~[aws-java-sdk-core-1.11.534.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668) ~[aws-java-sdk-core-1.11.534.jar:na]
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532) ~[aws-java-sdk-core-1.11.534.jar:na]
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512) ~[aws-java-sdk-core-1.11.534.jar:na]
at com.amazonaws.services.elasticloadbalancingv2.AmazonElasticLoadBalancingClient.doInvoke(AmazonElasticLoadBalancingClient.java:2715) ~[aws-java-sdk-elasticloadbalancingv2-1.11.534.jar:na]
Cloud Provider(s):
ECS
Environment:
all three: Kubernetes, debian local, local git
Feature Area (if this issue is UI/UX related, please tag @spinnaker/ui-ux-team):
Description:
Steps to Reproduce:
Additional Details:
About this issue
- State: closed
- Created 5 years ago
- Comments: 21 (6 by maintainers)
As I’ve been thinking about this, we at least need to make the caching of ECS task load balancer health checks more efficient. Currently the agent calls DescribeTargetHealth once for each task, which severely limits the rate at which task health can be cached. Instead, we should at least have a dedicated caching agent similar to AmazonApplicationLoadBalancerCachingAgent, which describes all targets in a target group (a single DescribeTargetHealth call) and caches every target health returned, rather than fetching target health one by one.
https://github.com/spinnaker/clouddriver/blob/d15d7f775c30056510aca016c712e1d68c16d51a/clouddriver-ecs/src/main/java/com/netflix/spinnaker/clouddriver/ecs/provider/agent/TaskHealthCachingAgent.java#L250
https://github.com/spinnaker/clouddriver/blob/d15d7f775c30056510aca016c712e1d68c16d51a/clouddriver-ecs/src/main/java/com/netflix/spinnaker/clouddriver/ecs/provider/agent/TaskHealthCachingAgent.java#L177
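To illustrate the difference in request volume between the two approaches described above, here is a minimal sketch. It does not use the AWS SDK or clouddriver code: `TargetHealthApi`, `CountingApi`, `cachePerTask`, and `cachePerTargetGroup` are hypothetical names invented for this example, and each call to `describeTargetHealth` stands in for one DescribeTargetHealth request.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class TargetHealthBatchSketch {

    /** Hypothetical stand-in for the ELBv2 client: one invocation models one DescribeTargetHealth request. */
    interface TargetHealthApi {
        /** Returns target ID -> health state for every target registered in the group. */
        Map<String, String> describeTargetHealth(String targetGroupArn);
    }

    /** In-memory fake that counts requests so the two strategies can be compared. */
    static final class CountingApi implements TargetHealthApi {
        private final Map<String, Map<String, String>> healthByGroup;
        int calls = 0;

        CountingApi(Map<String, Map<String, String>> healthByGroup) {
            this.healthByGroup = healthByGroup;
        }

        @Override
        public Map<String, String> describeTargetHealth(String targetGroupArn) {
            calls++;
            Map<String, String> health = healthByGroup.get(targetGroupArn);
            return health == null ? Collections.<String, String>emptyMap() : health;
        }
    }

    /** Per-task lookup: one request for every task, so request volume grows with the task count. */
    static Map<String, String> cachePerTask(TargetHealthApi api, Map<String, String> groupByTarget) {
        Map<String, String> cache = new HashMap<>();
        for (Map.Entry<String, String> entry : groupByTarget.entrySet()) {
            String state = api.describeTargetHealth(entry.getValue()).get(entry.getKey());
            if (state != null) {
                cache.put(entry.getKey(), state);
            }
        }
        return cache;
    }

    /** Batched lookup: one request per distinct target group, caching every target it returns. */
    static Map<String, String> cachePerTargetGroup(TargetHealthApi api, Map<String, String> groupByTarget) {
        Set<String> groups = new LinkedHashSet<>(groupByTarget.values());
        Map<String, String> cache = new HashMap<>();
        for (String group : groups) {
            cache.putAll(api.describeTargetHealth(group));
        }
        return cache;
    }
}
```

With this model, the number of requests in the batched variant depends on the number of target groups rather than the number of tasks, which is the property the proposed caching agent relies on. Real code would also need to handle pagination, errors, and retries, none of which is modeled here.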
@spinnakerbot remove-label stale
FYI, this fix ended up being more complicated to implement than we originally thought, so it will not make it into Spinnaker 1.18. We are still actively working on it.
FYI to test this out, you can update to the latest on the master branch with:
Closing this since https://github.com/spinnaker/clouddriver/pull/4274 and https://github.com/spinnaker/clouddriver/pull/4275 address the major elasticloadbalancing:describe-target-health bottleneck and will be available in 1.19.0.
If you use 1.19.0 or later and are still experiencing issues with throttling or large numbers of resources, please create a new issue and describe what APIs you’re having issues with (data like this is extremely helpful!) and how many tasks/services you have in your account so we can investigate.