spinnaker: TaskHealthCachingAgent failing for ECS for large number of ECS tasks

Issue Summary:

We get the following error when the ECS provider account has a large number of tasks running. In this case we have 7500+ tasks spread across 16 different ECS clusters.

2019-06-06 15:18:32.740  WARN 9929 --- [cutionAction-73] c.n.s.c.cache.LoggingInstrumentation     : com.netflix.spinnaker.clouddriver.ecs.provider.EcsProvider:ecs-production/ap-south-1/TaskHealthCachingAgent completed with one or more failures

com.amazonaws.services.elasticloadbalancingv2.model.AmazonElasticLoadBalancingException: Rate exceeded (Service: AmazonElasticLoadBalancing; Status Code: 400; Error Code: Throttling; Request ID: 58aec762-886e-11e9-8c59-95465c457f75)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712) ~[aws-java-sdk-core-1.11.534.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367) ~[aws-java-sdk-core-1.11.534.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113) ~[aws-java-sdk-core-1.11.534.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770) ~[aws-java-sdk-core-1.11.534.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744) ~[aws-java-sdk-core-1.11.534.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726) ~[aws-java-sdk-core-1.11.534.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686) ~[aws-java-sdk-core-1.11.534.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668) ~[aws-java-sdk-core-1.11.534.jar:na]
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532) ~[aws-java-sdk-core-1.11.534.jar:na]
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512) ~[aws-java-sdk-core-1.11.534.jar:na]
    at com.amazonaws.services.elasticloadbalancingv2.AmazonElasticLoadBalancingClient.doInvoke(AmazonElasticLoadBalancingClient.java:2715) ~[aws-java-sdk-elasticloadbalancingv2-1.11.534.jar:na]

Cloud Provider(s):

ECS

Environment:

all three: Kubernetes, debian local, local git

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 21 (6 by maintainers)

Most upvoted comments

As I’ve been thinking about this, we at least need to make the caching logic for ECS task LB health checks more efficient. Currently it calls DescribeTargetHealth once per task, which severely limits the rate at which task health is cached. Instead, we should have a dedicated caching agent similar to AmazonApplicationLoadBalancerCachingAgent that describes all targets for a target group (a single DescribeTargetHealth call) and caches all the target healths returned, instead of getting target health one by one.

https://github.com/spinnaker/clouddriver/blob/d15d7f775c30056510aca016c712e1d68c16d51a/clouddriver-ecs/src/main/java/com/netflix/spinnaker/clouddriver/ecs/provider/agent/TaskHealthCachingAgent.java#L250

https://github.com/spinnaker/clouddriver/blob/d15d7f775c30056510aca016c712e1d68c16d51a/clouddriver-ecs/src/main/java/com/netflix/spinnaker/clouddriver/ecs/provider/agent/TaskHealthCachingAgent.java#L177
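
To illustrate the batching idea above, here is a minimal sketch, not the actual clouddriver change. It assumes a hypothetical describeTargetHealth function that stands in for a DescribeTargetHealth call made with only a target group ARN (which returns the health of every registered target in that group). The class name, the TargetHealth type, and the lookup helper are all made up for this example. The point is that the number of API calls scales with the number of target groups rather than the number of tasks.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: none of these names exist in clouddriver or the AWS SDK.
public class TargetHealthBatchSketch {

  /** Simplified stand-in for one target's health entry in a DescribeTargetHealth response. */
  public static final class TargetHealth {
    final String targetId;
    final String state;

    public TargetHealth(String targetId, String state) {
      this.targetId = targetId;
      this.state = state;
    }
  }

  /**
   * Builds targetGroupArn -> (targetId -> state), invoking describeTargetHealth
   * exactly once per target group instead of once per task.
   */
  public static Map<String, Map<String, String>> cacheTargetHealth(
      List<String> targetGroupArns,
      Function<String, List<TargetHealth>> describeTargetHealth) {
    Map<String, Map<String, String>> cache = new HashMap<>();
    for (String arn : targetGroupArns) {
      Map<String, String> byTarget = new HashMap<>();
      // One call returns every registered target in this group.
      for (TargetHealth health : describeTargetHealth.apply(arn)) {
        byTarget.put(health.targetId, health.state);
      }
      cache.put(arn, byTarget);
    }
    return cache;
  }

  /** Per-task lookup served from the cache; returns "unknown" if the target is not cached. */
  public static String lookup(
      Map<String, Map<String, String>> cache, String targetGroupArn, String targetId) {
    Map<String, String> byTarget = cache.get(targetGroupArn);
    if (byTarget == null) {
      return "unknown";
    }
    return byTarget.getOrDefault(targetId, "unknown");
  }
}
```

With something like this, the per-task health check becomes an in-memory lookup keyed by target group and target ID, so 7500+ tasks behind a handful of target groups would need only a handful of API calls per caching cycle.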

@spinnakerbot remove-label stale

FYI, this fix ended up being more complicated to implement than we originally thought, so it will not make it into Spinnaker 1.18. We are still actively working on it.

FYI to test this out, you can update to the latest on the master branch with:

hal config version edit --version master-latest-unvalidated
hal deploy apply

Closing this since https://github.com/spinnaker/clouddriver/pull/4274 and https://github.com/spinnaker/clouddriver/pull/4275 address the major elasticloadbalancing:describe-target-health bottleneck and will be available in 1.19.0.

If you use 1.19.0 or later and are still experiencing issues with throttling or large numbers of resources, please create a new issue and describe what APIs you’re having issues with (data like this is extremely helpful!) and how many tasks/services you have in your account so we can investigate.