driftctl: ThrottlingException: Rate exceeded

Hi team, do you know how we can avoid the rate exceeded error?

Scanned states (7)      
ThrottlingException: Rate exceeded
        status code: 400, request id: 0474b16c-faee-402a-bf01-1e2a7c005714

About this issue

State: open
Created 2 years ago
Reactions: 15
Comments: 19 (6 by maintainers)

Most upvoted comments

I think total run time matters less than completing the scan, so backoff/retry is a good thing. We run driftctl on an EKS cluster, with a Python wrapper, as a pod launched by a cronjob. If a run takes an hour, that does not matter when you only run it every 12 hours. The wrapper compares the driftctl JSON output to the expected output and emails us if there are differences. We then have a stern talk with the AWS console users who made changes without Terraform. I mostly care about IAM and security group changes, and if you limit driftctl to those, it generally does not hit the API rate limit.

Best, F.

brunzefb on Jun 13, 2022
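
For anyone who wants to try the compare-to-expected-output idea from the comment above, here is a minimal sketch. It is written in Go rather than Python and is not the commenter's wrapper. It treats both files as generic JSON and assumes nothing about the driftctl output schema; the sample documents are placeholders, and `sameJSON` is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// sameJSON reports whether two JSON documents are semantically equal.
// Whitespace and object key order are ignored; array order is not.
func sameJSON(a, b []byte) (bool, error) {
	var x, y interface{}
	if err := json.Unmarshal(a, &x); err != nil {
		return false, fmt.Errorf("parsing first document: %w", err)
	}
	if err := json.Unmarshal(b, &y); err != nil {
		return false, fmt.Errorf("parsing second document: %w", err)
	}
	return reflect.DeepEqual(x, y), nil
}

func main() {
	// Placeholder documents, not real driftctl output.
	expected := []byte(`{"a": 1, "b": [1, 2]}`)
	actual := []byte(`{"b": [1, 2], "a": 1}`)

	same, err := sameJSON(expected, actual)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("unchanged:", same) // prints "unchanged: true"
}
```

Because array order counts as a difference here, a real wrapper would probably need to sort or normalize any lists before comparing.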

I was able to get past this error by implementing exponential backoff in the repository that was triggering the throttle exception. In my case, I was hitting API Gateway limits. According to https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html, API Gateway allows 5 requests every 2 seconds per account for GetResources, and I was hitting that limit pretty frequently. There is also a limit of 10 requests per second across all API Gateway management operations. To work around those limits, I added code to the api_gateway_repository.go file that backs off exponentially when a request returns a “TooManyRequestsException” error. I set the base delay to 2 seconds since that was the window of the limit we were hitting. I also had to add this logic to every function that makes a request to API Gateway, since any of them can trigger the overall operation limit (e.g. GetRestApisPages reaches the limit, then a call to, say, GetAccount triggers the throttle). Here is the logic for GetRestApisPages as an example.

  const MaxRetries = 5

  // err holds the result of the initial GetRestApisPages call.
  if err != nil {
	  retry := true

	  for retries := 0; retry && retries < MaxRetries; retries++ {
		  // Exponential backoff: 2s, 4s, 8s, 16s, 32s.
		  sleepTime := time.Duration(math.Pow(2, float64(retries))) * 2 * time.Second
		  logrus.Warn("Error caught during GetRestApisPages! Attempt number ", retries+1, "/", MaxRetries, ". Retrying after sleeping for ", sleepTime, "...")
		  time.Sleep(sleepTime)
		  logrus.Debug("Awake! Attempting to make GetRestApisPages call again.")

		  // Drop pages collected by the failed attempt so a retry does not duplicate them.
		  restApis = restApis[:0]
		  err = r.client.GetRestApisPages(&input,
			  func(resp *apigateway.GetRestApisOutput, lastPage bool) bool {
				  restApis = append(restApis, resp.Items...)
				  return !lastPage
			  },
		  )

		  // Keep retrying only while API Gateway is still throttling us.
		  retry = err != nil && strings.Contains(err.Error(), "TooManyRequestsException")
	  }
  }

To reduce duplicate code, I implemented a function


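// retryOnFailure retries callback with exponential backoff while it keeps
// failing with a TooManyRequestsException. Call it only after a first attempt
// has already failed: it sleeps before every call, including the first one.
func retryOnFailure(callback func() error, message string) error {
	var err error
	retry := true

	for retries := 0; retry && retries < MaxRetries; retries++ {
		// Exponential backoff: 2s, 4s, 8s, 16s, 32s.
		sleepTime := time.Duration(math.Pow(2, float64(retries))) * 2 * time.Second
		logrus.Warn(message, " Attempt number ", retries+1, "/", MaxRetries, ". Retrying after sleeping for ", sleepTime, "...")
		time.Sleep(sleepTime)
		logrus.Debug("Awake! Attempting to make API call again.")

		err = callback()
		// Keep retrying only while the API is still throttling us.
		retry = err != nil && strings.Contains(err.Error(), "TooManyRequestsException")
	}
	return err
}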

and now I can check for error on the first call, then go into exponential backoff if there was an error

if err != nil {
	err = retryOnFailure(func() error {
		logrus.Debug("Making a call to get rest APIs not found in cache")
		// Drop pages collected by a failed attempt so a retry does not duplicate them.
		restApis = restApis[:0]
		return r.client.GetRestApisPages(&input,
			func(resp *apigateway.GetRestApisOutput, lastPage bool) bool {
				restApis = append(restApis, resp.Items...)
				return !lastPage
			},
		)
	}, "Error caught during GetRestApisPages!")
}

I’m happy to contribute this code to the project if everyone thinks it will be helpful. This logic should probably be implemented in other places/repositories too…

drem-darios on Mar 8, 2023

Very valuable feedback @gmaghera thanks 🙏🏻

We are very sorry that we could not share any status update on that 😟 We are currently in a complicated situation regarding driftctl: the company behind it (cloudskiff) was acquired a year ago, and our focus is currently not on actively improving driftctl. Unfortunately, we also made some changes that put driftctl in a state where it’s kinda complicated for newcomers to work on, so handing that issue to the community does not sound like a decent option.

We’ll keep you updated on that as soon as we can 🙏🏻

eliecharra on Oct 18, 2022

Using cpulimit may help to slow the rate of API calls down by throttling the CPU usage of the app as a whole.

Example, limit CPU usage to 25%:

cpulimit -l 25 driftctl scan --from tfstate://*.tfstate

I think using cgroups would work better than this, at the cost of being a bit more involved to set up.

Just playing around with cpulimit a little bit: limiting to about 5% on my machine doubles the scan time and stops the throttle errors. Going any lower (like all the way to 1%) causes some AWS authentication errors to start being thrown, presumably because the app doesn’t respond quickly enough for some of the handshakes or API flows.

So it seems like there is a sweet spot with something as simple as cpulimit to help with this – at least for my machine anyway.

johnalotoski on Feb 3, 2023

Retry will address this issue. When we encounter a rate limit error, we’ll run an exponential backoff retry loop, so requests will be postponed and the scan will take longer but will no longer be interrupted. @moadibfr is working on that, but we are also currently splitting the enumeration out of driftctl into a separate Go module for a better separation of concerns, so it’ll take time for the retry-on-rate-limit mechanism to be implemented.

Would it be possible to break what driftctl does into batches?

That sounds complicated, because the goal of driftctl is to enumerate resources, and you cannot batch a list you do not have yet. We could think of another batching logic, by resource type for example. You can achieve this manually with the driftignore file; see my answer above.

We are aware that this is a major pain point for many of you, and this rate limit issue is definitely on our plate 🙏🏻

eliecharra on Jun 7, 2022
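
The manual per-resource-type batching described above could be scripted along these lines. This is only a sketch: it builds one driftctl command line per resource type using the filter flag mentioned elsewhere in this thread, and prints the commands instead of running them. The `Type=='...'` filter expression, the state path and the resource types are assumptions to check against the scan usage docs, and `scanArgs` is a hypothetical helper.

```go
package main

import "fmt"

// scanArgs builds the arguments for one driftctl run limited to a single
// resource type. The filter expression syntax is an assumption; verify it
// against the driftctl scan documentation before relying on it.
func scanArgs(from, resourceType string) []string {
	return []string{
		"scan",
		"--from", from,
		"--filter", fmt.Sprintf("Type=='%s'", resourceType),
	}
}

func main() {
	// Hypothetical state location and resource types; adjust to your setup.
	from := "tfstate://terraform.tfstate"
	types := []string{"aws_iam_role", "aws_security_group"}

	for _, t := range types {
		// Printed for inspection only. To run each batch, the slice could be
		// passed to exec.Command("driftctl", args...) from os/exec.
		fmt.Printf("driftctl %q\n", scanArgs(from, t))
	}
}
```

Running the batches one after another, possibly with a pause in between, spreads the API calls over time. Whether a filter actually reduces the number of calls made depends on how driftctl applies it, so treat this as something to measure rather than a guaranteed fix.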

Although the main issue is definitely not solved, there are a couple of helpful flags here that you can use to limit the scope you want to monitor in order to avoid the rate limit exception.

https://docs.driftctl.com/next/usage/cmd/scan-usage/

bshramin on Feb 20, 2023

@sundowndev Just ran into the throttling issue with driftctl as well. I opened a support case with AWS to increase the allowed API rate. They told me there would be too many rate limits to increase, and suggested asking the authors to implement exponential backoff when AWS calls hit the throttling exception. While this may hurt the performance of the tool, maybe that does not matter so much, especially if you are running it as a cron job once a day.

From AWS Support:

We would generally suggest that API calls should be made with a retry and exponential backoff in order to gracefully handle throttling when it occurs [2]. When narrowing down to calls from your IAM user around the reported times, I see a very aggressive call rate which suggests to me that this tool is not implementing such a backoff and retry strategy, or if it is, it is not retrying enough, or is not backing off enough. This strategy should work well with supported providers.

brunzefb on Apr 25, 2022

Hi @Arisfx, the rate limit error can occur when your cloud account has a large number of resources, even ones not managed by Terraform. This is one of the known limitations of driftctl. Are you running in deep mode? If so, could you consider running in non-deep mode instead? Note that driftctl will then no longer be able to show drift in attributes. If you’ve identified which resource(s) are causing this, you can try to ignore that particular resource type using the driftignore file or the filter flag. If none of those solutions fits your needs, could you give further details on your use case?

Thanks 🙏🏻

sundowndev on Feb 16, 2022