driftctl: ThrottlingException: Rate exceeded
Hi team, do you know how we can avoid the rate exceeded error?
Scanned states (7)
ThrottlingException: Rate exceeded
status code: 400, request id: 0474b16c-faee-402a-bf01-1e2a7c005714
About this issue
- State: open
- Created 2 years ago
- Reactions: 15
- Comments: 19 (6 by maintainers)
I think how long the program runs is less important, so backoff/retry is a good thing. We are running driftctl on an EKS cluster with a Python wrapper, as a pod launched by a cronjob. If you run it every 12h, it does not matter if it takes an hour to run. The wrapper compares the driftctl JSON output to the expected output and sends an email if there are diffs. We then have a stern talk with the AWS console users who did not use Terraform to make the changes. I mostly care about IAM and security group changes, and if you limit driftctl to those, it is generally not API rate limited.
Best, F.
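The comparison step described above can be sketched in a few lines. The poster's wrapper is written in Python and is not shown in the thread; this is only a minimal Go illustration of the same idea, comparing two JSON documents structurally so that whitespace and key order do not register as drift. The function name `sameJSON` and the sample documents are hypothetical and do not reflect driftctl's real output schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// sameJSON reports whether two JSON documents decode to the same value,
// ignoring whitespace and object key order. (Hypothetical helper.)
func sameJSON(a, b []byte) (bool, error) {
	var x, y interface{}
	if err := json.Unmarshal(a, &x); err != nil {
		return false, fmt.Errorf("decoding first document: %w", err)
	}
	if err := json.Unmarshal(b, &y); err != nil {
		return false, fmt.Errorf("decoding second document: %w", err)
	}
	return reflect.DeepEqual(x, y), nil
}

func main() {
	// Placeholder documents standing in for the expected baseline and a
	// fresh scan result; the keys are made up for illustration.
	expected := []byte(`{"resources": ["sg-1", "role-a"], "count": 2}`)
	actual := []byte(`{"count": 2, "resources": ["sg-1", "role-a"]}`)

	same, err := sameJSON(expected, actual)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if !same {
		fmt.Println("diff detected, notify the team")
		return
	}
	fmt.Println("no diff")
}
```

Array order still matters in this sketch, so a scan that lists the same resources in a different order would be reported as a diff; sorting the lists before comparing is one way around that.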
I was able to get past this error by implementing exponential backoff in the repository that was triggering the throttle exception. In my case, it was API Gateway limits I was hitting. You can see here: https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html that API Gateway allows 5 requests every 2 seconds per account for GetResources, and I was hitting that limit pretty frequently. There is also a limit of 10 requests per second across all API Gateway management operations. To work around those limits, I added some code to the api_gateway_repository.go file that backs off exponentially whenever we receive a “TooManyRequestsException” error. I set the bar at 2 seconds since that was the limit we were hitting. I also had to add this logic to every function making a request to API Gateway, since any of them could trigger the total operation limit (e.g. GetRestApisPage reaches the limit, then a call to, say, GetAccount triggers the throttle). GetRestApisPages is one example of a function that got this logic.
To reduce duplicate code, I implemented a helper function, so now I can check for an error on the first call and go into exponential backoff if there was one.
I’m happy to contribute this code to the project if everyone thinks it will be helpful. This logic should probably be implemented in other places/repositories too…
Very valuable feedback @gmaghera thanks 🙏🏻
We are very sorry that we could not share any status update on that 😟 We are currently in a complicated situation regarding driftctl: the company behind it (cloudskiff) was acquired a year ago, and our focus is currently not on actively improving driftctl. Unfortunately, we also made some changes that put driftctl in a state where it’s kinda complicated for newcomers to work on, so handing that issue to the community does not sound like a decent option.
We’ll keep you updated on that as soon as we can 🙏🏻
Using cpulimit may help to slow the rate of API calls down by throttling the CPU usage of the app as a whole.
Example, limit CPU usage to 25%:
Using cgroups would work better than this I think at the cost of being a bit more involved to apply.
Just playing around with cpulimit a little bit, limiting to about 5% on my machine doubles the scan time and stops the throttle errors. Going any lower (like all the way to 1%) causes some AWS authentication errors to start being thrown, presumably because the app doesn’t respond quickly enough for some of the handshakes or API flows.
So it seems like there is a sweet spot with something as simple as cpulimit to help with this – at least for my machine anyway.
Retry will address this issue. When we encounter a rate limit error, we’ll run an exponential backoff retry loop, so requests will be postponed and the scan will take longer but will no longer be interrupted. @moadibfr is working on that, but we are also currently splitting the enumeration out of driftctl into a separate Go module for a better separation of concerns, so it’ll take time for the retry-on-rate-limit mechanism to be implemented.
That sounds complicated because the goal of driftctl is to enumerate resources, so you cannot batch a list if you do not have the list yet. We can think of another batching logic, by resource type for example. You can achieve this manually with the driftignore file; see my answer above.
We are aware that this is a major pain point for many of you, and this rate limit issue is definitely on our plate 🙏🏻
Although the main issue is definitely not solved, there are a couple of helpful flags here that you can use to limit the scope you want to monitor in order to avoid the rate limit exception.
https://docs.driftctl.com/next/usage/cmd/scan-usage/
@sundowndev Just ran into the throttling issue as well with driftctl. I opened a support case with AWS to increase the allowed API rate. They told me there would be too many rate limits to increase, and to ask the authors to implement exponential backoff for AWS calls that hit the throttling exception. While this may kill the performance of the tool, maybe that does not matter so much, especially if you are running it as a cron job once a day.
We would generally suggest that API calls should be made with a retry and exponential backoff in order to gracefully handle throttling when it occurs [2]. When narrowing down to calls from your IAM user around the reported times, I see a very aggressive call rate which suggests to me that this tool is not implementing such a backoff and retry strategy, or if it is, it is not retrying enough, or is not backing off enough. This strategy should work well with supported providers.
Hi @Arisfx, the rate limit error can occur when your cloud account has a large number of resources, even ones not managed by Terraform. This is one of the known limitations of driftctl. Are you running in deep mode? If so, could you consider running in non-deep mode instead? Note that driftctl will then no longer be able to show drifts in attributes. If you’ve identified which resource(s) are causing this, you can try to ignore that particular resource type using the driftignore file or the filter flag. If none of those solutions fits your needs, could you give further details on your use case?
Thanks 🙏🏻