external-dns: AWS provider triggers lots of "Throttling: Rate exceeded\n\tstatus code: 400"

This issue is related to #891 It has not been fixed by #966

AWS limits 5 api requests to Rout53 per account every second. Details here.

If hosted zone ids or domain names are not provided, AWS provider list records of all hosted zones. See code here.

For account with multiple hosted zones (or multiple EKS clustesr with external-dns), this results in Throttling: Rate exceeded\n\tstatus code: 400 every minute and dns changes do not propagate on schedule.

We can provide hosted zone ids or domain names to external-dns at startup, however that ties cluster infrastructure with services being deployed on the cluster. That’s not ideal.

This article describes how to deal with api request limits.

Few suggestions to fix this:

Do not list records from a hosted zone if none of the endpoints being managed are in hosted zone domain or subdomains
Look for this error and do exponential backoff

About this issue

Original URL
State: closed
Created 5 years ago
Reactions: 51
Comments: 33 (13 by maintainers)

Commits related to this issue

Add cache at provider level **Description** In the current implementation, DNS providers are called to list all records on every loop. This is expensive in terms of number of requests to the provide... — committed to tjamet/external-dns by deleted user a year ago
Add cache at provider level **Description** In the current implementation, DNS providers are called to list all records on every loop. This is expensive in terms of number of requests to the provide... — committed to tjamet/external-dns by deleted user a year ago

Most upvoted comments

For what it’s worth, I worked around the issue by using this external-dns Helm chart with these values: https://artifacthub.io/packages/helm/bitnami/external-dns

interval: 10m
triggerLoopOnEvent: true

The default interval is 1 minute, which is really short - it spams Route53, which gives us these rate exceeded messages. I set mine to 10 minutes, and then turned on triggerLoopOnEvent, so I don’t actually have to wait 10 minutes for it to create a record. It creates records immediately when the ingresses are created. I could probably turn the interval up to 60 minutes or higher at this point. I haven’t experimented more with it because I no longer have the problem…

evanstucker-hates-2fa on Mar 13, 2021

Hi! We are also suffering hardly this issue. On our side, we have deployed a temporary fork of external-dns with several improvements I am currently backporting and contributing to the main code.

We currently have opened #2009 and #2010 that, combined, help reduce the total amount of AWS calls by a factor of around 6 under specific circumstances.

Currently, the API call rate is only reduced by a factor of 3 if several instances of external-dns are run towards the same zone and different owner IDs. We are working to contribute the fixes for those as well. Either by the use of #1724 that should help in this area or other means.

tjamet on Mar 12, 2021

Just a quick request for an update, is there any estimated date that this will proceed.

cmsmith7 on Apr 30, 2021

We are also using the event flag, with a sync period of 10 minutes We have 8000+ objects monitored in 6 clusters, using the same zone.

With the current Minimum Sync Interval of 5 seconds, and the necessity of listing all records at least once during the sync, it raises the number of calls to the AWS API.

With this in mind, we are now running, in addition to the changes of #2009 and #2010, a change where the MinInterval has been set to a configurable 1 minute proposed in #2013, and few more improvements relying on #1724

All of those combined leads us to a significant decrease in API calls image (2)

And made our whole registry errors disappear:

tjamet on Mar 16, 2021

/remove-lifecycle stale Having this issue as well.

dlaidlaw on Mar 2, 2021