external-dns: Performance regression with the AWS provider in 0.5.9

Hello

We tried to update our deployment to 0.5.9 but noticed a significant performance regression caused by https://github.com/kubernetes-incubator/external-dns/pull/742. Since this PR, we call Records() for each change instead of once per plan (https://github.com/kubernetes-incubator/external-dns/blob/v0.5.9/provider/aws.go#L367), and each call retrieves all records for the zone.

A simple fix could be to store the result of the Records() call in the AWSProvider. I can PR this if you think it makes sense.
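To make the idea concrete, here is a rough sketch of caching the Records() result. It is not the actual AWSProvider code: Record, cachingProvider, the fetch field and Invalidate are all hypothetical names, and dropping the cache once a plan has been applied is an assumption about how staleness would be handled.

```go
package main

import (
	"fmt"
	"sync"
)

// Record is a hypothetical stand-in for the endpoint type returned by Records().
type Record struct {
	Name, Target string
}

// cachingProvider wraps a hypothetical fetch function that lists every record
// in the zone (the expensive call) and remembers its result.
type cachingProvider struct {
	mu      sync.Mutex
	fetch   func() ([]Record, error)
	records []Record
	valid   bool
}

// Records returns the cached records and calls fetch only on a cache miss.
func (p *cachingProvider) Records() ([]Record, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.valid {
		return p.records, nil
	}
	recs, err := p.fetch()
	if err != nil {
		return nil, err // errors are not cached
	}
	p.records, p.valid = recs, true
	return recs, nil
}

// Invalidate drops the cache, for example once a plan has been applied
// (assumption: the next sync should see fresh data).
func (p *cachingProvider) Invalidate() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.valid = false
}

func main() {
	calls := 0
	p := &cachingProvider{fetch: func() ([]Record, error) {
		calls++
		return []Record{{Name: "a.example.org", Target: "1.2.3.4"}}, nil
	}}
	for i := 0; i < 3; i++ { // three changes in the same plan
		if _, err := p.Records(); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println("fetches during plan:", calls)
	p.Invalidate()
	_, _ = p.Records()
	fmt.Println("fetches after invalidate:", calls)
}
```

With something like this, the number of full zone listings depends on how often the cache is invalidated rather than on the number of changes in a plan.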

In addition, the call to ListResourceRecordSetsPages in the Records() function is paginated, but the AWS Go SDK does not take rate limits into account. My understanding is that each page holds 100 records, which means that on a large zone we can quickly hit the rate limit of 5 requests per second (from https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/DNSLimitations.html#limits-api-requests).

We could retry and backoff on rate-limit errors but given how low the limit is and how simple the call pattern is, we could also simply sleep for 200ms (or 250ms to be safe, or even make it configurable) at the end of the callback.
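The sleep-based option could look roughly like the sketch below. It does not use the real SDK: page, pager, fakePager, collectPaced and recordingSleeper are hypothetical stand-ins, with pager only imitating the callback shape of a paginated call (invoke the function once per page, pass a last-page flag, stop when it returns false). The sketch also skips the pause after the last page, which is a small departure from sleeping unconditionally at the end of the callback.

```go
package main

import (
	"fmt"
	"time"
)

// defaultPageDelay keeps the request rate at or below 4 per second,
// under the 5 requests per second limit quoted above.
const defaultPageDelay = 250 * time.Millisecond

// page is a hypothetical stand-in for one page of a paginated listing.
type page struct {
	Names []string
}

// pager imitates the callback shape of a paginated call: it invokes fn once
// per page and stops early when fn returns false.
type pager func(fn func(p page, lastPage bool) bool) error

// collectPaced gathers all names and pauses between pages.
// sleep is injectable so the pacing can be checked without waiting.
func collectPaced(list pager, delay time.Duration, sleep func(time.Duration)) ([]string, error) {
	var names []string
	err := list(func(p page, lastPage bool) bool {
		names = append(names, p.Names...)
		if !lastPage {
			sleep(delay) // pace the next request
		}
		return true // keep paginating
	})
	return names, err
}

// fakePager serves pages from memory in place of a real API client.
func fakePager(pages []page) pager {
	return func(fn func(page, bool) bool) error {
		for i, p := range pages {
			if !fn(p, i == len(pages)-1) {
				break
			}
		}
		return nil
	}
}

// recordingSleeper records requested pauses instead of actually sleeping.
type recordingSleeper struct {
	calls int
	total time.Duration
}

func (r *recordingSleeper) Sleep(d time.Duration) {
	r.calls++
	r.total += d
}

func main() {
	pages := []page{
		{Names: []string{"a.example.org"}},
		{Names: []string{"b.example.org"}},
		{Names: []string{"c.example.org"}},
	}
	rs := &recordingSleeper{}
	// Pass time.Sleep instead of rs.Sleep to really wait between pages.
	names, err := collectPaced(fakePager(pages), defaultPageDelay, rs.Sleep)
	fmt.Println(names, err, rs.calls, rs.total)
}
```

Making the delay a parameter covers the "make it configurable" variant. A retry with backoff on throttling errors would be a separate change on top of this.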

Laurent

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 4
  • Comments: 17 (12 by maintainers)

Most upvoted comments

I think this is affecting us as well. However, we didn’t experience this until 0.5.10. Upon upgrading to 0.5.10, ALL domains in the cluster were updated. I saw this happen in both our development and staging clusters, but we didn’t run into any problems until I deployed it in staging. Now both clusters are logging "getting records failed: Throttling: Rate exceeded".

I’m not sure why any of the domains needed to be updated anyway, given that there were no changes other than upgrading from 0.5.9 to 0.5.10. In 0.5.9 we only really saw "All records are already up to date", as nothing had changed.

Update: downgrading to 0.5.9 resolves the issue I am experiencing. After the downgrade we are back to the "All records are already up to date" message instead of external-dns trying to update every record.