external-dns: Cloudflare provider causes repeated UPDATE

What happened: We’re using the Cloudflare provider, and we’re seeing it update all of the DNS records every minute. This is disruptive because when updating Cloudflare, the record is momentarily deleted; if something caches the NXDOMAIN that comes back from resolving the record during that period, it can be very annoying (the TTL on Cloudflare’s NXDOMAIN responses is not adjustable, nor is it particularly short).

What you expected to happen: We’d like it to update only when we change things 😃

How to reproduce it (as minimally and precisely as possible): Try to use the Cloudflare provider with External DNS 0.5.18 or higher (master works), and with it set to not proxy by default.

Anything else we need to know?: Yes. The problem is that the Cloudflare provider always returns its annotation, external-dns.alpha.kubernetes.io/cloudflare-proxied when reading records from Cloudflare. That’s great, but if you don’t specify that annotation on your ingresses (so, you’re using the default set in the Cloudflare provider), the desired state won’t have it but the records read from the provider do, so the planner thinks it needs to push an update.

This might be a problem for any other provider if they add an annotation with a “default” value.

Here’s what the planner saw:

(*plan.Plan)(0xc0002c4780)({
 Current: ([]*endpoint.Endpoint) (len=1 cap=4) {
  (*endpoint.Endpoint)(0xc0004b4cb0)(foo.example.com 1 IN A  10.1.2.3 [{external-dns.alpha.kubernetes.io/cloudflare-proxied false}]),
 },
 Desired: ([]*endpoint.Endpoint) (len=1 cap=4) {
  (*endpoint.Endpoint)(0xc000384800)(foo.example.com 1 IN A  10.1.2.3 []),
 },
 Policies: ([]plan.Policy) <nil>,
 Changes: (*plan.Changes)(0xc0006143c0)({
  Create: ([]*endpoint.Endpoint) <nil>,
  UpdateOld: ([]*endpoint.Endpoint) (len=1 cap=4) {
   (*endpoint.Endpoint)(0xc0004b4cb0)(foo.example.com 1 IN A  10.1.2.3 [{external-dns.alpha.kubernetes.io/cloudflare-proxied false}]),
  },
  UpdateNew: ([]*endpoint.Endpoint) (len=1 cap=4) {
   (*endpoint.Endpoint)(0xc000384800)(foo.example.com 1 IN A  10.1.2.3 []),
  },
  Delete: ([]*endpoint.Endpoint) <nil>,
 }
})

It then pushed an update, very reasonably, but of course that didn’t actually change anything, so the next time External DNS checked, it was in the exact same state.
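
For anyone trying to picture the comparison that goes wrong here, the following is a minimal sketch in Go, not the actual external-dns code and not necessarily what PR #1464 does. The types `Endpoint` and `Property` and the functions `naiveNeedsUpdate`, `effectiveProperties` and `needsUpdate` are hypothetical names made up for illustration; only the annotation name and the record values come from the dump above. It assumes the provider default is "not proxied", as in our setup.

```go
package main

import "fmt"

// Property is a hypothetical stand-in for the name/value pairs shown in
// square brackets in the plan dump.
type Property struct {
	Name  string
	Value string
}

// Endpoint is a hypothetical, trimmed-down record: just enough fields to
// show the comparison problem.
type Endpoint struct {
	DNSName          string
	Target           string
	ProviderSpecific []Property
}

// Annotation name as it appears in the plan dump.
const proxiedKey = "external-dns.alpha.kubernetes.io/cloudflare-proxied"

// naiveNeedsUpdate compares the property lists literally, so a missing
// property never equals an explicit one, even when the explicit value is
// the default.
func naiveNeedsUpdate(current, desired Endpoint) bool {
	if len(current.ProviderSpecific) != len(desired.ProviderSpecific) {
		return true
	}
	for i, p := range current.ProviderSpecific {
		if desired.ProviderSpecific[i] != p {
			return true
		}
	}
	return false
}

// effectiveProperties fills in the assumed defaults for any property that
// was not set explicitly.
func effectiveProperties(props []Property, defaults map[string]string) map[string]string {
	out := make(map[string]string, len(defaults)+len(props))
	for k, v := range defaults {
		out[k] = v
	}
	for _, p := range props {
		out[p.Name] = p.Value
	}
	return out
}

// needsUpdate compares the effective values, so "unset" and "explicitly
// set to the default" count as the same state.
func needsUpdate(current, desired Endpoint, defaults map[string]string) bool {
	cur := effectiveProperties(current.ProviderSpecific, defaults)
	des := effectiveProperties(desired.ProviderSpecific, defaults)
	if len(cur) != len(des) {
		return true
	}
	for k, v := range cur {
		if dv, ok := des[k]; !ok || dv != v {
			return true
		}
	}
	return false
}

func main() {
	// Assumption: the provider is configured to not proxy by default.
	defaults := map[string]string{proxiedKey: "false"}

	// What was read back from the provider (Current in the dump).
	current := Endpoint{
		DNSName:          "foo.example.com",
		Target:           "10.1.2.3",
		ProviderSpecific: []Property{{Name: proxiedKey, Value: "false"}},
	}
	// What the ingress asks for, with no annotation (Desired in the dump).
	desired := Endpoint{DNSName: "foo.example.com", Target: "10.1.2.3"}

	fmt.Println("naive comparison wants update:", naiveNeedsUpdate(current, desired))
	fmt.Println("default-aware comparison wants update:", needsUpdate(current, desired, defaults))
}
```

With the two endpoints from the dump, the literal comparison reports a difference on every pass, while the default-aware one reports none. An explicit annotation that differs from the default still shows up as a change in both.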

Environment:

  • External-DNS version (use external-dns --version): 0.5.18 or above
  • DNS provider: Cloudflare
  • Others: n/a

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 18
  • Comments: 62 (17 by maintainers)

Most upvoted comments

We’re using a private build of external-dns with my patch, to avoid this problem. I’m just doing another build and I’m going to push the result to Docker Hub so that others can use it. My PR #1464, which does fix this problem, has still not been reviewed by @linki or @njuettner, two months after I originally submitted it, and I’ve just had to rebase it again.

The fixed image is at al45tair/external-dns:fix-cloudflare-attr-update, if anyone else wants to try it.

The workaround for now is to use 0.5.17 which doesn’t have the bug.

@ninja- good to know I’m not the only one! I’ve had some really really strange behaviors going on and am currently working with the CF support on this very issue from their side as well.

@hayderimran7 it’s a drop-in replacement for the normal image; no need to change anything.

@al45tair This was not a blip or sporadic outage. I first knew there was a problem because I had “freshping” checking for outages. They reported the site down at 4:07AM UTC and it remained continuously unreachable to them (with checks every minute) for over 7 hours*, until reporting that the site was back up a few minutes after I had gotten around to fixing the issue via the “toggle proxy” workaround. So whatever broke the A records on Cloudflare’s end was definitely permanent until I finally went in and got it unwedged manually.

* Again, personal site, so wasn’t in any particular rush - good thing I’m not using this for anything serious!

Upsert doesn’t always eliminate the issue, FYI. I found it just happens less often.

On Thursday, April 9, 2020, 4:30 AM, Markus Mahlberg notifications@github.com wrote:

It would be nice to know when the PR will be merged, because there are even more grave ramifications: one of the major reasons to use cloudflare is its DDoS protection. With the way it is now, you can either set up the policy to upsert-only (with which everything works as expected) OR you need to disable proxying the respective IP address altogether. Frankly that is a situation I would call less than ideal, either way.

Had a site go completely offline due to this. The incessant updates from external-dns eventually resulted in Cloudflare dropping all A records for the domain, and I needed to manually go in and kick the entries in the Cloudflare zone configuration to get them back. Good thing it was only a personal site!

I have stopped using external-dns for now because it’s no longer trustworthy. It had one job.

I’m also seeing this issue. In addition, when I added multiple different records, it changed the action to CREATE, and the ones that were in UPDATE status were removed from Cloudflare.

This is a very big problem. The repeated updating of records ultimately leads to bugging out the Cloudflare API, and the records are no longer visible over DNS even though they are visible in the Cloudflare control panel! To fix it you need to, for example, switch the proxied status on and off in the control panel.

Super annoying. Had to remove my external-dns instance and only run it for a small period of time when I am changing something in my setup.

As a ramification of this (I believe), I’ve been seeing situations where the record is actually in Cloudflare but in a bad state that also results in an NXDOMAIN. Apparently they’ve seen this issue before from the API, and I suspect it has something to do with a serial number for the record, but I’m not entirely sure.

If you log in to the UI, you see this:

Invalid DNS record identifier (Code: 1032)

I believe the problem is outside the scope of this issue, but I’m seeing it proportionally more due to the constant updates (only a handful of records in a test cluster and I’m regularly seeing NXDOMAIN when the record does actually exist in cloudflare).