external-dns: Some new TXT records are not being cleaned up, causing an "InvalidChangeBatch" error

What happened:

After deleting some ingress resources, it seems that the new TXT record is not being cleaned up, but the other two DNS entries (the A record and the legacy TXT record) are being cleaned up. When searching for DNS records in AWS53, this is what we see:

Searching for <our-dns-name>.

[]

Searching for a-<our-dns-name>.

    {
        "Name": "a-<our-dns-name>.",
        "Type": "TXT",
        "TTL": 300,
        "ResourceRecords": [
            {
                "Value": "\"heritage=external-dns,external-dns/owner=<our-owner-string>,external-dns/resource=ingress/<our-ingress>\""
            }
        ]
    }

This later on causes an issue when we redeploy the application, as external-dns tries to create those three DNS entries (A record, legacy TXT and new TXT):

time="2022-11-23T14:06:14Z" level=info msg="Desired change: CREATE a-<our-dns-name> TXT [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=info msg="Desired change: CREATE <our-dns-name> A [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=info msg="Desired change: CREATE <our-dns-name> TXT [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=error msg="Failure in zone <our-dns-zone>. [Id: /hostedzone/<redacted>]"
time="2022-11-23T14:06:14Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='a-<our-dns-name>.', type='TXT'] but it already exists]\n\tstatus code: 400, request id: d65bc8e2-4055-4d9f-8412-4653debd76ff"

What you expected to happen:

The new TXT record should be cleaned up in the first place, or maybe we could also replace the TXT record if it already exists, or have an option to do so.

How to reproduce it (as minimally and precisely as possible):

I do not know how to reproduce this issue easily, but I’m more than happy to provide as much debugging info as needed.

Anything else we need to know?:

N/A

Environment:

  • External-DNS version (use external-dns --version): 0.13.1
  • DNS provider: AWS
  • Others:

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Reactions: 19
  • Comments: 15

Most upvoted comments

I have faced the same issue after updating the external-dns version from 0.12.0 to 0.13.1. And instead of syncing with previously created TXT record graylog.<domain> it tries to create cname-graylog.<domain> and it fails with output below:

time="2022-12-14T11:49:54Z" level=error msg="InvalidChangeBatch: [The request contains an invalid set of changes for a resource record set 'TXT cname-graylog.<domain>.', The request contains an invalid set of changes for a resource record set 'TXT cname-mongodb.<domain>.', The request contains an invalid set of changes for a resource record set 'TXT cname-tcp.graylog.<domain>.']\n\tstatus code: 400, request id: <Id>"
time="2022-12-14T11:49:54Z" level=info msg="Desired change: CREATE cname-graylog.<domain> TXT [Id: /hostedzone/<hostedzone>]"
...

We’re facing the same issue with v0.13.2 and the suggested batch size changes do not work:

  • With --aws-batch-change-size=1: it tries to create the already existing TXT record, which fails. It does not even attempt to create the A record, presumably because the first batch change within the sync interval failed. This does not resolve itself eventually and continues like this in every sync interval.
  • With --aws-batch-change-size=2: it tries to create the A record and the already existing TXT record in a batch and this fails. Same behaviour as above, it’s stuck.

The only option we have is to either manually create the A record, or to delete the existing TXT records so that external-dns can properly recreate everything.

The expected behaviour would be to not attempt to create the TXT records again (if anything, it should upsert existing records).

Update: from what I can see, there’s already a change in master which might partially fix this (https://github.com/kubernetes-sigs/external-dns/commit/7dd84a589d4725ccf25d94f8d71b0146fee4bfcc), but it’s still unreleased.

Same problem as well. I wish there was a “force-overwrite” options where we could just tell external-dns to overwrite records; we have multiple clusters who have this error and are seemingly stuck. The worse part is good, new ingresses never have their DNS records created since they get batched up with these bogus retries.

I’m having the same problem with version 0.13.1 and --aws-batch-change-size=100. I tried --aws-batch-change-size=1 and started to get warnings like

time="2023-02-08T15:32:34Z" level=warning msg="Total changes for xxx.yyy.zzz exceeds max batch size of 1, total changes: 2"

and the errors as described above kept coming.

So I tried --aws-batch-change-size=2 and that has actually resolved the problem for me.