lego: Possible bad propagation check with dns-01 challenge

Welcome

  • Yes, I’m using a binary release within 2 latest releases.
  • Yes, I’ve searched similar issues on GitHub and didn’t find any.
  • Yes, I’ve included all information below (version, config, etc).

What did you expect to see?

A successful certificate generation.

What did you see instead?

We got this error message from Let’s Encrypt:

2022/11/30 10:24:59 error: one or more domains had a problem: [cloud.syseleven.de] acme: error: 400 :: urn:ietf:params:acme:error:dns :: DNS problem: SERVFAIL looking up TXT for _acme-challenge.cloud.syseleven.de - the domain's nameservers may be malfunctioning

However, the propagation check with 2 name servers apparently passed:

2022/11/30 10:23:40 [INFO] [cloud.syseleven.de] acme: Could not find solver for: tls-alpn-01
2022/11/30 10:23:40 [INFO] [cloud.syseleven.de] acme: Could not find solver for: http-01
2022/11/30 10:23:40 [INFO] [cloud.syseleven.de] acme: use dns-01 solver
2022/11/30 10:23:40 [INFO] [cloud.syseleven.de] acme: Preparing to solve DNS-01
2022/11/30 10:23:50 [INFO] [cloud.syseleven.de] acme: Trying to solve DNS-01
2022/11/30 10:24:00 [INFO] [cloud.syseleven.de] acme: Checking DNS record propagation using [8.8.8.8:53 4.4.4.4:53]
2022/11/30 10:24:10 [INFO] Wait for propagation [timeout: 10m0s, interval: 10s]
2022/11/30 10:24:10 [INFO] [cloud.syseleven.de] acme: Waiting for DNS record propagation.
2022/11/30 10:24:20 [INFO] [cloud.syseleven.de] acme: Waiting for DNS record propagation.
2022/11/30 10:24:30 [INFO] [cloud.syseleven.de] acme: Waiting for DNS record propagation.
2022/11/30 10:24:47 [INFO] [cloud.syseleven.de] acme: Cleaning DNS-01 challenge

With a successful result, before the last line, there would have been a message like [cloud.syseleven.de] The server validated our request.

Note that we used 4.4.4.4, which was a working public name server in the past but apparently no longer is. It must have stopped working relatively recently; we discovered this while debugging the failure. As of now, it does not respond to any query.
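
One way to tell a silent resolver from a slow one is to query it directly with a short timeout. A minimal sketch, assuming `dig` is installed; the function names `describe_dig_exit` and `probe_resolver` are made up for illustration, and the exit-code meanings are the ones listed in the dig man page:

```shell
#!/bin/sh
# Map dig's documented exit codes to a short description.
describe_dig_exit() {
    case "$1" in
        0) echo "got a DNS response" ;;      # includes NXDOMAIN and SERVFAIL answers
        9) echo "no reply from server" ;;
        *) echo "dig failed (exit code $1)" ;;
    esac
}

# Ask one resolver for a TXT record, giving up after 3 seconds and one try.
# Usage: probe_resolver RESOLVER_IP FQDN
probe_resolver() {
    dig +time=3 +tries=1 +short TXT "$2" "@$1" >/dev/null 2>&1
    rc=$?
    echo "$1: $(describe_dig_exit "$rc")"
}
```

Running `probe_resolver 4.4.4.4 _acme-challenge.cloud.syseleven.de.` and the same for 8.8.8.8 would show which of the configured resolvers answers at all. Note that "got a DNS response" only means the server replied, not that the TXT record was present.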

However, lego gave no indication that there was a problem. It looks like it accepted the broken server as working and carried on as if everything was fine.

Even when we replaced 4.4.4.4 with another server, the next attempt failed in the same way.

This makes me think that the propagation check doesn’t really work. How else could a random nameserver serve the correct TXT record (I surely hope that verifying this is part of the check, right?) while the query fails when Let’s Encrypt does it? I noticed that you also get the SERVFAIL error if the TXT record is simply missing. It seems extremely unlikely that the name servers worked long enough for a query via 8.8.8.8 to succeed, and then suddenly broke when Let’s Encrypt queried them.
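
Let’s Encrypt does not resolve through 8.8.8.8 but performs its own lookups against the authoritative nameservers, so it can help to ask each of those servers directly for the challenge record. A rough sketch, assuming `dig` is available; `challenge_fqdn` and `check_authoritative` are hypothetical helper names, and the zone has to be supplied by hand:

```shell
#!/bin/sh
# Build the dns-01 record name for a certificate domain.
# A wildcard such as "*.example.com" uses the same record as "example.com".
challenge_fqdn() {
    domain=${1#\*.}      # strip a leading "*." if present
    domain=${domain%.}   # strip a trailing dot if present
    echo "_acme-challenge.${domain}."
}

# Query every authoritative nameserver of ZONE for the challenge TXT record.
# Usage: check_authoritative ZONE DOMAIN
check_authoritative() {
    fqdn=$(challenge_fqdn "$2")
    for ns in $(dig +short NS "$1"); do
        echo "== $ns"
        # +norecurse: ask the server only for data it holds itself
        dig +norecurse +time=3 +tries=1 +short TXT "$fqdn" "@$ns"
    done
}
```

For the failing name this would be called as `check_authoritative ZONE cloud.syseleven.de`, where ZONE is whichever zone actually holds the record. If one of the listed servers times out or returns nothing while the others return the token, that would point at the nameservers rather than at the propagation check.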

How do you use lego?

Docker image

Reproduction steps

We use a GitLab CI pipeline to run this command periodically:

lego --accept-tos --dns designate --path /tmp/lego --dns.resolvers 8.8.8.8 --dns.resolvers 4.4.4.4 --server=https://acme-v02.api.letsencrypt.org/directory --email noreply@syseleven.de --key-type rsa4096 -d "*.cloud.syseleven.net" -d "*.infra.sys11cloud.net" -d "*.infrabk.sys11cloud.net" -d "*.infrabl.sys11cloud.net" -d "*.infrafe.sys11cloud.net" -d "cloud.syseleven.de" renew --preferred-chain "ISRG Root X1"

Version of lego

Our docker image is based on

`FROM goacme/lego:v4.9.1`

Logs

See above

Go environment (if applicable)

No response

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 18 (8 by maintainers)

Most upvoted comments

@oseiberts11, this should help you with debugging:

Dockerfile

BuildKit required!

# syntax=docker/dockerfile:1.4

FROM golang:1-alpine AS builder
RUN apk --no-cache --no-progress add make git curl
ENV GO111MODULE=on
WORKDIR /go

# clone repository
RUN <<eot
	git clone https://github.com/go-acme/lego /go/lego
	cd lego
	git checkout v4.9.1
	go mod download
eot

# download and apply patch, build lego
RUN <<eot
	cd lego
	curl -sSL https://gist.githubusercontent.com/dmke/f2d31407cc17d7801a0f32ebbe6cd283/raw/42f1e85035a617939b66b3878bb28617e692d72d/debug.patch |
		git apply -- -
	git tag debug/1777
	make build
eot

FROM alpine:3.12
RUN <<eot
	apk update
	apk add --no-cache ca-certificates tzdata
	update-ca-certificates
eot
COPY --from=builder /go/lego/dist/lego /usr/bin/lego
ENTRYPOINT ["/usr/bin/lego"]

The patch can be found here: https://gist.github.com/dmke/f2d31407cc17d7801a0f32ebbe6cd283.

To build a drop-in-replacement for the goacme/lego:v4.9.1 image, copy the Dockerfile on your system and run:

$ DOCKER_BUILDKIT=1 docker build --tag syseleven/lego:v4.9.1-debug1777 .

You probably don’t want to distribute the image, as it skips the cleanup procedure entirely.

I also started encountering this behavior, out of nowhere, after a year+ of certs renewing automatically without any issue.

I’m using the Docker image, running the command:

lego --path /lego --accept-tos --email public@lowbar.fyi --dns joker --domains 'lab.pins.atomized.org' --domains '*.lab.pins.atomized.org' renew

Logs:

2023/02/21 18:16:55 [INFO] retry due to: acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/authz-v3/198980437317 :: urn:ietf:params:acme:error:badNonce :: JWS has an invalid anti-replay nonce: "1AADws2AaIsRrYBcCtVB7bxCXSW0j6wcJCtdxyp3Lj9JJMA"
2023/02/21 18:16:56 [INFO] Skipping deactivating of valid auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/198980437317
2023/02/21 18:16:56 [INFO] Deactivating auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/205312993556
2023/02/21 18:16:57 [INFO] Deactivating auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/205312993566
2023/02/21 18:16:57 error: one or more domains had a problem:
[*.lab.pins.atomized.org] time limit exceeded: last error: NS c.ns.joker.com. did not return the expected TXT record [fqdn: _acme-challenge.lab.pins.atomized.org., value: (redacted)]: 
[lab.pins.atomized.org] time limit exceeded: last error: NS a.ns.joker.com. did not return the expected TXT record [fqdn: _acme-challenge.lab.pins.atomized.org., value: (redacted)]: 

I have solid IPv4 and IPv6 connectivity. I’ll wait a couple of hours to make sure I don’t run into a cached NXDOMAIN, and set DESIGNATE_POLLING_INTERVAL=60 to see what happens. My certs expire in 99 hours; hopefully it’ll work again before that. Not thrilled about the situation.

Thanks for the offer. I think the Dockerfile would work fine.