traefik: Traefik gets stuck on HTTP-01 ACME challenge
Welcome!
- Yes, I’ve searched similar issues on GitHub and didn’t find any.
- Yes, I’ve searched similar issues on the Traefik community forum and didn’t find any.
What did you do?
I’m setting up an HTTP-01 based certificate resolver with Let’s Encrypt for Traefik. I’m using Traefik v2.6.1.
Unfortunately, external endpoints report that they cannot connect to the server due to a timeout; see for example https://letsdebug.net/u9k.de/917912
I have made sure this is not a firewall issue and not an IPv6 issue.
The timeout ONLY occurs when hitting the example.com/.well-known/acme-challenge/example123 route, not any other routes (e.g. example.com and example.com/.well-known/ return immediately).
What did you see instead?
When I crank the logs up to DEBUG and manually fetch the challenge endpoint, I can see the following log lines:
curl -vv -4 http://u9k.de/.well-known/acme-challenge/foobar
* Trying 65.108.188.114:80...
* Connected to u9k.de (65.108.188.114) port 80 (#0)
> GET /.well-known/acme-challenge/foobar HTTP/1.1
> Host: u9k.de
> User-Agent: curl/7.81.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< Date: Fri, 18 Feb 2022 07:55:36 GMT
< Content-Length: 0
<
* Connection #0 to host u9k.de left intact
time="2022-02-18T07:45:26Z" level=debug msg="Retrieving the ACME challenge for u9k.de (token \"foobar\")..." providerName=acme
time="2022-02-18T07:46:18Z" level=error msg="Cannot retrieve the ACME challenge for u9k.de (token \"foobar\"): cannot find challenge for token \"foobar\" (u9k.de)" providerName=acme
Note the time between these two lines: it takes Traefik ~45 seconds to reply to this request.
It seems like something is holding a lock in challenge_http.go for an extended time, and the getTokenValue function (which is called from ServeHTTP) is getting stuck there:
https://github.com/traefik/traefik/blob/764bf59d4dfff2187cafa319e2b127c7e29fb3d5/pkg/provider/acme/challenge_http.go#L111
What version of Traefik are you using?
level=info msg="Traefik version 2.6.1 built on 2022-02-14T16:50:25Z"
What is your environment & configuration?
I’m using the official Traefik Helm chart to deploy Traefik on Kubernetes. Since I don’t think the chart is relevant to this question, I’m posting the Traefik configuration the chart generates. Note that these ports are internal: 8000/8443 are port-forwarded from 80/443 externally.
- --global.checknewversion
- --global.sendanonymoususage
- --entryPoints.metrics.address=:9100/tcp
- --entryPoints.traefik.address=:9000/tcp
- --entryPoints.web.address=:8000/tcp
- --entryPoints.websecure.address=:8443/tcp
- --api.dashboard=true
- --ping=true
- --metrics.prometheus=true
- --metrics.prometheus.entrypoint=metrics
- --providers.kubernetescrd
- --providers.kubernetesingress
- --entrypoints.websecure.http.tls=true
- --certificatesResolvers.le.acme.email=admin@cubieserver.de
- --certificatesResolvers.le.acme.storage=/data/acme.json
- --certificatesResolvers.le.acme.httpChallenge.entryPoint=web
- --certificatesResolvers.le.acme.caServer=https://acme-staging-v02.api.letsencrypt.org/directory
- --log.level=DEBUG
- --providers.kubernetesingress.throttleDuration=15s
- --providers.kubernetesIngress.allowEmptyServices=true
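For readers more familiar with the file provider, the resolver-related flags above correspond to roughly this static configuration. This is an illustrative sketch of the equivalent traefik.yml, not a file my deployment actually uses:

```yaml
# Sketch: file-provider equivalent of the ACME-related CLI flags above.
entryPoints:
  web:
    address: ":8000"
  websecure:
    address: ":8443"

certificatesResolvers:
  le:
    acme:
      email: admin@cubieserver.de
      storage: /data/acme.json
      caServer: https://acme-staging-v02.api.letsencrypt.org/directory
      httpChallenge:
        entryPoint: web

log:
  level: DEBUG
```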
If applicable, please paste the log output in DEBUG level
Log output:
time="2022-02-18T07:54:52Z" level=debug msg="Unable to split host and port: address u9k.de: missing port in address. Fallback to request host." providerName=acme
time="2022-02-18T07:54:52Z" level=debug msg="Retrieving the ACME challenge for u9k.de (token \"foobar\")..." providerName=acme
time="2022-02-18T07:54:52Z" level=error msg="Error getting challenge for token retrying in 603.851468ms" providerName=acme
time="2022-02-18T07:54:53Z" level=error msg="Error getting challenge for token retrying in 759.621068ms" providerName=acme
time="2022-02-18T07:54:53Z" level=error msg="Error getting challenge for token retrying in 1.437936942s" providerName=acme
time="2022-02-18T07:54:55Z" level=error msg="Error getting challenge for token retrying in 2.485258498s" providerName=acme
time="2022-02-18T07:54:57Z" level=error msg="Error getting challenge for token retrying in 1.902903522s" providerName=acme
time="2022-02-18T07:54:59Z" level=error msg="Error getting challenge for token retrying in 2.616926306s" providerName=acme
time="2022-02-18T07:55:02Z" level=error msg="Error getting challenge for token retrying in 5.436445884s" providerName=acme
time="2022-02-18T07:55:07Z" level=error msg="Error getting challenge for token retrying in 7.213840431s" providerName=acme
time="2022-02-18T07:55:15Z" level=error msg="Error getting challenge for token retrying in 10.993923627s" providerName=acme
time="2022-02-18T07:55:26Z" level=error msg="Error getting challenge for token retrying in 10.487972011s" providerName=acme
time="2022-02-18T07:55:36Z" level=error msg="Cannot retrieve the ACME challenge for u9k.de (token \"foobar\"): cannot find challenge for token \"foobar\" (u9k.de)" providerName=acme
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 16 (7 by maintainers)
Hi, I’m adding my comment to this issue because #8803 is already closed.
I have had exactly the same issue as described here and in the other issue mentioned: when Traefik starts and there are certificate challenges pending, they are requested very early in the lifetime of the Traefik process. If Traefik cannot yet answer the requests coming from Let’s Encrypt, the error above occurs.
I have now debugged this for several days and also stumbled across these two issues. I tried the workarounds: adding an init container that waits 20 seconds before starting the actual Traefik container, and even rebuilding the Traefik image to sleep 20 seconds before the Traefik process starts — both without any success.
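For context, the init-container workaround (which did not help in my case) looked roughly like this. The container name, image, and sleep duration are illustrative placeholders, not taken from a real chart:

```yaml
# Sketch of the "delay Traefik startup" workaround via an init container.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: traefik
spec:
  template:
    spec:
      initContainers:
        - name: startup-delay          # placeholder name
          image: busybox:1.35
          command: ["sh", "-c", "sleep 20"]
```

As described above, delaying the container does not help, because the gap is between the start of the Traefik process and the LoadBalancer port becoming reachable.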
Analyzing with tcpdump gave me the idea that led to the understanding: in my specific setup (k3s with its bundled “serviceLB” and Traefik exposed via a LoadBalancer Service — which should be pretty standard, as it is the default K3s setup out of the box), the TCP port of the LoadBalancer Service (80, …) does not respond to a TCP connect until about 7 seconds after the Traefik PROCESS (not the container, nor the pod) has started.
This means that for me, and for everyone using the same setup, Let’s Encrypt ACME works in general, but certificates that were pending issuance when Traefik restarted (config changes, initial setup) will never be issued. This has cost me a huge amount of time, and I am sure others will run into the same issue.
I understand that built-in “delays” of any kind are not an elegant solution, but they could solve a big problem. Another possible idea would be to retry challenges that ended with the 400 error after some configurable time (once, not periodically, because of the LE rate limits).
PLEASE, we really need a solution for this problem. It is common on Kubernetes, and it is NOT solvable in all setups by adding an init container to wait a bit.
Thanks
By default, the Helm chart configures the Traefik Service as a LoadBalancer and configures the readiness probe on the pods with an initial delay of 10 seconds. Traefik attempts the ACME challenge within those 10 seconds, so the request from LE never connects, and the LB doesn’t seem to complete the connection once it becomes ready.
I am able to get certificate requests working consistently by removing the readiness probe from the deployment, which causes the LB to allow traffic to Traefik immediately. I’m not sure this is best practice, but since everything is unreachable when Traefik is, I’m not sure what value the probe provides in this case.
I deploy the helm chart with Kustomize, and use this patch to remove the probe:
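The patch itself is not included above; a minimal sketch of a Kustomize JSON-6902-style patch that removes the probe could look like this. It assumes the chart names the Deployment `traefik` and that the Traefik container is the first in the pod spec:

```yaml
# kustomization.yaml (sketch): drop the readiness probe from the chart output.
patches:
  - target:
      kind: Deployment
      name: traefik          # assumed Deployment name from the chart
    patch: |-
      - op: remove
        path: /spec/template/spec/containers/0/readinessProbe
```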