traefik: Traefik does not start when Let's Encrypt does not work

Today we hit an issue when we tried to move our Traefik tasks from one node to another. The scheduler tried to launch the service on another node, but the task failed because of Let's Encrypt failures.

$ docker inspect 9lb8xxxxxx (traefik task)
...
"Status": {
            "Timestamp": "2016-10-31T13:11:04.538031123Z",
            "State": "failed",
            "Message": "started",
            "Err": "task: non-zero exit (1)",
            "ContainerStatus": {
                "ContainerID": "b86862bdc9563a440c36861d4756055b1e26e4a0b08146bd00793c87d0d95b84",
                "ExitCode": 1
            }
        },
...
$ docker logs 551fc96db70c
time="2016-10-31T13:03:44Z" level=error msg="Error creating TLS config get directory at 'https://acme-v01.api.letsencrypt.org/directory': acme: Error 502 - urn:acme:error:serverInternal - The service is down for maintenance or had an internal error. Check https://letsencrypt.status.io/ for more details."
time="2016-10-31T13:03:44Z" level=fatal msg="Error preparing server: get directory at 'https://acme-v01.api.letsencrypt.org/directory': acme: Error 502 - urn:acme:error:serverInternal - The service is down for maintenance or had an internal error. Check https://letsencrypt.status.io/ for more details."

The cause was a failure on the Let's Encrypt side.

This caused downtime for our proxied services. Even though our system did not need to generate any TLS certificate at that moment, Traefik failed during its startup process.

Could this error be downgraded to a warning, or could the Traefik server keep running when the Let's Encrypt service is not working? As I said, we did not need any new TLS certificate, because all of them were stored in a shared volume in our cluster. As it stands, the consequence is that even the services exposed on port 80 were not working.
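To make the request concrete, here is a minimal sketch of the startup behaviour being asked for, written in Python for illustration only. It is not Traefik's code: `AcmeUnavailable`, `prepare_acme`, `start_proxy` and the `fetch_directory` callback are hypothetical names standing in for whatever contacts the ACME directory at startup.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("proxy-startup")


class AcmeUnavailable(Exception):
    """Hypothetical error raised when the ACME directory cannot be fetched."""


def prepare_acme(fetch_directory: Callable[[], dict]) -> Optional[dict]:
    """Try to contact the ACME directory; return None instead of aborting."""
    try:
        return fetch_directory()
    except AcmeUnavailable as exc:
        # Logged as a warning rather than a fatal error, so startup continues.
        log.warning("ACME unavailable, continuing without it: %s", exc)
        return None


def start_proxy(fetch_directory: Callable[[], dict]) -> dict:
    """Return a summary of what the proxy would serve after startup."""
    directory = prepare_acme(fetch_directory)
    return {
        "http": True,                # port 80 does not depend on ACME
        "https_stored_certs": True,  # certificates already on the shared volume
        "acme_renewals": directory is not None,
    }
```

With this shape, an outage such as the 502 in the logs above would only disable renewals until the CA is reachable again; plain HTTP and the already-stored certificates would keep being served.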

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 16
  • Comments: 22 (13 by maintainers)

Most upvoted comments

I think this should happen:

  1. Traefik always starts.
  2. Traefik always serves the sites.
  3. Traefik always encrypts HTTPS connections.
    • If there is an ACME connection, handle renewals as usual and use good certs.
    • If there is no ACME connection:
      • If we have an older cert for the site, use that one (even if it is expired).
      • Otherwise (maybe a new site or a new Traefik instance), use a self-generated cert.

This follows from some red lines I believe Traefik should not cross:

  • Leaving the sites unreachable.
  • Leaving the sites unencrypted.
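The fallback order proposed in this comment can be sketched as a small decision function. This is an illustrative Python sketch, not Traefik's implementation: `CertSource`, `StoredCert` and `choose_cert_source` are hypothetical names, and the function only decides which certificate to use; it does not issue or generate one.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CertSource(Enum):
    ACME = "acme"                # normal path: CA-issued cert, renewals as usual
    STORED = "stored"            # previously issued cert found in storage
    SELF_SIGNED = "self-signed"  # last resort for new sites or new instances


@dataclass
class StoredCert:
    domain: str
    expired: bool


def choose_cert_source(acme_reachable: bool,
                       stored: Optional[StoredCert]) -> CertSource:
    """Pick a certificate source so the site is never left unencrypted."""
    if acme_reachable:
        return CertSource.ACME
    if stored is not None:
        # Per the proposal, an expired cert is still preferred over none.
        return CertSource.STORED
    return CertSource.SELF_SIGNED
```

Every branch returns some certificate source, which is the point of the two red lines: no input combination leaves a site unreachable or unencrypted.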

Let’s Encrypt is down again today. As @bvis found, this is stopping Traefik from starting and causing downtime for all our sites. I’ve removed the [acme] section from traefik.toml to get things going (HTTP only). Is this likely to be fixed?

(Sorry if this sounds like nagging, it’s just I’m otherwise really happy with traefik and it’d be a shame to have to use something else because of this)

Closed by #2794.

@mvdstam I am not minimizing the bug, I’m just trying to explain that it’s a different problem.

Once again, we are a small team with a lot of topics to cover, and we try to do our best.

For now, we are not working on this bug, so it will not be fixed by us in the next release.

Anyone can solve this problem by opening a PR, it’s open source 😃

Thanks @idez, hope to see this fixed in a subsequent release since it’s kind of a critical issue. Cheers.