moby: Daemon can't be started after swarm certificates expire

Swarm certificates renew automatically and have a 90-day expiry period by default. Still, if you don’t start the daemon during that time, the certificates will expire and starting the daemon will fail with: time="2016-06-29T17:18:06.165656736Z" level=fatal msg="Error creating cluster component: error while loading TLS Certificate in /var/lib/docker/swarm/certificates/swarm-node.crt: x509: certificate has expired or is not yet valid"
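To make the failure condition concrete, here is a minimal sketch of the validity-window check behind the "expired or is not yet valid" message. It is a simplified model in Python, not Docker's code: the function name and the issue date are assumptions chosen for illustration, while the 90-day default and the daemon start time come from the report above.

```python
from datetime import datetime, timedelta, timezone

# Default validity period mentioned in the issue (90 days).
DEFAULT_CERT_EXPIRY = timedelta(days=90)


def certificate_is_valid(not_before, not_after, now):
    """Return True if `now` falls inside the certificate's validity window."""
    return not_before <= now <= not_after


# Assumed timeline: certificate issued on 1 March 2016 and never renewed,
# because the daemon was not running in the meantime.
issued = datetime(2016, 3, 1, tzinfo=timezone.utc)
expires = issued + DEFAULT_CERT_EXPIRY

# Daemon start time taken from the log line above (truncated to seconds).
daemon_start = datetime(2016, 6, 29, 17, 18, 6, tzinfo=timezone.utc)

if not certificate_is_valid(issued, expires, daemon_start):
    print("x509: certificate has expired or is not yet valid")
```

Because renewal only happens while the daemon runs, any downtime longer than the remaining validity leaves the node with a certificate that fails this check on the next start.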

I think refusing to start, rather than ignoring this error, is correct. We could provide a --reset-swarm option to leave the swarm so the user doesn’t need to remove the state directory manually. The problem is that the user must remember to remove this option afterwards; otherwise it would clear the state on every subsequent restart as well.

Maybe a good enough solution would be to add instructions for removing the state directory in the error message.

@nathanleclaire

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 34 (22 by maintainers)

Most upvoted comments

@robbyoconnor try renaming / removing the /var/lib/docker/swarm directory, which should disable swarm mode
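For anyone unsure what "renaming the directory" looks like in practice, here is a minimal sketch in Python. The helper name and the backup suffix are made up for illustration; the default path is the one from the comment above. It assumes the daemon is stopped and the script runs with enough privileges to modify /var/lib/docker. Note that this discards the node's swarm membership.

```python
import os
import time


def set_aside_swarm_state(state_dir="/var/lib/docker/swarm"):
    """Rename the swarm state directory so the daemon no longer finds it.

    Returns the backup path, or None if the directory does not exist.
    (Hypothetical helper, not part of Docker.)
    """
    state_dir = os.path.normpath(state_dir)
    if not os.path.isdir(state_dir):
        return None
    # Rename instead of deleting, so the old state can still be inspected.
    backup = f"{state_dir}.bak-{int(time.time())}"
    os.rename(state_dir, backup)
    return backup
```

The intended sequence would be: stop the daemon, call `set_aside_swarm_state()` as root, then start the daemon again. Renaming rather than deleting keeps the old certificates and state around in case they are needed later.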

@aaronlehmann @diogomonica https://github.com/docker/docker/commit/3090be95b70d3e7c2714a1b499a7bd143af018e6 is the fix on top of #27967. I think we should wait for #27967 first because otherwise one of them would need a bad rebase.

Perhaps I should add that the expiration time is configurable through the --cert-expiry flag; https://github.com/docker/docker/blob/v1.12.3/docs/reference/commandline/swarm_init.md

@diogomonica I think we should add a section to the Swarm admin guide (https://docs.docker.com/engine/swarm/admin_guide/) to explain the (dis)advantages of setting a longer/shorter expiration time (I recall some people setting it to a really short time, e.g. 1 hour)

My Docker for Mac suddenly didn’t start any more.

Luckily I found the $HOME/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/log/docker.log file, which stated that swarm-node.crt within the Docker image is no longer valid. I was able to move the Docker.qcow2 image to a Linux box, mount it, remove the swarm-node.crt file inside it, and move the image back, and Docker works again. But how is an average user supposed to fix that issue? IMHO this urgently has to be fixed.

Could you please elaborate on “how to remove the state directory”?

This makes me very uneasy – This should not prevent the engine from starting. It should make swarm features not work, but the engine should start.

@diogomonica The problem is that the daemon doesn’t start, so you have no way to execute swarm leave.

I think our options are:

  • Add --reset-swarm and risk someone accidentally using it and losing their swarm state. Also, this has limited benefits for docker4mac unless they add it in the UI.
  • Log the error and start the daemon without swarm. If you run docker info you will see the certificate error. You can then do docker swarm leave or maybe docker swarm init --force-new-cluster.
  • As this is mostly for dev machines, which are unlikely to have multiple nodes, we could just use the CA key that is locally available to generate a new certificate.
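
The second option (log the error and start without swarm) can be sketched as the control flow below. This is an illustrative Python model only: the daemon itself is not written in Python, and `start_daemon`, `load_swarm`, and `CertificateError` are hypothetical names that do not correspond to anything in the Docker codebase.

```python
import logging

log = logging.getLogger("daemon-sketch")


class CertificateError(Exception):
    """Stand-in for the x509 'expired or not yet valid' error."""


def start_daemon(load_swarm):
    """Start a (hypothetical) daemon where the swarm component is optional.

    `load_swarm` is a callable that returns a cluster object or raises
    CertificateError. Returns a dict describing the resulting daemon state.
    """
    state = {"running": True, "swarm": None, "swarm_error": None}
    try:
        state["swarm"] = load_swarm()
    except CertificateError as err:
        # Log instead of exiting fatally, and keep the message so that a
        # status command (like `docker info` in the proposal) can show it.
        log.error("Error creating cluster component: %s", err)
        state["swarm_error"] = str(err)
    return state
```

With this shape, the engine stays reachable when the certificate has expired, so the user can still run a command to leave or re-initialize the swarm.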