docker: latest 'dind' tag (19.03) gives error on Gitlab CI "failed to dial gRPC: cannot connect to the Docker daemon. Is 'docker daemon' running on this host?"

We are running a GitLab server and several gitlab-ci-runners. Today we woke up to several failed builds. We did several tests and found that the most likely culprits are the newest tags of the docker:dind and docker:git images. We tested with docker:18-dind and docker:18-git and the errors do not occur anymore.

The error is given below:

time="2019-07-23T06:52:31Z" level=error msg="failed to dial gRPC: cannot connect to the Docker daemon. Is 'docker daemon' running on this host?: dial tcp 172.17.0.3:2375: connect: connection refused"

The gitlab-runners are running in privileged mode.
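
For context, the failing jobs follow the usual docker-in-docker pattern; a minimal .gitlab-ci.yml along these lines is enough to hit the error (the job name, image tag, and build command are illustrative, not our exact configuration):

# Sketch of a typical docker-in-docker job that breaks with the 19.03 'latest' dind image.
# With 19.03, the daemon no longer listens on plain TCP 2375 by default, so this client
# cannot connect.
build:
  image: docker:latest
  services:
    - docker:dind
  variables:
    DOCKER_HOST: tcp://docker:2375    # pre-19.03 default, no TLS
  script:
    - docker info
    - docker build -t example/app .   # 'example/app' is a placeholder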

EDIT: This is not a bug or unresolved issue: see: https://github.com/docker-library/docker/issues/170#issuecomment-514366149, https://about.gitlab.com/2019/07/31/docker-in-docker-with-docker-19-dot-03/

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 78
  • Comments: 34 (10 by maintainers)

Most upvoted comments

@tianon While I appreciate that using the stable or latest tags on the docker image runs the risk of breaking changes, Docker 19.03 has been in beta and RC for over 4 months and this change to the image was made just 6 days ago. I’ve been testing the docker:19.03.0-rc* images in my GitLab CI pipelines for months in preparation for the release, and didn’t run into this breaking change because it wasn’t in any of the RCs.

I think it’s very poor form to introduce such a breaking change in the last few days of a major release without any notifications.

I fixed my self-hosted runners (Debian, runners installed using apt-get):

$ nano /etc/gitlab-runner/config.toml
[[runners]]
-  environment = ["DOCKER_DRIVER=overlay2"]
+  environment = ["DOCKER_DRIVER=overlay2","DOCKER_TLS_VERIFY=1","DOCKER_CERT_PATH=/certs/client"]
  [runners.docker]
-    tls_verify = false
    image = "docker:dind"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
-    volumes = ["/cache"]
+    volumes = ["/cache","/certs"]

And then:

$ service gitlab-runner restart
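
If you prefer to keep TLS enabled but configure it per project rather than at the runner level, the same idea expressed as pipeline variables looks roughly like this (a sketch based on the image's new defaults; the /certs volume still needs to be shared in config.toml as above):

# Sketch: per-project variables for TLS-enabled docker-in-docker with 19.03.
variables:
  DOCKER_HOST: tcp://docker:2376        # TLS port instead of 2375
  DOCKER_TLS_CERTDIR: "/certs"          # where dind generates the certificates
  DOCKER_TLS_VERIFY: "1"                # make the client use TLS
  DOCKER_CERT_PATH: "/certs/client"     # client certificates written by dind

build:
  image: docker:19.03
  services:
    - docker:19.03-dind
  script:
    - docker info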

@JanMikes as the person responsible for the CI runners at our company: it’s already in full effect. People are assuming the runners are broken. 😣

Best example of why going with “blanket tags” like latest is a no-no.

Since jubel-hans's comments are no longer here, I will repost the part that did the trick for me.

Adding the following variable:

  DOCKER_TLS_CERTDIR: ''

I also changed the Docker image tag from stable to stable-dind, but I'm not sure if that was needed. Edit: after further testing, it was not needed.
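
In .gitlab-ci.yml, setting that variable looks roughly like this (a sketch; the empty value restores the pre-19.03 behaviour of a plain-TCP daemon on port 2375):

variables:
  DOCKER_HOST: tcp://docker:2375
  DOCKER_TLS_CERTDIR: ""    # empty value disables certificate generation, so dind listens without TLS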

There will not be an update in this repository to “fix” this as 19.03.0 is now released and GA and the TLS behavioral change was intentional (and applied to 19.03+ only by default to give folks two separate escape hatches to opt out – environment variable or downgrade).

See https://gitlab.com/gitlab-org/gitlab-runner/issues/4501#note_194648542 for a comment from a GitLab team member that sums up my thoughts even better than I could.

Same thing here. We reverted to the 18-dind tag in GitLab in the meantime.

@kinghuang IMHO it's always poor practice to introduce a breaking change where it could easily be avoided. In this case we have a new feature that breaks old functionality when a variable is enabled, and the problem is that it is enabled by default. I don't quite understand what people doing such things have in mind. Unfortunately, it's not the first time I've seen something like that in a stable and broadly used open source project.

Specifying 18-dind as tag fixed it for us for now 😃

I thought I'd try setting up my Jenkins slaves to work correctly, using this manifest generated by the K8S plugin: https://gist.github.com/REBELinBLUE/97a5c13c2589bb1f3df5a5b330718eb0

But it doesn't seem to generate all the certificates before the job starts. I added ls /certs/** to the start of the job and I end up with:

/certs/ca:
cert.pem
cert.srl
key.pem

/certs/client:
key.pem

/certs/server:
ca.pem
cert.pem
csr.pem
key.pem
openssl.cnf

If I add the liveness probe, it seems to generate the certificates before it fully starts, but then when I try to run docker commands I end up with Error response from daemon: Client sent an HTTP request to an HTTPS server. (Yes, I set the ports to 2376.)

In the end I gave up and just set DOCKER_TLS_CERTDIR to an empty value and set the ports back to 2375, but I'd like to get it working properly.

A TCP connection without tlsverify has been discouraged for years.

@janw Yep, exactly the same here, everyone was all over me this morning. My fault, shouldn’t have set the Jenkins slaves to use stable-dind, setting to 18-dind as suggested has fixed the issue. 🤦‍♂️

GitLab now has a really nice blog post up describing the situation and how to fix it if your environment is affected: https://about.gitlab.com/2019/07/31/docker-in-docker-with-docker-19-dot-03/ 👍

Besides setting DOCKER_HOST to use port 2376, you need to set DOCKER_TLS_VERIFY=1 and DOCKER_CERT_PATH=/certs/client to tell Docker to use TLS (and where to get certificates to handshake with).

Also, you should only share /certs/client with your client containers.

See also:

https://github.com/docker-library/docker/blob/d45051476babc297257df490d22cbd806f1b11e4/docker-entrypoint.sh#L22-L33
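
Outside of GitLab, the same wiring can be sketched with a small Compose file (image tags and service names here are illustrative): the daemon keeps its generated certificates under /certs, and only the /certs/client directory is mounted into the client container.

# Sketch: dind with TLS (19.03 defaults) plus a client that only sees /certs/client.
version: "3.7"
services:
  docker:
    image: docker:19.03-dind
    privileged: true                     # dind still requires privileged mode
    environment:
      DOCKER_TLS_CERTDIR: /certs         # 19.03 default; daemon listens with TLS on 2376
    volumes:
      - docker-certs-client:/certs/client
  client:
    image: docker:19.03
    depends_on:
      - docker
    environment:
      DOCKER_HOST: tcp://docker:2376
      DOCKER_TLS_VERIFY: "1"
      DOCKER_CERT_PATH: /certs/client
    volumes:
      - docker-certs-client:/certs/client:ro   # share only the client certs
    command: docker info
volumes:
  docker-certs-client: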

We hit this error using the stable tags of the stable and stable-dind images, but 18-dind works.

A TCP connection without tlsverify has been discouraged for years.

If you are running everything locally, it’s okay though.

My company and I got caught by this issue. That's fine, but that commit seems a bit rushed, like a big breaking change just before the release… (please don't revert it, though)