features: Docker-in-docker: Add retry mechanism into the docker init script (Failed to connect to Docker)

Sometimes docker fails to start within a container with the following error 👇

Failed to connect to Docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

As manually running /usr/local/share/docker-init.sh fixes this issue, add a retry mechanism to the docker-init script.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 27 (12 by maintainers)

Most upvoted comments

docker-in-docker v2.4.0 includes the following changes (see https://github.com/devcontainers/features/pull/669):

  • Adds retries for docker daemon startup
  • We have seen errors like "sed: couldn't flush stdout: Device or resource" that prevent docker from starting; this adds retries to work around such sed errors
  • Adds a workflow that runs 100 jobs to validate docker startup: 50 jobs validate dockerd after the container has started, and 50 jobs validate it within the onCreateCommand. The stress test is 🟢
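To illustrate the kind of retry logic described in the list above, here is a minimal sketch in POSIX shell. It is not the code from docker-init.sh or from the linked PR; the function name `retry`, its arguments, and the log wording are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical helper (not taken from docker-init.sh): run a command until it
# succeeds or the allowed number of attempts is used up.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
    retry_max="$1"
    retry_delay="$2"
    shift 2
    retry_attempt=1
    while true; do
        # Success: stop retrying immediately.
        if "$@"; then
            return 0
        fi
        # Out of attempts: report the failure to the caller.
        if [ "$retry_attempt" -ge "$retry_max" ]; then
            echo "(!) Command failed after ${retry_attempt} attempts: $*" >&2
            return 1
        fi
        echo "(*) Attempt ${retry_attempt} failed, retrying in ${retry_delay}s..." >&2
        sleep "$retry_delay"
        retry_attempt=$((retry_attempt + 1))
    done
}
```

A call such as `retry 5 5 docker info` would then probe the daemon up to five times, five seconds apart. The real script may structure its retries differently, for example by restarting dockerd between attempts.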

Important Note: /usr/local/share/docker-init.sh, which starts/retries dockerd, is added to the entrypoint command. This command runs in the background and does not block container startup. Since it’s in the background, onCreateCommand/postCreateCommand/postStartCommand could all start executing before docker is fully running. If docker takes too long to start, that could introduce flakiness in those lifecycle scripts.
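Until blocking entrypoints exist, one possible workaround is for a lifecycle script to wait for the daemon itself before running any docker command. The following is a rough sketch, not something the Feature ships; the name `wait_for_docker`, the one-second polling interval, and the optional probe argument are all assumptions. It relies only on `docker info` exiting non-zero while the daemon is unreachable.

```shell
# Hypothetical helper for a lifecycle script (for example onCreateCommand).
# Usage: wait_for_docker TIMEOUT_SECONDS [PROBE_COMMAND...]
# The probe defaults to "docker info"; a custom probe can be passed for testing.
wait_for_docker() {
    wfd_timeout="$1"
    shift
    if [ "$#" -eq 0 ]; then
        set -- docker info
    fi
    wfd_waited=0
    # Poll once per second until the probe succeeds or the timeout is reached.
    until "$@" >/dev/null 2>&1; do
        if [ "$wfd_waited" -ge "$wfd_timeout" ]; then
            echo "(!) Docker did not become ready within ${wfd_timeout}s" >&2
            return 1
        fi
        sleep 1
        wfd_waited=$((wfd_waited + 1))
    done
    return 0
}
```

A lifecycle script could call `wait_for_docker 30 || exit 1` before its first docker command. The 30-second budget is an arbitrary example, not a recommended value.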

Opened https://github.com/devcontainers/spec/issues/299, which requests new semantics for “blocking” entrypoints that the CLI waits for. This way we can ensure that docker is already up and running in the container before the lifecycle scripts mentioned above execute.

Closing in favor of https://github.com/devcontainers/features/issues/671. Feel free to reopen if needed, or comment on https://github.com/devcontainers/features/issues/671 if you still run into issues with docker not running. Thank you!

Re-opening as retry logic is reverted. See https://github.com/devcontainers/features/pull/659

Opened https://github.com/devcontainers/features/issues/660 for tracking docker failures due to "sed: couldn't flush stdout: Device or resource" errors

@samruddhikhandale Thanks so much for the detailed technical background information, very helpful! 🙏

Since the bug occurred only 1 time out of 15, I can’t really say whether this fixes the problem. I’ll post here again if it happens again, but hopefully that won’t be the case. 👍

@mandrasch One more thing: the universal image is cached in a codespace, so even now you will pull an older image (unless you pin it to 2.5.0). I am working on updating the cache for Codespaces, but that will take a day or two.

Is there a way to check which docker-in-docker version is used inside the universal image? (I checked https://github.com/devcontainers/images/pull/705 but could not find a commit related to a version number?) Thanks!

Unfortunately, I don’t think there’s a direct way to find out the Feature version.

I created a sample repo with a configuration similar to ours. However, the prebuild doesn’t fail on this one. I would need to spend some more time on this to reproduce the issue.

https://github.com/tom-growthbox/prebuild-error

We have experienced this issue consistently in the last 4 days. It happens during codespaces prebuild. I see the line (*) Failed to start docker, retrying in 5s... once in each of the failed jobs. Successful jobs do not have this line in the log. It didn’t start with version 2.3.0, but somehow downgrading to 2.2.1 fixes it. I have not seen the error with 2.2.1.

@samruddhikhandale Thank you for your fast reply!

We created more than 30 codespaces during the last few hours to try to reproduce the issue. At the moment we are not able to produce (*) Failed to start docker, retrying in 5s... in the creation logs. Earlier this morning the docker step was failing in almost every codespace.

Today, we made sure that the prebuilt image was not used and that the codespaces were created for new branches. Also, we explicitly set 2.3.0 for the D-in-D feature.

We will keep an eye on the logs and the stability and will create another issue in case it is reproducible.