terraform-provider-docker: Intermittent errors with using SSH

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave “+1” or “me too” comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform (and docker Provider) Version

Terraform v1.0.3
on linux_amd64
+ provider registry.terraform.io/hashicorp/local v2.1.0
+ provider registry.terraform.io/hashicorp/random v3.1.0
+ provider registry.terraform.io/kreuzwerker/docker v2.14.0
+ provider registry.terraform.io/scottwinkler/shell v1.7.7

Affected Resource(s)

docker_container

Debug Output

╷
│ Error: Unable to read Docker image into resource: unable to pull image yandex/clickhouse-server:20.3.9.70: command [ssh -l username -- 123.123.123.123 docker system dial-stdio] has exited with signal: killed, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=
│ 
│   with module.clickhouse.docker_image.clickhouse,
│   on modules/clickhouse/main.tf line 23, in resource "docker_image" "clickhouse":
│   23: resource "docker_image" "clickhouse" {
│ 
╵

Expected Behaviour

No error.

Actual Behaviour

Intermittent errors with error messages as above.

Steps to Reproduce

Launch 10-15 containers with different images (kafka, postgres, etc.) over SSH. The SSH connection is the important part.

About this issue

Original URL
State: open
Created 3 years ago
Reactions: 22
Comments: 40 (13 by maintainers)

Commits related to this issue

Add integration test for SSH remote connections Uses a Digitalocean Droplet (Docker/Ubuntu) to provide a remote Docker host and verifies that it can successfully create a RemoteImage and run ... — committed to pulumi/pulumi-docker by guineveresaenger a year ago
Explicitly read in SSH config flags to the docker client configuration (#707) * Read in SSH flags to the connection helper to avoid error in Configure * Add integration test for SSH remote connect... — committed to pulumi/pulumi-docker by guineveresaenger 10 months ago

Most upvoted comments

Maintainer here: I am monitoring this issue closely and reading every comment. Under the hood we are simply using the docker client, so any issues from the docker client also appear in this provider. So anyone with that issue, please try out the “docker ssh” workaround from the comment above.

I still have not managed to build a reproducible case myself, that’s the first thing on my list. I won’t have time in the next 2-3 weeks, but after that hopefully will try to tackle this. Even after building a reproducible case I am not sure whether there will be a single/simple solution for that issue. Let’s see…

Junkern on Aug 9, 2022

Someone posted a solution on the docker-compose issue that not only works, but also speeds up the deployment since the docker socket is passed through a single socket instead of opening multiple ssh connections.

https://github.com/docker/compose/issues/8544#issuecomment-1060664712

Another option if you want to connect over SSH but not deal with all of docker’s ssh flakiness is to set up unix socket forwarding over the SSH connection. I ended up writing a script:
dockerssh() {
  rm -f /tmp/docker.sock
  cleanup() {
    ssh -q -S docker-ctrl-socket -p "${PORT}" -O exit "${HOST}"
    rm -f /tmp/docker.sock
  }
  trap "cleanup" EXIT
  ssh -M -S docker-ctrl-socket -p "${PORT}" -fnNT -L /tmp/docker.sock:/var/run/docker.sock "${HOST}"
  DOCKER_HOST=unix:///tmp/docker.sock eval "$*"
}
Then dockerssh docker compose ... to run a docker command pointing at that host or dockerssh bash to start a new shell pointing at that remote host.

I use it this way in my CI/CD scripts:

ssh -M -S ssh-control-socket -fnNT -L /tmp/docker.sock:/var/run/docker.sock "${HOST}"
DOCKER_HOST=unix:///tmp/docker.sock terraform apply -auto-approve
ssh -O exit -S ssh-control-socket "${HOST}"

Don't forget to remove the host = "ssh:/..." setting from your Terraform configuration.
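
For reference, the two snippets above can be folded into one wrapper. This is only a sketch under a few assumptions: with_docker_tunnel is a made-up name, its first argument is an ssh destination that permits socket forwarding, and the local OpenSSH is recent enough to forward Unix sockets. The body runs in a subshell because an EXIT trap set inside a plain bash function only fires when the calling shell exits, so the tunnel opened by the dockerssh function would otherwise outlive the wrapped command.

```shell
#!/usr/bin/env bash
# Sketch: run one command with DOCKER_HOST pointing at a forwarded socket.
# with_docker_tunnel is a hypothetical helper name, not part of any tool.
with_docker_tunnel() (
  # Parentheses make the body a subshell, so the EXIT trap below fires
  # when this function finishes, not when the calling shell exits.
  host="$1"; shift
  dir="$(mktemp -d)" || exit 1      # private directory for both sockets
  ctrl="${dir}/ctrl"
  sock="${dir}/docker.sock"
  cleanup() {
    # Ask the master connection to exit, then remove the sockets.
    ssh -q -S "${ctrl}" -O exit "${host}" 2>/dev/null
    rm -rf "${dir}"
  }
  trap cleanup EXIT
  # One master connection that forwards the remote Docker socket.
  ssh -M -S "${ctrl}" -fnNT -L "${sock}:/var/run/docker.sock" "${host}" || exit 1
  # Run the wrapped command against the forwarded socket.
  DOCKER_HOST="unix://${sock}" "$@"
)

# Usage (not run here):
#   with_docker_tunnel "${HOST}" terraform apply -auto-approve
```

As with the snippets above, this only helps if the provider block no longer sets host, so that DOCKER_HOST is picked up.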

adam-lebon on Aug 8, 2022

I see the same error on large deployments. It started with version 2.12, which runs more slowly and fails with a “connection refused” error; with 2.11 there is no error and everything runs fast.

binlab on May 18, 2022

@k2m30 I don’t think this issue is related to the server in any way, since downgrading the docker provider version solves the issue.

But while digging into the issue, I found out that the docker-compose community has been facing the exact same issue since May 2021 (which matches the very first message of this issue). So all this time I blamed the “terraform-sdk v2” PR, while the bug seems to be more related to the docker client.

The problematic version of the provider bumped the docker client from v20.0.0 to v20.10.5. It may be something to investigate.

adam-lebon on Aug 4, 2022

@compojoom indeed I do not think the ssh suggestion from above fixes the issue (it merely hides it / alleviates part of it).

dubo-dubon-duponey on Nov 9, 2021

Closing bug reports for lack of activity on the project might not be the right way to deal with issues…

dubo-dubon-duponey on Nov 6, 2021

I’m also commenting on this to keep this issue open. This is a serious issue.

tiaden on Aug 9, 2023

Hi, guys. You can reproduce this bug as shown below:

terraform {
  required_providers {
    docker = {
      source  = "kreuzwerker/docker"
      version = "2.19.0"
    }
  }
}

provider "docker" {
  host     = "ssh://root@127.0.0.1:22"
  ssh_opts = ["-o", "StrictHostKeyChecking=no", "-o", "UserKnownHostsFile=/dev/null"]
}

resource "docker_container" "nginx" {
  count = 15
  name  = "nginx_${count.index}"
  image = "nginx:latest"
}

If you execute terraform apply, you may get an error like this:

Unable to create container: error during connect: Post "http://docker.example.com/v1.40/containers/create?name=nginx_8": command [ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -l root -p 22 -- 127.0.0.1 docker system dial-stdio] has exited with signal: killed, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=Warning: Permanently added '127.0.0.1' (ECDSA) to the list of known hosts

But if you execute terraform apply -parallelism=1, it works.

So in my opinion, this bug may be caused by concurrency.
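
If dropping to -parallelism=1 for every run is too slow, one option is to fall back to it only when a faster run fails. The following is a rough sketch, not a fix: apply_with_fallback is a hypothetical helper, the levels 10, 4 and 1 are arbitrary (10 is Terraform's default), and it assumes that re-running terraform apply after a partial failure is acceptable for your configuration.

```shell
#!/usr/bin/env bash
# Sketch: retry `terraform apply` with progressively lower parallelism.
# apply_with_fallback is a hypothetical helper; the levels are arbitrary.
apply_with_fallback() {
  local n
  for n in 10 4 1; do
    echo "terraform apply with -parallelism=${n}" >&2
    # Extra arguments (for example -var-file=...) are passed through.
    if terraform apply -auto-approve -parallelism="${n}" "$@"; then
      return 0
    fi
  done
  # Every level failed, including the serial run.
  return 1
}

# Usage (not run here):
#   apply_with_fallback
```

This only hides the SSH failures behind retries; the serial run at the end is the one that has been reported to work reliably.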

This reply tells me that the problem lies in opening too many SSH connections.

Am I right, @Junkern, @prologic?

UPD: Maybe the solution would simply be to introduce an SSH connection-pooling proxy?

AndreiPashkin on Oct 30, 2022

After testing multiple commits manually, I found that the bug was introduced by the migration to terraform-sdk v2 (MR Link), which is a huge MR 😭

adam-lebon on May 9, 2022

Same with 2.16: I get “unable to pull” when there are more than 2 containers to deploy.

I reverted to 2.11 and the problem is not present.

maximegirardet on Jan 27, 2022

Reusing sockets seems to help with the issue in my use case (https://docs.rackspace.com/blog/speeding-up-ssh-session-creation/#:~:text=The ControlMaster option is one,over the same underlying connection.)

Pretty much:

Host *
    ControlMaster     auto
    ControlPath       ~/.ssh/control-%C
    ControlPersist    yes

Not clear to me what could have changed (I assume in the vendored docker/cli) - but they do also mention it in their docs: https://docs.docker.com/engine/security/protect-access/#ssh-tips

This is not a fix though, and likely to cause issues.
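
To check whether connections are actually being shared, the master can be opened and queried by hand before running Terraform. A small sketch, assuming the Host * block above is in ~/.ssh/config so that later ssh invocations use the same ControlPath; open_master and master_alive are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: open a shared master connection up front, then confirm it is alive.
# open_master and master_alive are hypothetical names; the ControlPath value
# mirrors the ssh config shown above.
CONTROL_PATH="${HOME}/.ssh/control-%C"

open_master() {
  # -f: go to the background after authentication, -N: no remote command.
  ssh -o ControlMaster=auto -o ControlPath="${CONTROL_PATH}" \
      -o ControlPersist=yes -fN "$1"
}

master_alive() {
  # -O check asks the master process behind ControlPath whether it is running.
  ssh -o ControlPath="${CONTROL_PATH}" -O check "$1"
}

# Usage (not run here):
#   open_master user@remote-host && master_alive user@remote-host
```

If master_alive fails right after a terraform apply, the provider's ssh processes were probably not going through the shared connection.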

dubo-dubon-duponey on Sep 2, 2021

Maybe this is related. #80

I don’t think this is related to #80. I have been using this docker provider for a year, and I started seeing this error a few weeks ago when deploying more than ~10 containers. It appears randomly when too many SSH connections are established. Even with -parallelism=2, the error still occurs. The Terraform debug output doesn’t provide any useful information, and stderr is empty, as mentioned by @AndreiPashkin.

I will try to investigate this over the next few weeks.

adam-lebon on Aug 12, 2021