compose: Error during pull

This error occurs during docker compose pull when using the --context argument.

Steps to reproduce the issue:

  1. docker --context mycontext compose -f docker-compose.letsencrypt.yml pull
  2. error during connect: Get "http://docker.example.com/v1.41/images/mongo-express/json": command [ssh -l root -- pharma-bio.hr docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=kex_exchange_identification: Connection closed by remote host

Describe the results you received: error during connect: Get "http://docker.example.com/v1.41/images/mongo-express/json": command [ssh -l root -- pharma-bio.hr docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=kex_exchange_identification: Connection closed by remote host

Describe the results you expected: Normal pull of all images, both from my Docker Hub repositories and public ones

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker compose version:

Docker Compose version v2.4.1

Output of docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.8.2)
  compose: Docker Compose (Docker Inc., v2.4.1)
  sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc., 0.6.0)
  scan: Docker Scan (Docker Inc., v0.17.0)

Server:
 Containers: 22
  Running: 13
  Paused: 0
  Stopped: 9
 Images: 111
 Server Version: 20.10.14
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3df54a852345ae127d1fa3092b95168e4a88e2f8
 runc version: v1.0.3-0-gf46b6ba
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.10.60.1-microsoft-standard-WSL2
 Operating System: Docker Desktop
 OSType: linux
 Architecture: x86_64
 CPUs: 20
 Total Memory: 15.52GiB
 Name: docker-desktop
 ID: MGP6:OKSX:HXN5:OL2J:CK7W:OJQ2:HTGL:COC7:OFZY:DYIQ:HZNR:UB35
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  hubproxy.docker.internal:5000
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No blkio throttle.read_bps_device support
WARNING: No blkio throttle.write_bps_device support
WARNING: No blkio throttle.read_iops_device support
WARNING: No blkio throttle.write_iops_device support

Additional environment details: I’m using Ubuntu 20.04 on WSL on the client side, and deploying to Ubuntu Server 20.04 (both sides fully patched).

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 9
  • Comments: 57 (5 by maintainers)

Most upvoted comments

I also have the issue. It’s intermittent but I’m able to reproduce it about 9/10 times.

error during connect: Post "http://docker.example.com/v1.41/images/create?fromImage=docker&tag=latest": command [ssh -l user -- host docker system dial-stdio] has exited with signal: killed, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=

After a lot of trial and error and poking around, I think I’ve found the cause and some possible solutions. I tried the MaxStartups 500 fix above, but it didn’t solve the issue for me (in fact, the problem seems to be on the client, see below).

TL;DR: it’s some kind of race condition with the ssh process spawned on the client.

Minimal repro steps

version: "3.9"
services:
  docker:
    image: docker
  alpine:
    image: alpine

docker -H ssh://[user]@[host] compose pull
OR
DOCKER_HOST=ssh://[user]@[host] docker compose pull

Environment (client)

  • Latest docker image (15b7f46c4c94) OR
  • Ubuntu 20.04 with docker-ce-cli 5:20.10.18~3-0~ubuntu-focal

As the error message points out, a command is “killed” while doing the pull. By tracing the docker daemon logs (journalctl -fxu docker.service on the remote machine) and comparing a successful attempt with a failed attempt, I found out that some of the requests sent by the client were never received by the daemon.

Since there were no signs of a process being killed or any other sign of an issue on the remote, it had to be on the client. But what could possibly kill the ssh client? I used auditctl to find out. The answer is a bit surprising: it’s the docker-compose binary itself (which is the parent process)!

Digging further:

As some previous comments suggested, commit 625a48d indeed triggers the issue, specifically lines 58-62. Commenting out these lines seems to completely fix it.

I print-debugged my way to understanding what exactly was going on. Here’s what I think is happening. Disclaimer: I don’t have much experience in Go (in fact, I’d never used it before), so my comprehension might be flawed. Please correct me if I’m wrong.

  1. The call to getLocalImagesDigests in turn calls getImages: the for loop starts many goroutines in parallel, which are then awaited before returning, e.g.:
     eg, ctx := errgroup.WithContext(ctx)
     for _, img := range images {
       img := img
       eg.Go(func() error {
         inspect, _, err := s.apiClient().ImageInspectWithRaw(ctx, img)
         // ... returns err or nil
       })
     }
     return summary, eg.Wait()
  2. Each call to ImageInspectWithRaw ends up spawning a new ssh connection. The http request is piped through ssh to docker system dial-stdio on the remote. In the docker cli’s commandconn.go, the command is passed the shared Context from the errgroup in the caller (this will be important later).
  3. The goroutine executes the http request and returns once the whole response body is read.
  4. When all goroutines finish, the call to eg.Wait() unblocks and the shared Context (linked to each ssh command) becomes Done().
  5. As documented, the context’s completion kills the process.
  6. Even though the whole http response body has been read and the goroutine has returned, the ssh process is still running, because Go’s http client implementation seems to keep the net.Conn open (in this case, the ssh commandConn) - maybe even reuses it (?). It still tries to read stdout:
     github.com/docker/cli/cli/connhelper/commandconn.(*commandConn).onEOF(0xc0000bf080, {0x20404c0?, 0xc0000a2120})
         github.com/docker/cli@v20.10.17+incompatible/cli/connhelper/commandconn/commandconn.go:216 +0x4fb
     github.com/docker/cli/cli/connhelper/commandconn.(*commandConn).Read(0xc0000bf080, {0xc000535000, 0x0?, 0x1000})
         github.com/docker/cli@v20.10.17+incompatible/cli/connhelper/commandconn/commandconn.go:268 +0x28c
     net/http.(*persistConn).Read(0xc000695c20, {0xc000535000?, 0xc000520ba0?, 0xc0005afd30?})
         net/http/transport.go:1929 +0x4e
     bufio.(*Reader).fill(0xc0005bca20)
         bufio/bufio.go:106 +0xff
     bufio.(*Reader).Peek(0xc0005bca20, 0x1)
         bufio/bufio.go:144 +0x5d
     net/http.(*persistConn).readLoop(0xc000695c20)
         net/http/transport.go:2093 +0x1ac
     created by net/http.(*Transport).dialConn
         net/http/transport.go:1751 +0x173e
    
    (line numbers in the stack trace won’t match for commandconn.go due to some logs I sprinkled around)
  7. We end up in onEOF, where we wait for the process to exit normally, but this fails because it was previously killed.
    go func() { werrCh <- c.cmd.Wait() }()
    
    On line 165, the observed error message is returned to the caller, the whole errgroup fails, and everything is aborted.

Solutions

  1. ❌ Linking the Context to the command is what causes the process to be killed. However, I think it’s legitimate to do so in case there’s an actual error. All requests run in parallel and all need to succeed in order to continue. If one of them fails, we want to kill/cancel the others to avoid unnecessary waiting.
  2. 💡 It would be nice if there was a way for errgroup.Wait() not to set the Context to done when there is no error, or some way to know whether the Context is done because of an error or because everything completed successfully. Then, with a small change, we could kill the command only in the case of an error. In theory, this would allow connection reuse (so potentially faster): the ssh process would keep running after the request/response completes, and subsequent requests could then be run through it (docker system dial-stdio seems to support multiple requests on the same session). IMO, this would be the ideal solution if doable (preferable to solution #4 below).
  3. ❌ Not running things in parallel: this would slow things down.
  4. ✅ Actually close the net.Conn before the goroutine returns - not just finish reading the response. I found that calling CloseIdleConnections properly closes the commandConn. Just before returning from ImageInspectWithRaw, I added cli.client.CloseIdleConnections() and it completely fixed the issue 🎉 However, since this could theoretically also happen in other functions, there is probably a better place to put this. Maybe somewhere in request.go? Since this needs to happen after the body is completely read, there would need to be a way to detect that.

I know this is a lot to digest for what looks like a complicated issue, but could someone confirm some of this and chime in on the possible solutions? Help is welcome to get a PR started 😃

One of the reasons I dedicated this much time to investigating the issue is that our CI pipeline relies on this behavior for deployments. Having it fail most of the time is not ideal, to say the least 😉

Closing as “fixed by https://github.com/docker/cli/pull/3900” Thanks @pdaig 👍

I found a solution. I’ve opened a PR in docker/cli#3900.

Update: docker:23 has been updated on Docker Hub with compose v2.16. I can happily confirm the issue is fixed! 🥳

Docker Desktop and packages for Ubuntu/Debian/other OS should follow as well, but I don’t have the update schedule.

Doesn’t seem to be. I have Docker version 23.0.0 with docker-compose v2.15.1, and I am getting that same error:

error during connect: Post "http://docker.example.com/v1.42/images/create?fromImage=bitnami%2Fdrupal&tag=latest": command [ssh -l [MASKED] -- [MASKED] docker system dial-stdio] has exited with signal: killed, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=

when I run docker-compose -H "ssh://$SERVER_USER@$SERVER_IP" pull from my GitLab pipeline.

Note: just before this, I run docker-compose -H "ssh://$SERVER_USER@$SERVER_IP" down --remove-orphans, but I don’t get the error then.

@glours @zachberger

Here is a minimal example demonstrating the problem. It seems to be related to custom registries.

this does not work:

version: '3.9'
services:
  hass:
    image: ghcr.io/home-assistant/home-assistant:2022.7.5
  ubuntu:
    image: ubuntu

docker --context xxxx compose -f ./docker-compose-bad.yml pull
WARN[0000] commandConn.CloseWrite: commandconn: failed to wait: signal: killed 
[+] Running 0/0
 ⠿ hass Error                                                                                                                0.0s
 ⠿ ubuntu Error                                                                                                              0.0s
error during connect: Post "http://docker.example.com/v1.41/images/create?fromImage=ubuntu&tag=latest": command [ssh -l user -- xxxx docker system dial-stdio] has exited with signal: killed, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=

But commenting out either one of the services results in a successful pull.

docker context inspect xxxx
[
    {
        "Name": "xxxx",
        "Metadata": {},
        "Endpoints": {
            "docker": {
                "Host": "ssh://user@xxxx",
                "SkipTLSVerify": false
            }
        },
        "TLSMaterial": {},
        "Storage": {
            "MetadataPath": "/home/ivlis/.docker/contexts/meta/1f54ca1cf12c1f6a1c7c11ab42fe5b58e6bddead8e8024e3a4d9cd7ef345ff55",
            "TLSPath": "/home/ivlis/.docker/contexts/tls/1f54ca1cf12c1f6a1c7c11ab42fe5b58e6bddead8e8024e3a4d9cd7ef345ff55"
        }
    }
]

I have the exact same error. 2.4.1 works just fine, fails to pull on and after 2.5.0.

Please fix this!

Does that mean the fix is live now? Milestone 23.0 is closed 🥳

@nitzel I don’t think so… as I understand it, not yet but very soon.

I just tried it with the official v23.0 release and it’s not fixed. The fix is live for docker/cli v23.0 but docker/compose v2.15.1 references an older version of docker/cli (v20.10.20). However, docker/compose v2.16 which was just released about 6 hours ago bumped that to v23. I guess depending on your installation method (Docker Desktop, OS package manager, image on Docker Hub) it might take more or less time for the new version to be distributed. I think Docker Hub is pretty fast to update but Docker Desktop might take a bit more time?

On a whim, I simplified my docker-compose to just a single container, and it worked. So, is this SSH error a red herring?

@jcandan v2.15.1 still has the bug so this is normal. 2 or more containers cause the issue, but 1 container is fine.

I’m also seeing this on Docker Compose version v2.12.2; output of docker version:

Client: Docker Engine - Community
 Version:           20.10.21
 API version:       1.41
 Go version:        go1.18.7
 Git commit:        baeda1f
 Built:             Tue Oct 25 18:01:58 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.21
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       3056208
  Built:            Tue Oct 25 17:59:49 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.10
  GitCommit:        770bd0108c32f3fb5c73ae1264f7e503fe7b2661
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

I was experiencing this problem in GitHub Actions, so I changed my local setup to reproduce it. I wasn’t experiencing it with docker-compose, but am with docker compose. The MaxStartups change did not make any difference for me.

Current workaround is to do:

docker compose pull my-image-1
docker compose pull my-image-2

Pulling images individually has also fixed my GitHub Actions problem. It’s good there is a workaround, but it’s not particularly maintainable.

I got around this by locking the docker compose version to v2.3.3. Every other suggestion here failed. Using GitHub Actions, this is what the job looks like.

I can confirm this workaround works on archlinux, for anyone needing to downgrade docker-compose package:

wget https://archive.archlinux.org/packages/d/docker-compose/docker-compose-2.3.3-1-x86_64.pkg.tar.zst
sudo pacman -U ./docker-compose-2.3.3-1-x86_64.pkg.tar.zst

@pdaig’s minimal example works as expected with the client on macOS.

Engine: 20.10.17, Compose: v2.10.2

But it doesn’t work with the client in a docker image, even on the same macOS host or in GitLab CI.

Update:

docker compose pull docker
docker compose pull alpine

Pulling multiple images one by one works without any issue.

I ran into these errors ~75% of the time when doing a docker-compose ps over ssh from an M1 Mac to an Ubuntu machine. They also happened during other docker-compose commands over ssh, and using a context or DOCKER_HOST made no difference.

Thanks to this comment, I finally managed to get rid of them completely, by putting

MaxStartups 500

in /etc/ssh/sshd_config on the Ubuntu target (followed by a sudo service ssh restart).

Weirdly enough, the same setup works every time from my local MacBook via ssh to a remote Linux server.

Mac: Docker Compose version v2.7.0
Remote workstation (Arch Linux): Docker Compose version 2.9.0

But from Linux to Linux server, it works only in 1 of 10 attempts.

This continues on 2.8.0

Also have this issue with 2.7.0

I suddenly run into the same issue in GitLab pipelines. The pipeline looks like this:

accept-deploy:
  image: docker:latest
  stage: accept
  variables:
    DOCKER_HOST: ssh://gitlab@example.com
  before_script:
    - apk update && apk add docker docker-compose openssh-client
    - ...
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_JOB_TOKEN" "$CI_REGISTRY"
  script:
    - docker-compose build --pull
    - docker-compose pull
    - docker-compose up -d

And the error message:

$ docker-compose pull
time="2022-06-13T08:42:05Z" level=warning msg="network public: network.external.name is deprecated in favor of network.name"
nginx Skipped 
uwsgi Pulling 
postgres Pulling 
time="2022-06-13T08:42:05Z" level=warning msg="commandConn.CloseWrite: commandconn: failed to wait: signal: killed"
postgres Error 
uwsgi Error 
error during connect: Post "http://docker.example.com/v1.41/images/create?fromImage=postgres&tag=13-alpine": command [ssh -l gitlab -- example.com docker system dial-stdio] has exited with signal: killed, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=

This pipeline used to work just fine about a month ago. The Docker Compose version on Alpine Linux a month ago was 1.29.2-r1, but now it’s 1.29.2-r2.

I’m also hitting this issue, it happens for both ssh and tcp contexts.