compose: Error during pull
This error occurs during docker compose pull with the --context argument.
Steps to reproduce the issue:
- docker --context mycontext compose -f docker-compose.letsencrypt.yml pull
- error during connect: Get "http://docker.example.com/v1.41/images/mongo-express/json": command [ssh -l root -- pharma-bio.hr docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=kex_exchange_identification: Connection closed by remote host
Describe the results you received: error during connect: Get "http://docker.example.com/v1.41/images/mongo-express/json": command [ssh -l root -- pharma-bio.hr docker system dial-stdio] has exited with exit status 255, please make sure the URL is valid, and Docker 18.09 or later is installed on the remote host: stderr=kex_exchange_identification: Connection closed by remote host
Describe the results you expected: Normal pull of all images, both from my Docker Hub and public ones
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker compose version:
Docker Compose version v2.4.1
Output of docker info:
Client:
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc., v0.8.2)
compose: Docker Compose (Docker Inc., v2.4.1)
sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc., 0.6.0)
scan: Docker Scan (Docker Inc., v0.17.0)
Server:
Containers: 22
Running: 13
Paused: 0
Stopped: 9
Images: 111
Server Version: 20.10.14
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 3df54a852345ae127d1fa3092b95168e4a88e2f8
runc version: v1.0.3-0-gf46b6ba
init version: de40ad0
Security Options:
seccomp
Profile: default
Kernel Version: 5.10.60.1-microsoft-standard-WSL2
Operating System: Docker Desktop
OSType: linux
Architecture: x86_64
CPUs: 20
Total Memory: 15.52GiB
Name: docker-desktop
ID: MGP6:OKSX:HXN5:OL2J:CK7W:OJQ2:HTGL:COC7:OFZY:DYIQ:HZNR:UB35
Docker Root Dir: /var/lib/docker
Debug Mode: false
HTTP Proxy: http.docker.internal:3128
HTTPS Proxy: http.docker.internal:3128
No Proxy: hubproxy.docker.internal
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
hubproxy.docker.internal:5000
127.0.0.0/8
Live Restore Enabled: false
WARNING: No blkio throttle.read_bps_device support
WARNING: No blkio throttle.write_bps_device support
WARNING: No blkio throttle.read_iops_device support
WARNING: No blkio throttle.write_iops_device support
Additional environment details: I'm using Ubuntu 20.04 on WSL on the client side, and deploying to Ubuntu Server 20.04 (both sides fully patched).
About this issue
- State: closed
- Created 2 years ago
- Reactions: 9
- Comments: 57 (5 by maintainers)
Commits related to this issue
- Use docker compose v2 for pulling images docker/compose#9448 — committed to evgfilim1/userbot by evgfilim1 a year ago
I also have the issue. It’s intermittent but I’m able to reproduce it about 9/10 times.
After some (lots) of trial and error and poking around, I think I've found the cause and some possible solutions. I tried the
MaxStartups 500
fix above, but it didn't solve the issue for me (in fact, the problem seems to be on the client, see below).
TLDR: it's some kind of race condition with the ssh process spawned on the client.
Minimal repro steps
Environment (client): 5:20.10.18~3-0~ubuntu-focal
As the error message points out, a command is "killed" while doing the pull. By tracing the docker daemon logs (journalctl -fxu docker.service on the remote machine) and comparing a successful attempt with a failed attempt, I found out that some of the requests sent by the client were never received by the daemon. Since there were no signs of a process being killed or any other sign of an issue on the remote, it had to be on the client. But what could possibly kill the ssh client? I used auditctl to find out. The answer is a bit surprising: it's the docker-compose binary itself (which is the parent process)! Digging further:
As some previous comments suggested, commit 625a48d indeed triggers the issue, specifically lines 58-62; commenting out these lines seems to completely fix it. I print-debugged my way to understanding what exactly was going on. Here's what I think is happening (disclaimer: I don't have much experience in Go, in fact I'd never used it before, so my comprehension might be flawed; please correct me if I'm wrong):
- getLocalImagesDigests in turn calls getImages: the for loop starts many goroutines in parallel, which are then awaited before returning.
- ImageInspectWithRaw ends up spawning a new ssh connection. The http request is piped through ssh to docker system dial-stdio on the remote. In the docker cli (commandconn.go), the command is passed the shared Context from the errgroup in the caller (this will be important later).
- Once all the goroutines have finished, eg.Wait() unblocks and the shared Context (linked to each ssh command) becomes Done(), which is what ends up killing the ssh processes.
- The http client still keeps the net.Conn open (in this case, the ssh commandConn) - maybe even reuses it (?). It still tries to read stdout (line numbers in the stack trace won't match for commandconn.go due to some logs I sprinkled around).
- That read ends up in onEOF, where we wait for the process to exit normally, but it fails because the process was previously killed. On line 165, the observed error message is returned to the caller, the whole errgroup fails, and everything is aborted.
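A minimal sketch of that race (illustrative stand-in code, not the actual docker/compose sources): the Context derived by errgroup.WithContext is cancelled as soon as eg.Wait() returns, even when every goroutine succeeded, so any process started with that Context gets killed right after the group finishes.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"

	"golang.org/x/sync/errgroup"
)

func main() {
	eg, ctx := errgroup.WithContext(context.Background())

	for i := 0; i < 3; i++ {
		eg.Go(func() error {
			// Stand-in for the ssh process spawned per connection; it is
			// bound to the errgroup's shared ctx.
			cmd := exec.CommandContext(ctx, "sleep", "10")
			if err := cmd.Start(); err != nil {
				return err
			}
			// The "request" finishes fine, but the process is left running
			// so the connection could in principle be reused later.
			return nil
		})
	}

	// Every goroutine returned nil, yet Wait() cancels the derived ctx...
	if err := eg.Wait(); err != nil {
		fmt.Println("unexpected error:", err)
	}
	fmt.Println("ctx.Err() after Wait():", ctx.Err()) // context canceled

	// ...so the still-running processes are killed shortly afterwards, which
	// is what the pull then trips over when it keeps reading from them.
	time.Sleep(500 * time.Millisecond)
}
```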
Solutions
Passing the Context to the command is what causes the process to be killed. However, I think it's legitimate to do so in case there's an actual error: all requests run in parallel and all of them need to succeed in order to continue, so if one of them fails we want to kill/cancel the others to avoid unnecessary waiting.
Ideally, errgroup.Wait() would not set the Context to done when there is no error, or there would be some way to know whether the Context is done because of an error or because everything completed successfully. Then, with a small change, we could kill the command only in the case of an error. In theory this would allow connection reuse (so potentially faster): the ssh process would keep running after the request/response completes, and subsequent requests could then be run through it (docker system dial-stdio seems to support multiple requests on the same session). IMO, this would be the ideal solution if doable (preferable to solution #4 below).
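Purely to illustrate that idea (this is not docker/compose or docker/cli code, and doRequest is a made-up stand-in), cancellation can be managed by hand so the shared Context only becomes done on an actual error:

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// doRequest stands in for one inspect call made over a shared, ssh-backed
// connection that we would like to keep alive on success.
func doRequest(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	default:
		return nil
	}
}

func main() {
	// Manual cancellation instead of errgroup.WithContext: ctx is only
	// cancelled when a request actually fails.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // final cleanup once we are completely done

	var eg errgroup.Group // plain Group: Wait() does not cancel anything
	for i := 0; i < 3; i++ {
		eg.Go(func() error {
			if err := doRequest(ctx); err != nil {
				cancel() // abort the sibling requests only on a real error
				return err
			}
			return nil
		})
	}

	if err := eg.Wait(); err != nil {
		fmt.Println("failed:", err)
		return
	}
	// Success: ctx is still alive, so a connection tied to it could now be
	// reused for further requests instead of being killed.
	fmt.Println("all requests done, ctx.Err() =", ctx.Err()) // <nil>
}
```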
Properly close the net.Conn before the goroutine returns - not just finish reading the response. I found that calling CloseIdleConnections properly closes the commandConn. Just before returning from ImageInspectWithRaw, I added cli.client.CloseIdleConnections() and it completely fixed the issue 🎉 However, since this could theoretically also happen in other functions, there is probably a better place to put this. Maybe somewhere in request.go? Since this needs to happen after the body is completely read, there would need to be a way to detect that.
I know this is a lot of things to digest for what looks like a complicated issue, but could someone confirm some of this and chime in on the possible solutions? Help is welcome to get a PR started 😃
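As a rough sketch of that last workaround (generic net/http code, not the actual docker/cli change; inspectImage is a made-up stand-in for ImageInspectWithRaw):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// inspectImage performs one request over an http.Client whose transport
// could just as well be dialling through ssh/commandconn.
func inspectImage(cli *http.Client, url string) ([]byte, error) {
	resp, err := cli.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body) // once fully read, the conn goes idle
	if err != nil {
		return nil, err
	}

	// The workaround: explicitly drop the idle connection so the underlying
	// net.Conn (and the ssh process behind it) is closed cleanly, instead of
	// being killed later by a cancelled Context.
	cli.CloseIdleConnections()
	return body, nil
}

func main() {
	body, err := inspectImage(http.DefaultClient, "http://example.com/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("read %d bytes; idle connections closed\n", len(body))
}
```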
One of the reasons I dedicated this much time to investigating the issue is that our CI pipeline relies on this behavior for deployments. Having it fail most of the time is not ideal, to say the least 😉
Closing as "fixed by https://github.com/docker/cli/pull/3900". Thanks @pdaig 👍
I found a solution. I’ve opened a PR in docker/cli#3900.
Update: docker:23 has been updated on Docker Hub with compose v2.16. I can happily confirm the issue is fixed! 🥳
Docker Desktop and the packages for Ubuntu/Debian/other OSes should follow as well, but I don't have the update schedule.
Doesn't seem to be. I have Docker version 23.0.0 with docker-compose v2.15.1, and I am getting that same error when I run docker-compose -H "ssh://$SERVER_USER@$SERVER_IP" pull from my GitLab pipeline.
Note: just prior to this, I am running docker-compose -H "ssh://$SERVER_USER@$SERVER_IP" down --remove-orphans, but I am not getting the error then.
@glours @zachberger
Here is a minimal example demonstrating the problem. It seems to be related to the custom repositories.
This does not work:
But commenting out either one of the services results in a successful pull.
I have the exact same error. 2.4.1 works just fine; it fails to pull on 2.5.0 and later.
Please fix this!
@nitzel I don’t think so… as I understand it, not yet but very soon.
I just tried it with the official v23.0 release and it’s not fixed. The fix is live for docker/cli v23.0 but docker/compose v2.15.1 references an older version of docker/cli (v20.10.20). However, docker/compose v2.16 which was just released about 6 hours ago bumped that to v23. I guess depending on your installation method (Docker Desktop, OS package manager, image on Docker Hub) it might take more or less time for the new version to be distributed. I think Docker Hub is pretty fast to update but Docker Desktop might take a bit more time?
@jcandan v2.15.1 still has the bug so this is normal. 2 or more containers cause the issue, but 1 container is fine.
I’m also seeing this on:
Docker Compose version v2.12.2
with docker version:
Was experiencing this problem in GitHub Actions, so I changed my local setup to reproduce. I wasn't experiencing it on
docker-compose
, but am experiencing it on docker compose
. The MaxStartups change did not make any difference for me.
Current workaround is to do:
Pulling images individually has also fixed my GitHub Actions problem. It's good there is a workaround, but it's not particularly maintainable.
I can confirm this workaround works on Arch Linux, for anyone needing to downgrade the docker-compose package:
@pdaig's minimal example works as expected with the client on macOS.
Engine: 20.10.17 Compose: v2.10.2
But it doesn't work with the client in a docker image, even on the same macOS host or in GitLab CI.
UPD:
Pulling multiple images one by one works without any issue.
I ran into these errors ~75% of the time when doing a docker-compose ps over ssh from an M1 Mac to an Ubuntu machine. They also happened during other docker-compose commands over ssh, and using a context or
DOCKER_HOST
made no difference.
Thanks to this comment, I finally managed to get rid of them completely by putting the MaxStartups setting from that comment
in
/etc/ssh/sshd_config
on the Ubuntu target (followed by a sudo service ssh restart
).
Weirdly enough, the same setup works every time from my local MacBook via ssh to a remote Linux server.
Mac:
Docker Compose version v2.7.0
Remote Workstation (Arch Linux): Docker Compose version 2.9.0
But from Linux to Linux server, it works only in 1 of 10 attempts.
This continues on 2.8.0
Also have this issue with 2.7.0
I suddenly ran into the same issue in GitLab pipelines. The pipeline looks like this:
And the error message:
This pipeline used to work just fine about a month ago. The Docker Compose version on Alpine Linux a month ago was
1.29.2-r1
, but now it's 1.29.2-r2.
I'm also hitting this issue; it happens for both
ssh
and tcp
contexts.