compose: docker-compose hangs

Description of the issue

In a new server installation (a VPS provided by OVH) running Ubuntu 18.x, Docker itself works flawlessly and fast, but docker-compose sometimes hangs indefinitely for no apparent reason. This happens with every command, from version to ps, up, logs and so on. I tried everything related to DNS and name resolution, even disabling systemd-resolved. When it works it is fast. When it hangs I can see the /tmp/_Mxyz123123 directory being created. Running with --verbose hangs before producing any output, and then completes flawlessly.
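
For anyone trying to observe the same thing, a minimal way to see the delay is simply to time a command; the exact invocation below is my own, not from the original report, and per the description any subcommand reproduces it:

$ time docker-compose --verbose version

When the problem occurs there is no output at all for a minute or more, and then the command completes normally.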

Context information (for bug reports)

Output of docker-compose version

docker-compose version 1.24.0, build 0aa59064
docker-py version: 3.7.2
CPython version: 3.6.8
OpenSSL version: OpenSSL 1.1.0j  20 Nov 2018

Output of docker version

Client:
 Version:           18.09.5
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        e8ff056
 Built:             Thu Apr 11 04:43:57 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.5
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       e8ff056
  Built:            Thu Apr 11 04:10:53 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker-compose config (Make sure to add the relevant -f and other flags)

... not related to a specific docker-compose file

Steps to reproduce the issue

  1. docker-compose <command>

Observed result

Sometimes hangs

Expected result

Run fast.

Stacktrace / full error message

No stacktrace, but I managed to strace the behavior.
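
(For reference: a trace with per-syscall durations and file-descriptor paths like the one below can be captured with something along these lines; the exact flags used for the original trace are not quoted in the report, so treat this as an assumption.)

$ strace -T -y -o compose.trace docker-compose version

-T prints the time spent in each syscall (the <0.000038> suffixes) and -y annotates file descriptors with the paths they refer to.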

[...]
openat(AT_FDCWD, "/tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile> <0.000038>
fstat(3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile>, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 <0.000020>
write(3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile>, "# Generated automatically from M"..., 69632) = 69632 <0.000130>
fchmod(3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile>, 0700) = 0 <0.000050>
write(3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile>, "type.o   -o Modules/xxsubtype$(E"..., 43) = 43 <0.000029>
close(3</tmp/_MEIsYFQ2P/lib/python3.6/config-3.6m-x86_64-linux-gnu/Makefile>) = 0 <0.000021>
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f2088645a10) = 3110 <0.000276>
rt_sigaction(SIGINT, {sa_handler=0x404840, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0 <0.000051>
rt_sigaction(SIGKILL, {sa_handler=0x404840, sa_mask=[KILL], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, 0x7ffe82f7d900, 8) = -1 EINVAL (Invalid argument) <0.000031>
rt_sigaction(SIGTERM, {sa_handler=0x404840, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0 <0.000028>
wait4(-1, 

[... here it hangs and then after 114s it continues output ...]

wait4(-1, docker-compose version 1.24.0, build 0aa59064
docker-py version: 3.7.2
CPython version: 3.6.8
OpenSSL version: OpenSSL 1.1.0j  20 Nov 2018
[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 3110 <114.945565>
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3110, si_uid=0, si_status=0, si_utime=43, si_stime=6} ---
rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, {sa_handler=0x404840, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, 8) = 0 <0.000094>
rt_sigaction(SIGKILL, {sa_handler=SIG_DFL, sa_mask=[KILL], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f2087c55f20}, 0x7ffe82f7d900, 8) = -1 EINVAL (Invalid argument) <0.000150>

[...]

Additional information

Linux vps[omit] 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:28:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux Ubuntu 18.04.2 LTS

docker-compose installed with curl

I have a lot of other docker-compose environments working and this is the first time I have seen this behavior. Any idea what to check or where to look?

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 41
  • Comments: 51

Most upvoted comments

Same behavior for me as well. I’m running on a Digital Ocean droplet running Ubuntu 18.04.2 LTS with 4.15.0-48-generic x86_64 GNU/Linux kernel. Installed via curl today.

I want to also confirm that installing haveged fixed our problem.

Using a DigitalOcean Droplet, with:

  • Ubuntu 18 LTS
  • docker 19.03.1
  • docker-compose 1.26.2
  • Python-based project, which probably depends heavily on entropy (SSL, etc.)

We used this article to install haveged https://www.digitalocean.com/community/tutorials/how-to-setup-additional-entropy-for-cloud-servers-using-haveged

I hope this helps anyone just in case.

No problem, it makes sense as it’s a new VPS with very little use. And installing haveged fixes the problem.

Not long before the /dev/random dance I described I see references to cryptography.hazmat.bindings._openssl.so in the traces.

With that knowledge this comment seems awfully telling: https://github.com/pyca/cryptography/blob/master/src/_cffi_src/openssl/src/osrandom_engine.c#L123...L127

Which in turn seems to derive from: https://github.com/pyca/cryptography/issues/4645 https://github.com/pyca/cryptography/pull/4656

So I feel pretty sure in declaring that this isn’t a compose issue as such, but some combination of Python’s behaviour and lack of entropy on some systems. Installing haveged or some other entropy daemon or otherwise ensuring you have sufficient entropy seems like the right answer.

I’m going to close since this does not appear to be a bug in compose, thanks for your help in getting to the bottom of the issue.

FWIW https://lwn.net/Articles/693189/ is an interesting article on a related subject (I’m not sure whether or not this issue is the one being described there, but the symptoms are certainly similar).

Why should “docker-compose --help” require system entropy?

So your process tree is:

29714 `docker-compose --version`
`-- 29717 self exec of `docker-compose --version`
    `-- 29718 exec `/sbin/ldconfig` → exec `/sbin/ldconfig.real`
    `-- 29719 exec `/sbin/ldconfig` → exec `/sbin/ldconfig.real`
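
(The dc.<pid> files referenced below come from strace's per-process output mode; the exact invocation isn't quoted in the thread, so the command below is an assumption based on the file names and timestamps in the excerpts.)

$ strace -ff -tt -o dc docker-compose --version

-ff writes one trace file per process (dc.29714, dc.29717, ...) and -tt adds the wall-clock timestamps seen below.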

The two processes 29718 and 29719 each execute fully in under 1s; e.g. the first and last lines of their traces are:

6678-will-emmerson$ sed -n -e '1p;$p' dc.29718
10:33:59.478903 set_robust_list(0x7fa0b3052a20, 24) = 0
10:33:59.483282 +++ exited with 0 +++
6678-will-emmerson$ sed -n -e '1p;$p' dc.29719
10:33:59.597443 set_robust_list(0x7fa0b3052a20, 24) = 0
10:33:59.602154 +++ exited with 0 +++

Which I think rules out something slowing ldconfig down, at least in your case.

It’s the two main processes which take some time:

6678-will-emmerson$ sed -n -e '1p;$p' dc.29714
10:33:58.821291 execve("/usr/local/bin/docker-compose", ["docker-compose", "--version"], 0x7ffea9630aa8 /* 23 vars */) = 0
10:34:49.283590 +++ exited with 0 +++
6678-will-emmerson$ sed -n -e '1p;$p' dc.29717
10:33:59.145695 execve("/usr/local/bin/docker-compose", ["docker-compose", "--version"], 0xa6d630 /* 25 vars */) = 0
10:34:49.268734 +++ exited with 0 +++

In dc.29714 (the parent) there is a gap of ~48s after 10:34:01.454543:

10:33:59.145224 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fd813295a10) = 29717
10:33:59.145455 rt_sigaction(SIGINT, {sa_handler=0x404840, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
10:33:59.145495 rt_sigaction(SIGKILL, {sa_handler=0x404840, sa_mask=[KILL], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, 0x7ffd2ab7dcd0, 8) = -1 EINVAL (Invalid argument)
10:33:59.145521 rt_sigaction(SIGTERM, {sa_handler=0x404840, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
10:33:59.145545 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.163587 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.163639 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.279720 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.279774 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.305847 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.305904 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.333863 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.333915 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.396106 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.396157 wait4(-1, 0x7ffd2ab7dedc, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
10:34:01.454491 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.454543 wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 29717
10:34:49.268835 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=29717, si_uid=0, si_status=0, si_utime=44, si_stime=6} ---
10:34:49.268861 rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, {sa_handler=0x404840, sa_mask=[INT], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, 8) = 0
10:34:49.268891 rt_sigaction(SIGKILL, {sa_handler=SIG_DFL, sa_mask=[KILL], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, 0x7ffd2ab7dcd0, 8) = -1 EINVAL (Invalid argument)
10:34:49.268916 rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, {sa_handler=0x404840, sa_mask=[TERM], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fd8128a5f20}, 8) = 0

Which seems to be the parent waiting for the child.

In the child (dc.29717) the gap is after 10:34:01.454593:

10:33:59.629565 openat(AT_FDCWD, "/dev/random", O_RDONLY) = 3
10:33:59.629613 poll([{fd=3, events=POLLIN}], 1, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.163675 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.163690 restart_syscall(<... resuming interrupted poll ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.279809 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.279823 restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.305939 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.305952 restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.333951 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.333965 restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.396192 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.396205 restart_syscall(<... resuming interrupted restart_syscall ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
10:34:01.454579 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
10:34:01.454593 restart_syscall(<... resuming interrupted restart_syscall ...>) = 1
10:34:49.023121 close(3)                = 0

Aha! So the process is blocked in a poll syscall (poll([{fd=3, events=POLLIN}], 1, -1)) where the only passed fd is fd 3, which is a recently opened fd onto /dev/random (openat(AT_FDCWD, "/dev/random", O_RDONLY) = 3).

The /dev/random device is well known for blocking when there is insufficient entropy on the system, and VPSs/cloud VMs are notorious for having a general shortage of entropy.
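
(As a quick check, not part of the original analysis: the kernel exposes its entropy estimate directly, and reading it while compose is hanging tells you whether this is what you are hitting.)

$ cat /proc/sys/kernel/random/entropy_avail

Single- or low double-digit values point at entropy starvation; a commenter below reports the value jumping from 6 to ~2500 after installing haveged.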

There is no use of /dev/random in compose (outside of a single unrelated test case), nor is there in docker-py, so this must be in one of our dependencies.

Although I cannot reproduce the hang I can see the use of /dev/random in my own strace (my system must have plenty of entropy I guess).

The maddest bit is that after blocking waiting for /dev/random to become readable it immediately closes it and opens /dev/urandom (the non-blocking one) instead and reads from that.
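
(To spot the same /dev/random-then-/dev/urandom sequence on your own system, a filtered strace works; this exact invocation is mine, not from the thread.)

$ strace -f -y -T -e trace=openat,poll,read,close docker-compose version 2>&1 | grep random

On an entropy-starved machine the poll on /dev/random is where the time goes; the read that follows is from /dev/urandom.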

Well, thank you for those logs, they have at least given me a string to pull on and see where it takes me…

This happens to me as well, running on a Google Compute Engine instance with the same docker versions as above. Also Ubuntu 18.04. Even docker-compose --help does this, so no docker-compose.yaml is involved.

Yeah! I had entropy_avail at 6; after installing haveged it increased to ~2500 and now docker-compose commands run pretty fast.

Very strange issue. @ijc, thank you for your investigation; this issue was so unexpected, yet interesting to get to the bottom of.

Installing haveged fixed this problem for me.

CentOS 7, Docker 19.03.5, docker-compose 1.25.0

This isn’t really fixed, you guys are just giving workarounds and patches… like the issue is still happening even on the latest version.

I don’t mean anything bad by my post, but there’s a difference between a closed issue and one that is left hanging because the proper solution hasn’t been found yet.

Even if the author is satisfied, it doesn’t mean that we should ignore this issue, no?

Don’t get me wrong.

FWIW, installing jitterentropy-rngd on a small server running Ubuntu 20.04 (via apt install jitterentropy-rngd) resolved this problem for me.

Now cat /proc/sys/kernel/random/entropy_avail shows a reasonable amount of entropy available (even after a fresh boot) and the hangs from docker-compose --help or docker-compose --version are significantly reduced.

@sebastialonso As a recap on this issue: the problem is that docker-compose requires system entropy. Sometimes the OS does not provide enough “natural entropy”, for a number of reasons. In those cases you need to “fix the OS” by generating more entropy, and that’s what haveged is built for. Without it, docker-compose waits until enough entropy is available, and how long that takes depends on what the OS is doing in the background.
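
(For completeness, this is the fix most people in this thread applied; package and service names below are for Ubuntu/Debian, as in the DigitalOcean tutorial linked earlier.)

$ sudo apt-get install haveged
$ systemctl status haveged                       # should be active (running)
$ cat /proc/sys/kernel/random/entropy_avail      # should now report a few thousand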

@xcash I’m running on Docker version 18.09.6, build 481bc77 and it’s still happening.

Ubuntu 18.04.2 LTS x64 with everything up to date

As a workaround: run any docker-compose command once and hit Enter; after that, issue your desired command.

I am still having the same problem, with everything upgraded, but can confirm that running docker-compose on its own first does seem to fix it.

I don’t understand the conclusion of this issue. At the end of it all, what can I do to keep my docker-compose commands from hanging?

@sebastialonso After reading this thread I solved it by installing haveged (https://github.com/docker/compose/issues/6678#issuecomment-494398740) on my VPS to get more entropy.

Here is another link on the same subject: https://www.digitalocean.com/community/tutorials/how-to-setup-additional-entropy-for-cloud-servers-using-haveged

@ijc congrats on the troubleshooting, Ian! 😃 It makes sense even in my setup, as the problem disappeared after that VPS went into production and thus started generating entropy.

While this is a system issue, should docker compose do some reasonable check of available entropy, and issue a warning at least?

As is, running a compose command locks the system, so even though it’s not a bug in compose, trying to use compose is a killjoy.

If this was a problem for you please upvote https://github.com/docker/compose/issues/6931 so we can get better docs, which would have avoided lots of wasted time chasing down this problem.

Hi @M034B5

This isn’t really fixed, you guys are just giving workarounds and patches… like the issue is still happening even on the latest version.

This is a typical “won’t fix” case since this is not an issue in Docker or Compose; it’s a general system entropy issue. If software needs entropy and the system is not providing enough, that’s an OS issue, and everything you do to bypass it is a general workaround.

I don’t mean anything bad by my post, but there’s a difference between a closed issue and one that is left hanging because the proper solution hasn’t been found yet.

Indeed. But here the proper solution is having a good amount of entropy provided by the operating system. 😃 You are looking for the “proper solution” in the wrong place.

Even if the author is satisfied, it doesn’t mean that we should ignore this issue, no? Don’t get me wrong.

Don’t get me wrong either; I’m not the author. I’m only providing what I think is the right point of view on this issue and why it’s closed. 😃

I’m not sure I want to upload a bunch of logs to a public place when I don’t really know what’s in them.

Same problem for me, Linode with Ubuntu 18.04.2. Going back to version 1.23.0 fixed it, although I had to recreate the containers.