moby: Docker hangs after intensive run/remove operations

Problem description: We create a container (https://github.com/prasmussen/glot-containers) to compile and execute a source file, read back the output, and then remove the container. Normally each container exits within seconds; we kill it if it has not finished after 30 seconds. Three threads repeat this whole cycle in parallel. After thousands of these normal operations, Docker hangs: a process called ‘exe’ consumes 100% of one CPU core, and I have to reboot the machine to get the Docker daemon working again.
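
For reference, a rough sketch of this workload as a shell script; the glot image and runner command are the ones shown in the output below, but the timeout/cleanup logic here is illustrative, not our exact code:

#!/bin/bash
# One cycle: start the runner container, wait at most 30s, read the output, remove it.
run_once() {
    cid=$(docker run -d glot/clang:latest /home/glot/runner)
    timeout 30 docker wait "$cid" > /dev/null || docker kill "$cid"   # kill if not done in 30s
    docker logs "$cid"                                                # read out the output
    docker rm -f "$cid"                                               # remove the container
}
# Three workers repeating the cycle in parallel
for w in 1 2 3; do
    ( while true; do run_once; done ) &
done
wait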

uname -a
Linux xuyun-workpc 4.2.0-25-generic #30-Ubuntu SMP Mon Jan 18 12:31:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
docker version
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:20:08 UTC 2015
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5
 Built:        Fri Nov 20 13:20:08 UTC 2015
 OS/Arch:      linux/amd64
docker info
Containers: 4
Images: 131
Server Version: 1.9.1
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.0-25-generic
Operating System: Ubuntu 15.10
CPUs: 4
Total Memory: 7.672 GiB
Name: xuyun-workpc
ID: OI5K:OD3L:U44C:4ZWM:N3LZ:BARZ:KDEQ:5DIY:LLBE:UKHP:VBEN:AY6L
WARNING: No swap limit support

After rebooting the machine, I found some containers left behind:

docker ps -a
CONTAINER ID        IMAGE               COMMAND               CREATED             STATUS                      PORTS               NAMES
f72001c02ff1        glot/clang:latest   "/home/glot/runner"   15 hours ago        Created                                         amazing_engelbart
7806233fbb4f        glot/java:latest    "/home/glot/runner"   15 hours ago        Exited (137) 14 hours ago                       amazing_tesla
59694504b3e8        glot/java:latest    "/home/glot/runner"   15 hours ago        Dead                                            elated_galileo
64e63df44436        glot/php:latest     "/home/glot/runner"   15 hours ago        Exited (137) 14 hours ago                       thirsty_knuth

And container 596 gives a clue pointing at the overlay driver:

docker logs 596
Error response from daemon: open /var/lib/docker/overlay/59694504b3e88365c9323237c4418c51350941e7498de05584cc37d69f139d09/lower-id: no such file or directory
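
For what it's worth, here is a quick sketch for spotting how widespread this is and for cleaning up afterwards. It assumes the overlay (v1) on-disk layout, where non-base layer directories carry a lower-id file alongside upper/ and work/ (base image layers only have root/, so those are skipped by the upper/ check):

# Layer directories that have an upper/ dir but no lower-id file
for d in /var/lib/docker/overlay/*/; do
    [ -d "${d}upper" ] && [ ! -f "${d}lower-id" ] && echo "missing lower-id: $d"
done

# Remove the leftover Created/Dead containers once the daemon responds again
docker ps -aq -f status=created -f status=dead | xargs -r docker rm -f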

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 1
  • Comments: 37 (13 by maintainers)

Most upvoted comments

I also reproduced this with Docker 1.10.0 on Ubuntu 15.10 (overlay).

How to reproduce

Run the following concurrently in 3 terminals. (I am still not sure whether this is related to concurrency, but I could not reproduce it when running the loop 10k times in a single terminal.)

for f in $(seq 1 1000); do docker run -it --rm ubuntu echo $f; done
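
If juggling three terminals is awkward, the same load can be generated from a single script; this is just a parallelized form of the command above, with -it dropped because background jobs have no TTY:

#!/bin/bash
# Run three copies of the 1000-iteration run/--rm loop in parallel
for t in 1 2 3; do
    ( for f in $(seq 1 1000); do docker run --rm ubuntu echo "$t-$f"; done ) &
done
wait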

dmesg

[604812.196052] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [exe:4956]
[604812.196052] Modules linked in: aufs veth isofs xt_conntrack xt_addrtype br_netfilter overlay xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack 
xt_tcpudp bridge pci_stub stp llc vboxpci(OE) vboxnetadp(OE) iptable_filter vboxnetflt(OE) ip_tables x_tables vboxdrv(OE) intel_rapl ppdev xen_fbfront fb_sys_fops input_leds i2c_piix4 8250_fintek parport_pc serio_raw mac_hid parport autofs
4 btrfs xor raid6_pq crct10dif_pclmul crc32_pclmul cirrus syscopyarea sysfillrect aesni_intel sysimgblt ttm aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd drm_kms_helper drm psmouse pata_acpi floppy
[604812.196052] CPU: 0 PID: 4956 Comm: exe Tainted: G           OEL  4.2.0-27-generic #32-Ubuntu
[604812.196052] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/07/2015
[604812.196052] task: ffff880202681b80 ti: ffff88010fe74000 task.ti: ffff88010fe74000
[604812.196052] RIP: 0010:[<ffffffff813d41d8>]  [<ffffffff813d41d8>] memcpy_erms+0x8/0x10
[604812.196052] RSP: 0018:ffff88010fe77cc0  EFLAGS: 00000293
[604812.196052] RAX: ffff88020556d00c RBX: ffffffff813cf378 RCX: 0000000000000000
[604812.196052] RDX: 0000000000000001 RSI: ffffffff81aba33c RDI: ffff88020556d00d
[604812.196052] RBP: ffff88010fe77d38 R08: 000000000000000a R09: 000000000000fffa
[604812.196052] R10: ffff880111220008 R11: ffff88020556d00c R12: 0000000000000010
[604812.196052] R13: ffffffff813d0fde R14: ffffffffffffff0c R15: ffff88020556d004
[604812.196052] FS:  00007f8996454800(0000) GS:ffff880207200000(0000) knlGS:0000000000000000
[604812.196052] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[604812.196052] CR2: 000000c81ff01c00 CR3: 00000001117d2000 CR4: 00000000001406f0
[604812.196052] Stack:
[604812.196052]  ffffffff813d2387 ffff88010fe70002 0000000000000010 0000000000d7ca8c
[604812.196052]  0000000000000001 ffff88020556d000 0000000000001000 ffffffffff0a0000
[604812.196052]  0000000052cb6175 ffff88020556e000 ffff880035f00780 ffff880035f00780
[604812.196052] Call Trace:
[604812.196052]  [<ffffffff813d2387>] ? vsnprintf+0x267/0x530
[604812.196052]  [<ffffffff81221f02>] seq_vprintf+0x32/0x60
[604812.196052]  [<ffffffff81221f7c>] seq_printf+0x4c/0x70
[604812.196052]  [<ffffffff81222806>] ? seq_path_root+0x66/0xd0
[604812.196052]  [<ffffffff8123e75f>] show_mountinfo+0x6f/0x2a0
[604812.196052]  [<ffffffff8121c9cc>] ? m_next+0x2c/0x40
[604812.196052]  [<ffffffff8121c507>] m_show+0x17/0x20
[604812.196052]  [<ffffffff8122198c>] seq_read+0xec/0x390
[604812.196052]  [<ffffffff811fd2d8>] __vfs_read+0x18/0x40
[604812.196052]  [<ffffffff811fd8a7>] vfs_read+0x87/0x130
[604812.196052]  [<ffffffff811fe625>] SyS_read+0x55/0xc0
[604812.196052]  [<ffffffff817f1c72>] entry_SYSCALL_64_fastpath+0x16/0x75
[604812.196052] Code: 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3 a4 <c3> 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe 7c 

I had a similar issue to this, including the following in the kernel log:

Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [exe:15657]
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051] Modules linked in: cls_fw sch_sfq sch_htb veth xt_nat xfrm_user xfrm_algo br_netfilter bridge stp llc overlay ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_round_robin nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_conntrack iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_mark iptable_mangle ip_tables x_tables intel_rapl x86_pkg_temp_thermal coretemp dm_multipath crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache [last unloaded: scsi_transport_iscsi]
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051] CPU: 1 PID: 15657 Comm: exe Not tainted 4.4.0-31-generic #50~14.04.1-Ubuntu
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051] task: ffff88000219b200 ti: ffff880015010000 task.ti: ffff880015010000
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051] RIP: e030:[<ffffffff813e1d78>]  [<ffffffff813e1d78>] memcpy_erms+0x8/0x10
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051] RSP: e02b:ffff880015013d28  EFLAGS: 00000293
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051] RAX: ffff88001481200d RBX: ffff88001481200d RCX: 0000000000000000
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051] RDX: 0000000000000001 RSI: ffffffff81ad4cb8 RDI: ffff88001481200e
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051] RBP: ffff880015013d88 R08: 000000000000000a R09: 000000000000fffc
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051] R10: 0000000000000000 R11: ffff880015013cd8 R12: ffff880014813000
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051] R13: ffff880015013db0 R14: ffffffff81ad4cb8 R15: ffffffff81ad4cb7
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051] FS:  0000000000000000(0063) GS:ffff8801f5c80000(0000) knlGS:0000000000000000
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051] CR2: 0000000000d22db0 CR3: 00000000054fb000 CR4: 0000000000002660
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051] Stack:
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051]  ffffffff813dff61 000000000000ffff 000000000000fffc 0000000000000001
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051]  ffff880014812000 0000000000001000 ffffffffff0a0000 ffff880005eee600
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051]  ffff880005eee600 ffff880004000000 ffff8801f3e8b640 ffff880005eee600
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051] Call Trace:
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051]  [<ffffffff813dff61>] ? vsnprintf+0x251/0x510
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051]  [<ffffffff812209d8>] seq_vprintf+0x38/0x50
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051]  [<ffffffff81220a33>] seq_printf+0x43/0x50
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051]  [<ffffffff8123dc02>] show_mountinfo+0x62/0x270
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051]  [<ffffffff8121b5f7>] m_show+0x17/0x20
Jan 11 14:48:44 10.155.73.219 kernel: [4350276.112051]  [<ffffffff81220675>] seq_read+0xe5/0x350

However, I was not yet able to reproduce it on this kernel using the for loop mentioned above, even after worsening the load by adding -d and starting 3000 containers concurrently.
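
For context, the detached variant looked roughly like this (on the Docker versions in this thread -d and --rm could not be combined, so the exited containers are removed in a second pass):

for f in $(seq 1 3000); do docker run -d ubuntu echo "$f"; done
docker ps -aq -f status=exited | xargs -r docker rm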

In the meantime I recommend setting the kernel to panic on soft lockup, and to reboot on panic, so that your systems can auto-recover to a certain degree. If you started your containers with --restart=always they should come back up automatically.

Here is how you enable the flags:

sysctl kernel.softlockup_panic=1   # panic on soft lockup
sysctl kernel.panic=60             # reboot the system 60 seconds after a panic
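
To keep these settings across reboots, the same values can go into a sysctl configuration file; the file name below is just a convention, any name under /etc/sysctl.d/ works:

cat <<'EOF' | sudo tee /etc/sysctl.d/90-softlockup.conf
kernel.softlockup_panic = 1
kernel.panic = 60
EOF
sudo sysctl --system   # reload all sysctl configuration files
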

I’ve created a repo that reproduces this: https://github.com/CVTJNII/docker_lockup_test I had to wind up the concurrency from @AkihiroSuda’s numbers to get it to reproduce consistently, but it is consistent. Note that when the VirtualBox VM goes down it goes down hard. I recommend pre-instrumenting as after it hangs often the only thing to do is a hard reset. This is slightly different then the production failures where the host is usually still up but highly crippled, I’m chalking it up to the VirtualBox VM being smaller and loaded heavier to reduce the time to failure.