moby: kernel crash after "unregister_netdevice: waiting for lo to become free. Usage count = 3"

This happens when I log in to the container and can’t quit with Ctrl-C.

My system is Ubuntu 12.04, with kernel 3.8.0-25-generic.

docker version:

root@wutq-docker:~# docker version
Client version: 0.10.0
Client API version: 1.10
Go version (client): go1.2.1
Git commit (client): dc9c28f
Server version: 0.10.0
Server API version: 1.10
Git commit (server): dc9c28f
Go version (server): go1.2.1
Last stable version: 0.10.0

I have used the script https://raw.githubusercontent.com/dotcloud/docker/master/contrib/check-config.sh to check my configuration, and everything looks all right.

I watched the syslog and found these messages:

May  6 11:30:33 wutq-docker kernel: [62365.889369] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:30:44 wutq-docker kernel: [62376.108277] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:30:54 wutq-docker kernel: [62386.327156] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:02 wutq-docker kernel: [62394.423920] INFO: task docker:1024 blocked for more than 120 seconds.
May  6 11:31:02 wutq-docker kernel: [62394.424175] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  6 11:31:02 wutq-docker kernel: [62394.424505] docker          D 0000000000000001     0  1024      1 0x00000004
May  6 11:31:02 wutq-docker kernel: [62394.424511]  ffff880077793cb0 0000000000000082 ffffffffffffff04 ffffffff816df509
May  6 11:31:02 wutq-docker kernel: [62394.424517]  ffff880077793fd8 ffff880077793fd8 ffff880077793fd8 0000000000013f40
May  6 11:31:02 wutq-docker kernel: [62394.424521]  ffff88007c461740 ffff880076b1dd00 000080d081f06880 ffffffff81cbbda0
May  6 11:31:02 wutq-docker kernel: [62394.424526] Call Trace:                                                         
May  6 11:31:02 wutq-docker kernel: [62394.424668]  [<ffffffff816df509>] ? __slab_alloc+0x28a/0x2b2
May  6 11:31:02 wutq-docker kernel: [62394.424700]  [<ffffffff816f1849>] schedule+0x29/0x70
May  6 11:31:02 wutq-docker kernel: [62394.424705]  [<ffffffff816f1afe>] schedule_preempt_disabled+0xe/0x10
May  6 11:31:02 wutq-docker kernel: [62394.424710]  [<ffffffff816f0777>] __mutex_lock_slowpath+0xd7/0x150
May  6 11:31:02 wutq-docker kernel: [62394.424715]  [<ffffffff815dc809>] ? copy_net_ns+0x69/0x130
May  6 11:31:02 wutq-docker kernel: [62394.424719]  [<ffffffff815dc0b1>] ? net_alloc_generic+0x21/0x30
May  6 11:31:02 wutq-docker kernel: [62394.424724]  [<ffffffff816f038a>] mutex_lock+0x2a/0x50
May  6 11:31:02 wutq-docker kernel: [62394.424727]  [<ffffffff815dc82c>] copy_net_ns+0x8c/0x130
May  6 11:31:02 wutq-docker kernel: [62394.424733]  [<ffffffff81084851>] create_new_namespaces+0x101/0x1b0
May  6 11:31:02 wutq-docker kernel: [62394.424737]  [<ffffffff81084a33>] copy_namespaces+0xa3/0xe0
May  6 11:31:02 wutq-docker kernel: [62394.424742]  [<ffffffff81057a60>] ? dup_mm+0x140/0x240
May  6 11:31:02 wutq-docker kernel: [62394.424746]  [<ffffffff81058294>] copy_process.part.22+0x6f4/0xe60
May  6 11:31:02 wutq-docker kernel: [62394.424752]  [<ffffffff812da406>] ? security_file_alloc+0x16/0x20
May  6 11:31:02 wutq-docker kernel: [62394.424758]  [<ffffffff8119d118>] ? get_empty_filp+0x88/0x180
May  6 11:31:02 wutq-docker kernel: [62394.424762]  [<ffffffff81058a80>] copy_process+0x80/0x90
May  6 11:31:02 wutq-docker kernel: [62394.424766]  [<ffffffff81058b7c>] do_fork+0x9c/0x230
May  6 11:31:02 wutq-docker kernel: [62394.424769]  [<ffffffff816f277e>] ? _raw_spin_lock+0xe/0x20
May  6 11:31:02 wutq-docker kernel: [62394.424774]  [<ffffffff811b9185>] ? __fd_install+0x55/0x70
May  6 11:31:02 wutq-docker kernel: [62394.424777]  [<ffffffff81058d96>] sys_clone+0x16/0x20
May  6 11:31:02 wutq-docker kernel: [62394.424782]  [<ffffffff816fb939>] stub_clone+0x69/0x90
May  6 11:31:02 wutq-docker kernel: [62394.424786]  [<ffffffff816fb5dd>] ? system_call_fastpath+0x1a/0x1f
May  6 11:31:04 wutq-docker kernel: [62396.466223] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:14 wutq-docker kernel: [62406.689132] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:25 wutq-docker kernel: [62416.908036] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:35 wutq-docker kernel: [62427.126927] unregister_netdevice: waiting for lo to become free. Usage count = 3
May  6 11:31:45 wutq-docker kernel: [62437.345860] unregister_netdevice: waiting for lo to become free. Usage count = 3

After this happened, I opened another terminal and killed the process, then restarted Docker, but it hung.

I rebooted the host, and it still displayed those messages for several minutes during shutdown: screen shot 2014-05-06 at 11 49 27

About this issue

  • Original URL
  • State: closed
  • Created 10 years ago
  • Reactions: 281
  • Comments: 539 (93 by maintainers)

Most upvoted comments

(repeating this https://github.com/moby/moby/issues/5618#issuecomment-351942943 here again, because GitHub is hiding old comments)

If you are arriving here

The issue being discussed here is a kernel bug and has not yet been fully fixed. Some patches went in the kernel that fix some occurrences of this issue, but others are not yet resolved.

There are a number of options that may help for some situations, but not for all (again; it’s most likely a combination of issues that trigger the same error)

The “unregister_netdevice: waiting for lo to become free” error itself is not the bug

It’s the kernel crash that follows which is the bug (see below)

Do not leave “I have this too” comments

“I have this too” does not help resolve the bug. Only leave a comment if you have information that may help resolve the issue (in which case, providing a patch to the kernel upstream may be the best step).

If you want to let us know that you have this issue too, use the “thumbs up” button on the top description: screen shot 2017-03-09 at 16 12 17

If you want to stay informed on updates use the subscribe button.

screen shot 2017-03-09 at 16 11 03

Every comment here sends an e-mail / notification to over 3000 people. I don’t want to lock the conversation on this issue, because it’s not resolved yet, but I may be forced to if you ignore this.

I will be removing comments that don’t add useful information in order to (slightly) shorten the thread

If you want to help resolving this issue

  • Read the whole thread, including those comments that are hidden; it’s long, and GitHub hides comments (so you’ll have to click to make those visible again). There’s a lot of information present in this thread already that could possibly help you
  • Read this comment https://github.com/moby/moby/issues/5618#issuecomment-316297818 (and comments around that time) for information that can be helpful
screen shot 2018-07-25 at 15 18 14

To be clear, the message itself is benign; it’s the kernel crash after the messages reported by the OP which is not.

The comment in the code where this message comes from explains what’s happening. Basically, every user (such as the IP stack) of a network device (such as the end of a veth pair inside a container) increments a reference count in the network device structure when it is using the network device. When the device is removed (e.g. when the container is removed), each user is notified so that it can do some cleanup (e.g. closing open sockets, etc.) before decrementing the reference count. Because this cleanup can take some time, especially under heavy load (lots of interfaces, a lot of connections, etc.), the kernel may print the message here once in a while.

If a user of a network device never decrements the reference count, some other part of the kernel will determine that the task waiting for the cleanup is stuck and it will crash. It is only this crash which indicates a kernel bug (some user, via some code path, did not decrement the reference count). There have been several such bugs and they have been fixed in modern kernels (and possibly backported to older ones). I have written quite a few stress tests (and continue writing them) to trigger such crashes, but have not been able to reproduce them on modern kernels (I do, however, see the above message).

Please only report on this issue if your kernel actually crashes, and then we would be very interested in:

  • kernel version (output of uname -r)
  • Linux distribution/version
  • Are you on the latest kernel version of your Linux vendor?
  • Network setup (bridge, overlay, IPv4, IPv6, etc)
  • Description of the workload (what type of containers, what type of network load, etc)
  • And ideally a simple reproduction
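
For reference, a minimal shell sketch for gathering most of these details on the affected host (adjust for your distribution):

# Kernel and distribution
uname -r
cat /etc/os-release
# Docker and networking setup
docker version
docker info | grep -iE 'kernel|operating system|storage driver'
ip -d link show
# Recent occurrences of the message and any hung-task reports
dmesg | grep -iE 'unregister_netdevice|hung task|Call Trace' | tail -n 30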

Thanks!

Hi all, I tried to debug the kernel issue and have been having an email chain on the “netdev” mailing list, so I just wanted to post some findings here.

https://www.spinics.net/lists/netdev/msg416310.html

The issue that we are seeing is that

unregister_netdevice: waiting for lo to become free. Usage count = 1

during container shutdown. When I inspect the container network namespace, it seems like the eth0 device has already been deleted and only the lo device is left. And there is another structure holding a reference to that device.

After some digging, it turns out the “thing” holding the reference is one of the “routing cache” entries (struct dst_entry). And something is preventing that particular dst_entry from being gc’ed (the reference count for the dst_entry is larger than 0). So I logged every dst_hold() (increments the dst_entry reference count by 1) and dst_release() (decrements it by 1), and there are indeed more dst_hold() calls than dst_release() calls.

Here are the logs, attached: kern.log.zip
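
For reference, the hold/release calls can be tallied from the extracted log with plain shell; this sketch assumes each traced call was logged on its own line containing the function name (the exact format in kern.log.zip may differ):

# Total increments vs decrements of the dst_entry reference count
grep -c 'dst_hold' kern.log
grep -c 'dst_release' kern.log
# Break the dst_hold() calls down by caller (assumes the caller name is on the same line)
grep 'dst_hold' kern.log | grep -oE 'udp_sk_rx_dst_set|inet_sk_rx_dst_set|__sk_add_backlog' | sort | uniq -c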

Summary:

  • the lo interface was renamed to lodebug for ease of grepping
  • the reference count for dst_entry starts with 1
  • the reference count for dst_entry (which is holding the reference for lo) at the end is 19
  • there are 258041 total dst_hold() calls, and 258023 total dst_release() calls
  • in the 258041 dst_hold() calls, there are 88034 udp_sk_rx_dst_set() calls (which then call dst_hold()), 152536 inet_sk_rx_dst_set() calls, and 17471 __sk_add_backlog() calls
  • in the 258023 dst_release() calls, there are 240551 inet_sock_destruct() and 17472 refdst_drop()

There are more udp_sk_rx_dst_set() and inet_sk_rx_dst_set() calls in total than inet_sock_destruct() calls, so I suspect some sockets are in a “limbo” state, with something preventing them from being destroyed.

UPDATE: It turns out sockets (struct sock) are created and destroyed correctly, but for some of the TCP sockets inet_sk_rx_dst_set() is called multiple times on the same dst, while there is only one corresponding inet_sock_destruct() call to release the reference to the dst.
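
For anyone who wants to observe this imbalance live instead of from logs, something along these lines with ftrace may work (a sketch; it assumes both symbols appear in available_filter_functions on your kernel and that tracefs/debugfs is mounted):

cd /sys/kernel/debug/tracing
echo inet_sk_rx_dst_set inet_sock_destruct > set_ftrace_filter
echo function > current_tracer
echo 1 > tracing_on
# ... run the container workload, then stop tracing and compare call counts
echo 0 > tracing_on
grep -c inet_sk_rx_dst_set trace
grep -c inet_sock_destruct trace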

kernel bug image

=)

[ @thaJeztah could you change the title to something like kernel crash after "unregister_netdevice: waiting for lo to become free. Usage count = 3" to make it more explicit]

This might be relevant: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1065434#yui_3_10_3_1_1401948176063_2050

Copying a fairly large amount of data over the network inside a container and then exiting the container can trigger a missing decrement in the per cpu reference count on a network device.

Sure enough, one of the times this happened for me was right after apt-getting a package with a ton of dependencies.
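
Based on that description, a crude way to poke at the suspected trigger is to repeatedly pull a lot of data over the network inside a short-lived container while watching the kernel log; this is only a sketch, not a reliable reproducer, and the download URL is a placeholder:

while true; do
    # download a large amount of data inside a throwaway container, then let it exit
    docker run --rm busybox wget -q -O /dev/null http://example.com/some-large-file
    # count how many times the message has appeared so far
    dmesg | grep -c 'unregister_netdevice: waiting for'
    sleep 1
done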

I believe I’ve fixed this issue, at least when caused by a kernel TCP socket connection. Test kernels for Ubuntu are available and I would love feedback if they help/fix this for anyone here. Patch is submitted upstream; more details are in LP bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/comments/46

@piec, many thanks for the details. I have a few more questions for you at the end of this very long comment.

Using the SMB setup I was able to reproduce a number of things with different kernels. I’ve tried this with the NFS setup as well, but no dice.

All tests are run with docker 17.06.1-ce on HyperKit with a VM configured with 2 vCPUs and 2GB of memory (via Docker for Mac, but that should not matter). I’m using LinuxKit kernels, because I can easily swap them out.

I modified your Dockerfile in that I added a call to date as the first command executed, and also added a call to date before and after the docker run for the client.

Experiment 1 (4.9.39 kernel)

With 4.9.39 (latest 4.9.x stable kernel) I get a kernel crash:

# while true; do  date;    docker run -it --rm --name client-smb --cap-add=SYS_ADMIN --cap-add DAC_READ_SEARCH --link samba:samba client-smb:1;  date;   sleep 1; done
Thu 27 Jul 2017 14:12:51 BST
+ date
Thu Jul 27 13:12:52 UTC 2017
+ mount.cifs //172.17.0.2/public /mnt/ -o vers=3.0,user=nobody,password=
+ date
Thu Jul 27 13:12:52 UTC 2017
+ ls -la /mnt
total 1028
drwxr-xr-x    2 root     root             0 Jul 27 10:11 .
drwxr-xr-x    1 root     root          4096 Jul 27 13:12 ..
-rwxr-xr-x    1 root     root             3 Jul 27 10:11 bla
+ umount /mnt
+ echo umount ok
umount ok
Thu 27 Jul 2017 14:12:52 BST
Thu 27 Jul 2017 14:12:53 BST

---> First iteration succeeds and then hangs on the docker run

and in dmesg:

[  268.347598] BUG: unable to handle kernel paging request at 0000000100000015
[  268.348072] IP: [<ffffffff8c64ea95>] sk_filter_uncharge+0x5/0x31
[  268.348411] PGD 0 [  268.348517]
[  268.348614] Oops: 0000 [#1] SMP
[  268.348789] Modules linked in:
[  268.348971] CPU: 1 PID: 2221 Comm: vsudd Not tainted 4.9.39-linuxkit #1
[  268.349330] Hardware name:   BHYVE, BIOS 1.00 03/14/2014
[  268.349620] task: ffff8b6ab8eb5100 task.stack: ffffa015c113c000
[  268.349995] RIP: 0010:[<ffffffff8c64ea95>]  [<ffffffff8c64ea95>] sk_filter_uncharge+0x5/0x31
[  268.350509] RSP: 0018:ffffa015c113fe10  EFLAGS: 00010202
[  268.350818] RAX: 0000000000000000 RBX: ffff8b6ab7eee6a8 RCX: 0000000000000006
[  268.351231] RDX: 00000000ffffffff RSI: 00000000fffffffd RDI: ffff8b6ab7eee400
[  268.351636] RBP: ffff8b6ab7eee400 R08: 0000000000000000 R09: 0000000000000000
[  268.352022] R10: ffffa015c101fcb0 R11: 0000000000000000 R12: 0000000000000000
[  268.352409] R13: ffff8b6ab7eee4a8 R14: ffff8b6ab7f8e340 R15: 0000000000000000
[  268.352796] FS:  00007f03f62e3eb0(0000) GS:ffff8b6abc700000(0000) knlGS:0000000000000000
[  268.353234] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  268.353546] CR2: 0000000100000015 CR3: 00000000782d2000 CR4: 00000000000406a0
[  268.353961] Stack:
[  268.354106]  ffffffff8c625054 ffff8b6ab7eee400 ffffa015c113fe88 0000000000000000
[  268.354526]  ffffffff8c74ed96 01000008bc718980 0000000000000000 0000000000000000
[  268.354965]  de66927a28223151 ffff8b6ab4443a40 ffffa015c101fcb0 ffff8b6ab4443a70
[  268.355384] Call Trace:
[  268.355523]  [<ffffffff8c625054>] ? __sk_destruct+0x35/0x133
[  268.355822]  [<ffffffff8c74ed96>] ? unix_release_sock+0x1df/0x212
[  268.356164]  [<ffffffff8c74ede2>] ? unix_release+0x19/0x25
[  268.356454]  [<ffffffff8c62034c>] ? sock_release+0x1a/0x6c
[  268.356742]  [<ffffffff8c6203ac>] ? sock_close+0xe/0x11
[  268.357019]  [<ffffffff8c1f8710>] ? __fput+0xdd/0x17b
[  268.357288]  [<ffffffff8c0f604d>] ? task_work_run+0x64/0x7a
[  268.357583]  [<ffffffff8c003285>] ? prepare_exit_to_usermode+0x7d/0xa4
[  268.357925]  [<ffffffff8c7d2884>] ? entry_SYSCALL_64_fastpath+0xa7/0xa9
[  268.358268] Code: 08 4c 89 e7 e8 fb f8 ff ff 48 3d 00 f0 ff ff 77 06 48 89 45 00 31 c0 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 44 00 00 <48> 8b 46 18 8b 40 04 48 8d 04 c5 28 00 00 00 f0 29 87 24 01 00
[  268.359776] RIP  [<ffffffff8c64ea95>] sk_filter_uncharge+0x5/0x31
[  268.360118]  RSP <ffffa015c113fe10>
[  268.360311] CR2: 0000000100000015
[  268.360550] ---[ end trace 4a7830b42d5acfb3 ]---
[  268.360861] Kernel panic - not syncing: Fatal exception
[  268.361217] Kernel Offset: 0xb000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  268.361789] Rebooting in 120 seconds..

Sometimes I see several iterations of what the 4.11.12 kernel does, including the unregister_netdevice messages (see below), and then get the kernel crash above. Sometimes I see slight variations of the crash, like:

[  715.926694] BUG: unable to handle kernel paging request at 00000000fffffdc9
[  715.927380] IP: [<ffffffff8664ea95>] sk_filter_uncharge+0x5/0x31
[  715.927868] PGD 0 [  715.928022]
[  715.928174] Oops: 0000 [#1] SMP
[  715.928424] Modules linked in:
[  715.928703] CPU: 0 PID: 2665 Comm: runc:[0:PARENT] Not tainted 4.9.39-linuxkit #1
[  715.929321] Hardware name:   BHYVE, BIOS 1.00 03/14/2014
[  715.929765] task: ffff931538ef4140 task.stack: ffffbcbbc0214000
[  715.930279] RIP: 0010:[<ffffffff8664ea95>]  [<ffffffff8664ea95>] sk_filter_uncharge+0x5/0x31
[  715.931043] RSP: 0018:ffffbcbbc0217be0  EFLAGS: 00010206
[  715.931487] RAX: 0000000000000000 RBX: ffff931532a662a8 RCX: 0000000000000006
[  715.932043] RDX: 00000000ffffffff RSI: 00000000fffffdb1 RDI: ffff931532a66000
[  715.932612] RBP: ffff931532a66000 R08: 0000000000000000 R09: 0000000000000000
[  715.933181] R10: ffff9315394f2990 R11: 000000000001bb68 R12: ffff931532a66000
[  715.933725] R13: ffff9315328060a8 R14: ffff931532a66340 R15: 0000000000000000
[  715.934258] FS:  0000000000000000(0000) GS:ffff93153c600000(0000) knlGS:0000000000000000
[  715.934857] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  715.935286] CR2: 00000000fffffdc9 CR3: 0000000052c09000 CR4: 00000000000406b0
[  715.935822] Stack:
[  715.935974]  ffffffff86625054 ffff931532806000 ffffbcbbc0217c58 ffff931532a66000
[  715.936560]  ffffffff8674ed37 0100000800000282 0000000000000000 0000000000000000
[  715.937173]  5de0b9a3a313c00b ffff9315346f5080 ffff9315394f2990 ffff9315346f50b0
[  715.937751] Call Trace:
[  715.937982]  [<ffffffff86625054>] ? __sk_destruct+0x35/0x133
[  715.938608]  [<ffffffff8674ed37>] ? unix_release_sock+0x180/0x212
[  715.939130]  [<ffffffff8674ede2>] ? unix_release+0x19/0x25
[  715.939517]  [<ffffffff8662034c>] ? sock_release+0x1a/0x6c
[  715.939907]  [<ffffffff866203ac>] ? sock_close+0xe/0x11
[  715.940277]  [<ffffffff861f8710>] ? __fput+0xdd/0x17b
[  715.940635]  [<ffffffff860f604d>] ? task_work_run+0x64/0x7a
[  715.941072]  [<ffffffff860e148a>] ? do_exit+0x42a/0x8e0
[  715.941472]  [<ffffffff8674edfa>] ? scm_destroy+0xc/0x25
[  715.941880]  [<ffffffff867504e0>] ? unix_stream_sendmsg+0x2dd/0x30b
[  715.942357]  [<ffffffff860e19aa>] ? do_group_exit+0x3c/0x9d
[  715.942780]  [<ffffffff860eac41>] ? get_signal+0x45d/0x4e2
[  715.943210]  [<ffffffff86621640>] ? sock_sendmsg+0x2d/0x3c
[  715.943618]  [<ffffffff8602055a>] ? do_signal+0x36/0x4c9
[  715.944017]  [<ffffffff861f64c7>] ? __vfs_write+0x8f/0xcc
[  715.944416]  [<ffffffff861f7100>] ? vfs_write+0xbb/0xc7
[  715.944809]  [<ffffffff8600326c>] ? prepare_exit_to_usermode+0x64/0xa4
[  715.945295]  [<ffffffff867d2884>] ? entry_SYSCALL_64_fastpath+0xa7/0xa9
[  715.945789] Code: 08 4c 89 e7 e8 fb f8 ff ff 48 3d 00 f0 ff ff 77 06 48 89 45 00 31 c0 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 44 00 00 <48> 8b 46 18 8b 40 04 48 8d 04 c5 28 00 00 00 f0 29 87 24 01 00
[  715.947701] RIP  [<ffffffff8664ea95>] sk_filter_uncharge+0x5/0x31
[  715.948112]  RSP <ffffbcbbc0217be0>
[  715.948292] CR2: 00000000fffffdc9
[  715.948467] ---[ end trace 2d69bea56725fd5f ]---
[  715.948722] Kernel panic - not syncing: Fatal exception
[  715.949059] Kernel Offset: 0x5000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  715.949595] Rebooting in 120 seconds..

The crashes are in the unix domain socket code and are similar/identical to what is reported here, though with this new test case it is much easier to reproduce.

Experiment 2 (4.11.12 kernel)

With 4.11.12 (which is the latest stable in the 4.11 series) I see no crashes, but it is really slow (annotations inline with --->):

# while true; do  date;    docker run -it --rm --name client-smb --cap-add=SYS_ADMIN --cap-add DAC_READ_SEARCH --link samba:samba client-smb:1;  date;   sleep 1; done
Thu 27 Jul 2017 13:48:04 BST
+ date
Thu Jul 27 12:48:05 UTC 2017
+ mount.cifs //172.17.0.2/public /mnt/ -o vers=3.0,user=nobody,password=
+ date
Thu Jul 27 12:48:05 UTC 2017
+ ls -la /mnt
total 1028
drwxr-xr-x    2 root     root             0 Jul 27 10:11 .
drwxr-xr-x    1 root     root          4096 Jul 27 12:48 ..
-rwxr-xr-x    1 root     root             3 Jul 27 10:11 bla
+ umount /mnt
+ echo umount ok
umount ok
Thu 27 Jul 2017 13:48:05 BST

---> First iteration takes one second

Thu 27 Jul 2017 13:48:06 BST
docker: Error response from daemon: containerd: container did not start before the specified timeout.
Thu 27 Jul 2017 13:50:07 BST

---> Second iteration fails after 2 minutes with dockerd unable to start the container

Thu 27 Jul 2017 13:50:08 BST
+ date
Thu Jul 27 12:51:52 UTC 2017
+ mount.cifs //172.17.0.2/public /mnt/ -o vers=3.0,user=nobody,password=
+ date
Thu Jul 27 12:51:53 UTC 2017
+ ls -la /mnt
total 1028
drwxr-xr-x    2 root     root             0 Jul 27 10:11 .
drwxr-xr-x    1 root     root          4096 Jul 27 12:50 ..
-rwxr-xr-x    1 root     root             3 Jul 27 10:11 bla
+ umount /mnt
+ echo umount ok
umount ok
Thu 27 Jul 2017 13:51:53 BST

---> Third iterations succeeds, BUT it takes almost 2 minutes between docker run and the container running

Thu 27 Jul 2017 13:51:54 BST
docker: Error response from daemon: containerd: container did not start before the specified timeout.
Thu 27 Jul 2017 13:53:55 BST

---> Fourth iteration fails after two minutes

Thu 27 Jul 2017 13:53:56 BST
+ date
Thu Jul 27 12:55:37 UTC 2017
+ mount.cifs //172.17.0.2/public /mnt/ -o vers=3.0,user=nobody,password=
+ date
Thu Jul 27 12:55:37 UTC 2017
+ ls -la /mnt
total 1028
drwxr-xr-x    2 root     root             0 Jul 27 10:11 .
drwxr-xr-x    1 root     root          4096 Jul 27 12:53 ..
-rwxr-xr-x    1 root     root             3 Jul 27 10:11 bla
+ umount /mnt
+ echo umount ok
umount ok
Thu 27 Jul 2017 13:55:38 BST

---> Fifth iteration succeeds, but almost 2 minutes between docker run and the container executing

I had this running for an hour or so with the same pattern repeating, but no kernel crash.

In the kernel logs I see lots of:

[   84.940380] unregister_netdevice: waiting for lo to become free. Usage count = 1
[   95.082151] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  105.253289] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  115.477095] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  125.627059] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  135.789298] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  145.969455] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  156.101126] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  166.303333] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  176.445791] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  186.675958] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  196.870265] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  206.998238] unregister_netdevice: waiting for lo to become free. Usage count = 1
[...]

That is a message every ten seconds.

Since this does not cause the hung task detection to kick in even after an hour, I suspect that with 4.11.12 the reference count eventually gets decremented and the device gets freed, but, judging by the intervals at which I can run containers, it might take up to 4 minutes!

Experiment 3 (4.11.12 kernel)

The kernel crash in the OP indicated that the kernel crashed because a hung task was detected. I have not seen this crash in my testing, so I changed the sysctl settings related to hung task detection:

# sysctl -a | grep kernel.hung_task
kernel.hung_task_check_count = 4194304
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 120
kernel.hung_task_warnings = 10
# sysctl -w kernel.hung_task_timeout_secs=60
# sysctl -w kernel.hung_task_panic=1

This reduces the timeout to 60 seconds and panics the kernel if a hung task is detected. Since it takes around 2 minutes before dockerd complains that containerd did not start, reducing the hung task detection to 60s ought to trigger a kernel panic if a single task were hung. Alas, there was no crash in the logs.

Experiment 4 (4.11.12 kernel)

Next, I increased the sleep after each docker run to 5 minutes to see if the messages are continuous. In this case all docker runs seem to work, which is kind of expected, since from the previous experiments a docker run would work every 4 minutes or so.

---> This is after the first run
[  281.406660] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  291.455945] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  301.721340] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  311.988572] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  322.258805] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  332.527383] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  342.796511] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  353.059499] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  363.327472] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  373.365562] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  383.635923] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  393.684949] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  403.950186] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  414.221779] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  424.490110] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  434.754925] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  445.022243] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  455.292106] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  465.557462] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  475.826946] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  486.097833] unregister_netdevice: waiting for lo to become free. Usage count = 1

---> 200+ seconds of messages and then nothing for almost 400 seconds

[  883.924399] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  893.975810] unregister_netdevice: waiting for lo to become free. Usage count = 1
...
[ 1088.624065] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 1098.891297] unregister_netdevice: waiting for lo to become free. Usage count = 1

---> 200+ seconds of messages and then a gap of 90 seconds

[ 1185.119327] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 1195.387962] unregister_netdevice: waiting for lo to become free. Usage count = 1
...
[ 1390.040035] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 1400.307359] unregister_netdevice: waiting for lo to become free. Usage count = 1

---> 200+ seconds of messages and then a gap of 80+ seconds

[ 1486.325724] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 1496.591715] unregister_netdevice: waiting for lo to become free. Usage count = 1
...
[ 1680.987216] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 1691.255068] unregister_netdevice: waiting for lo to become free. Usage count = 1

---> 200+ seconds of messages and then a gap of 90+ seconds

[ 1787.547334] unregister_netdevice: waiting for lo to become free. Usage count = 1
[ 1797.819703] unregister_netdevice: waiting for lo to become free. Usage count = 1

It looks like we are getting around 200 seconds worth of unregister_netdevice messages on almost every docker run (except for the second one). I suspect during that time we can’t start new containers (as indicated by Experiment 2). It’s curious that the hung task detection is not kicking in, presumably because no task is hung.

Experiment 5 (4.11.12/4.9.39 with extra debugging enabled in the kernel)

This reverts back to a 1s sleep between docker runs.

We have another kernel which enables a bunch of additional debug options, such as LOCKDEP, RCU_TRACE, LOCKUP_DETECTOR, and a few more.

Running the repro on the 4.11.12 kernel with these debug options enabled did not trigger anything.

Ditto for the 4.9.39 kernel, where the normal kernel crashes. The debug options change the timing slightly, so this may be an additional clue that the crash in the unix domain socket code is due to a race.

Digging a bit deeper

strace on the various containerd processes is not helpful (it usually isn’t, because they’re written in Go). Lots of long stalls in futex(...FUTEX_WAIT...) without any information on where/why.
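
Since strace gives little insight into Go programs, a Go-level stack dump is usually more useful here; the Docker daemon dumps all of its goroutine stack traces when it receives SIGUSR1 (recent versions write them to a file and log its location), so something like this can show what dockerd itself is blocked on:

kill -USR1 "$(pidof dockerd)"
# the daemon log (journalctl -u docker on systemd hosts) then reports where the
# goroutine-stacks dump file was written; inspect it for long-blocked goroutines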

Some poking around with sysrq:

Increase verbosity:

echo 9 > /proc/sysrq-trigger

Stack trace from all CPUs:

echo l > /proc/sysrq-trigger
[ 1034.298202] sysrq: SysRq : Show backtrace of all active CPUs
[ 1034.298738] NMI backtrace for cpu 1
[ 1034.299073] CPU: 1 PID: 2235 Comm: sh Tainted: G    B           4.11.12-linuxkit #1
[ 1034.299818] Hardware name:   BHYVE, BIOS 1.00 03/14/2014
[ 1034.300286] Call Trace:
[ 1034.300517]  dump_stack+0x82/0xb8
[ 1034.300827]  nmi_cpu_backtrace+0x75/0x87
[ 1034.301200]  ? irq_force_complete_move+0xf1/0xf1
[ 1034.301633]  nmi_trigger_cpumask_backtrace+0x6e/0xfd
[ 1034.302097]  arch_trigger_cpumask_backtrace+0x19/0x1b
[ 1034.302560]  ? arch_trigger_cpumask_backtrace+0x19/0x1b
[ 1034.302989]  sysrq_handle_showallcpus+0x17/0x19
[ 1034.303438]  __handle_sysrq+0xe4/0x172
[ 1034.303826]  write_sysrq_trigger+0x47/0x4f
[ 1034.304210]  proc_reg_write+0x5d/0x76
[ 1034.304507]  __vfs_write+0x35/0xc8
[ 1034.304773]  ? rcu_sync_lockdep_assert+0x12/0x52
[ 1034.305132]  ? __sb_start_write+0x152/0x189
[ 1034.305458]  ? file_start_write+0x27/0x29
[ 1034.305770]  vfs_write+0xda/0x100
[ 1034.306029]  SyS_write+0x5f/0xa3
[ 1034.306283]  entry_SYSCALL_64_fastpath+0x1f/0xc2
[ 1034.306638] RIP: 0033:0x7fa4810488a9
[ 1034.306976] RSP: 002b:00007fffd3a29828 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 1034.307567] RAX: ffffffffffffffda RBX: 000000c6b523a020 RCX: 00007fa4810488a9
[ 1034.308101] RDX: 0000000000000002 RSI: 000000c6b5239d00 RDI: 0000000000000001
[ 1034.308635] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 1034.309169] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 1034.309700] R13: 0000000000000000 R14: 00007fffd3a29988 R15: 00007fa481280ee0
[ 1034.310334] Sending NMI from CPU 1 to CPUs 0:
[ 1034.310710] NMI backtrace for cpu 0 skipped: idling at pc 0xffffffffa0922756

Nothing here; CPU0 is idle, CPU1 is handling the sysrq.

Show blocked tasks (twice)

echo w > /proc/sysrq-trigger
[  467.167062] sysrq: SysRq : Show Blocked State
[  467.167731]   task                        PC stack   pid father
[  467.168580] kworker/u4:6    D    0   293      2 0x00000000
[  467.169096] Workqueue: netns cleanup_net
[  467.169487] Call Trace:
[  467.169732]  __schedule+0x582/0x701
[  467.170073]  schedule+0x89/0x9a
[  467.170338]  schedule_timeout+0xbf/0xff
[  467.170666]  ? del_timer_sync+0xc1/0xc1
[  467.171011]  schedule_timeout_uninterruptible+0x2a/0x2c
[  467.171422]  ? schedule_timeout_uninterruptible+0x2a/0x2c
[  467.171866]  msleep+0x1e/0x22
[  467.172155]  netdev_run_todo+0x173/0x2c4
[  467.172499]  rtnl_unlock+0xe/0x10
[  467.172770]  default_device_exit_batch+0x13c/0x15f
[  467.173226]  ? __wake_up_sync+0x12/0x12
[  467.173550]  ops_exit_list+0x29/0x53
[  467.173850]  cleanup_net+0x1a8/0x261
[  467.174153]  process_one_work+0x276/0x4fb
[  467.174487]  worker_thread+0x1eb/0x2ca
[  467.174800]  ? rescuer_thread+0x2d9/0x2d9
[  467.175136]  kthread+0x106/0x10e
[  467.175406]  ? __list_del_entry+0x22/0x22
[  467.175737]  ret_from_fork+0x2a/0x40
[  467.176167] runc:[1:CHILD]  D    0  2609   2606 0x00000000
[  467.176636] Call Trace:
[  467.176849]  __schedule+0x582/0x701
[  467.177152]  schedule+0x89/0x9a
[  467.177451]  schedule_preempt_disabled+0x15/0x1e
[  467.177827]  __mutex_lock+0x2a0/0x3ef
[  467.178133]  ? copy_net_ns+0xbb/0x17c
[  467.178456]  mutex_lock_killable_nested+0x1b/0x1d
[  467.179068]  ? mutex_lock_killable_nested+0x1b/0x1d
[  467.179489]  copy_net_ns+0xbb/0x17c
[  467.179798]  create_new_namespaces+0x12b/0x19b
[  467.180151]  unshare_nsproxy_namespaces+0x8f/0xaf
[  467.180569]  SyS_unshare+0x17b/0x302
[  467.180925]  entry_SYSCALL_64_fastpath+0x1f/0xc2
[  467.181303] RIP: 0033:0x737b97
[  467.181559] RSP: 002b:00007fff1965ab18 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
[  467.182182] RAX: ffffffffffffffda RBX: 0000000002277bd8 RCX: 0000000000737b97
[  467.182805] RDX: 0000000000000000 RSI: 0000000000867a0f RDI: 000000006c020000
[  467.183368] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  467.184014] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  467.184639] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  477.286653] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  487.457828] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  497.659654] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  507.831614] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  518.030241] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  528.232963] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  538.412263] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  548.583610] unregister_netdevice: waiting for lo to become free. Usage count = 1
echo w > /proc/sysrq-trigger
[  553.969592] sysrq: SysRq : Show Blocked State
[  553.970411]   task                        PC stack   pid father
[  553.971208] kworker/u4:6    D    0   293      2 0x00000000
[  553.971686] Workqueue: netns cleanup_net
[  553.972058] Call Trace:
[  553.972305]  __schedule+0x582/0x701
[  553.972690]  schedule+0x89/0x9a
[  553.973039]  schedule_timeout+0xbf/0xff
[  553.973462]  ? del_timer_sync+0xc1/0xc1
[  553.973890]  schedule_timeout_uninterruptible+0x2a/0x2c
[  553.974706]  ? schedule_timeout_uninterruptible+0x2a/0x2c
[  553.975244]  msleep+0x1e/0x22
[  553.975539]  netdev_run_todo+0x173/0x2c4
[  553.975950]  rtnl_unlock+0xe/0x10
[  553.976303]  default_device_exit_batch+0x13c/0x15f
[  553.976725]  ? __wake_up_sync+0x12/0x12
[  553.977121]  ops_exit_list+0x29/0x53
[  553.977501]  cleanup_net+0x1a8/0x261
[  553.977869]  process_one_work+0x276/0x4fb
[  553.978245]  worker_thread+0x1eb/0x2ca
[  553.978578]  ? rescuer_thread+0x2d9/0x2d9
[  553.978933]  kthread+0x106/0x10e
[  553.979283]  ? __list_del_entry+0x22/0x22
[  553.979774]  ret_from_fork+0x2a/0x40
[  553.980244] runc:[1:CHILD]  D    0  2609   2606 0x00000000
[  553.980728] Call Trace:
[  553.980949]  __schedule+0x582/0x701
[  553.981254]  schedule+0x89/0x9a
[  553.981533]  schedule_preempt_disabled+0x15/0x1e
[  553.981917]  __mutex_lock+0x2a0/0x3ef
[  553.982220]  ? copy_net_ns+0xbb/0x17c
[  553.982524]  mutex_lock_killable_nested+0x1b/0x1d
[  553.982909]  ? mutex_lock_killable_nested+0x1b/0x1d
[  553.983311]  copy_net_ns+0xbb/0x17c
[  553.983606]  create_new_namespaces+0x12b/0x19b
[  553.983977]  unshare_nsproxy_namespaces+0x8f/0xaf
[  553.984363]  SyS_unshare+0x17b/0x302
[  553.984663]  entry_SYSCALL_64_fastpath+0x1f/0xc2
[  553.985080] RIP: 0033:0x737b97
[  553.985306] RSP: 002b:00007fff1965ab18 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
[  553.985861] RAX: ffffffffffffffda RBX: 0000000002277bd8 RCX: 0000000000737b97
[  553.986383] RDX: 0000000000000000 RSI: 0000000000867a0f RDI: 000000006c020000
[  553.986811] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  553.987182] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  553.987551] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
/ # [  558.844761] unregister_netdevice: waiting for lo to become free. Usage count = 1

This shows that both the netns and cleanup_net work queues are busy. I found a somewhat related issue quite a while back here, but this time the cleanup_net workqueue is in a different state.
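
A cheap way to see where the cleanup_net worker is stuck, without sysrq, is to read the kernel stack of the blocked kworker directly (a sketch; PID 293 is taken from the dump above, and /proc/<pid>/stack requires CONFIG_STACKTRACE):

# the blocked worker in the sysrq output above is kworker/u4:6 with PID 293
cat /proc/293/stack
# or scan all kworkers for one sitting in netns cleanup
for p in $(pgrep kworker); do
    grep -qE 'cleanup_net|netdev_run_todo' "/proc/$p/stack" 2>/dev/null && \
        { echo "PID $p:"; cat "/proc/$p/stack"; }
done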

Summary

  • I think the crash on 4.9.x kernels is unrelated, but it seems to be fixed in 4.11.x. This needs a bit more triage.
  • Unlike some previous reports, no hung tasks are reported, but it’s hard to tell because there are very few stack traces on this issue.
  • Something is blocked for a very long time (2-4 minutes). It’s likely related to the workqueue state.
  • The dump of the workqueue needs more analysis, in particular why the workqueue stays in that state for so long.
  • The unregister_netdev messages seem unrelated to the recent fix (which is in both 4.9.39 and 4.11.12). This may be because the cleanup_net work queue is not progressing and thus the message is printed.
  • It’s entirely unclear how/why SMB is triggering this. I’ve written some 60-odd stress tests for network namespaces and different workloads and was unable to trigger any issues. The tests were based on runc; maybe I should try containerd.

I will dig a bit more and then send a summary to netdev.

@piec do you have console access, and can you see if there is anything in terms of a crash dump, or do you also just see huge delays like I do? If you have a crash dump, I’d be very interested in seeing it. Also, are you running on bare metal or in a VM? What’s your configuration in terms of CPUs and memory?

Hi guys,

There is a potential patch for the kernel bug (or at least one of the bugs) in the Linux net-dev mailing list:

https://www.spinics.net/lists/netdev/msg442211.html

It’s merged into the net tree and queued for the stable tree.

Smyte will pay $5000 USD for the resolution of this issue. Sounds like I need to talk to someone who works on the kernel?

same here! 😢

same problem.

[root@docker1 ~]# cat /etc/redhat-release 
CentOS Linux release 7.3.1611 (Core) 
[root@docker1 ~]# uname  -r
3.10.0-514.26.2.el7.x86_64
[root@docker1 ~]# docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-32.git88a4867.el7.centos.x86_64
 Go version:      go1.7.4
 Git commit:      88a4867/1.12.6
 Built:           Mon Jul  3 16:02:02 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-1.12.6-32.git88a4867.el7.centos.x86_64
 Go version:      go1.7.4
 Git commit:      88a4867/1.12.6
 Built:           Mon Jul  3 16:02:02 2017
 OS/Arch:         linux/amd64

According to https://github.com/torvalds/linux/commit/d747a7a51b00984127a88113cdbbc26f91e9d815 - it is in 4.12 (which was released yesterday)!

The issue was assigned 10 days ago and it is a work in progress; you can see more insight into what’s going on here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407

Hopefully Dan Streetman finds out how to fix it

kernel: unregister_netdevice: waiting for eth0 to become free. Usage count = 3
kernel: 4.4.146-1.el7.elrepo.x86_64
Linux version: CentOS Linux release 7.4.1708 (Core)
network: bridge mode

I had the same issue. What can I do?

@samvignoli your comments are not constructive. Please stop posting.

https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.4.114

commit edaafa805e0f9d09560a4892790b8e19cab8bf09
Author: Dan Streetman <ddstreet@ieee.org>
Date:   Thu Jan 18 16:14:26 2018 -0500

net: tcp: close sock if net namespace is exiting


[ Upstream commit 4ee806d51176ba7b8ff1efd81f271d7252e03a1d ]

When a tcp socket is closed, if it detects that its net namespace is
exiting, close immediately and do not wait for FIN sequence.

For normal sockets, a reference is taken to their net namespace, so it will
never exit while the socket is open.  However, kernel sockets do not take a
reference to their net namespace, so it may begin exiting while the kernel
socket is still open.  In this case if the kernel socket is a tcp socket,
it will stay open trying to complete its close sequence.  The sock's dst(s)
hold a reference to their interface, which are all transferred to the
namespace's loopback interface when the real interfaces are taken down.
When the namespace tries to take down its loopback interface, it hangs
waiting for all references to the loopback interface to release, which
results in messages like:

unregister_netdevice: waiting for lo to become free. Usage count = 1

These messages continue until the socket finally times out and closes.
Since the net namespace cleanup holds the net_mutex while calling its
registered pernet callbacks, any new net namespace initialization is
blocked until the current net namespace finishes exiting.

After this change, the tcp socket notices the exiting net namespace, and
closes immediately, releasing its dst(s) and their reference to the
loopback interface, which lets the net namespace continue exiting.

Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
Signed-off-by: Dan Streetman <ddstreet@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

While I’m not the one who is fixing this issue, not being much into Linux Kernel dev, I think I am right in saying that the “me too” comments aren’t that helpful. By this I mean, just saying “I have this problem too, with Kernel vx.x and Docker 1.x” does not bring anything new to the discussion.

However, I would suggest that “me too” comments which describe the environment and the method to reproduce in more detail would be of great value.

When reading all the comments, it is clear that there are a few problems: as I posted earlier, some with vethXYZ, some with eth0, and others with lo0. This suggests that they could be caused by different problems. So just saying “me too” without a full description of the error and environment may mislead people.

Also, when describing the environment, giving the Kernel and Docker version is not sufficient. Per the thread, there seem to be a few factors, such as whether ipv6 is enabled or not, or NodeJS not responding to SIGINT (or other containers; not bashing on NodeJS here).

So describing what the workload on the environment is would be useful. Also, this occurs when a container is being shut down, therefore I would also suggest that people experiencing this issue pay attention to which container is being stopped when the problem rears its ugly head.

While it seems the problem is in the Kernel having a race condition, identifying the trigger will be of tremendous help to those who will fix the issue. And it can even give the affected users an immediate solution, such as implementing a signal handler in a NodeJS application (I don’t know myself that this prevents the issue from triggering, but it seems so per earlier comments of others).

Thank you to coolljt0725 and co (and everybody in this thread). Since many people will be unable to update to a kernel with the ipv6 patch for some time (everyone, currently), I’ve managed to squash this bug after trying many of the suggestions from this thread. I want to make a full post to follow up on things that did and did not work, so that nobody else has to see the trouble I’ve seen.

TL;DR disable ipv6 in linux boot params, reboot. on coreos this means /usr/share/oem/grub.cfg has the contents: set linux_append="ipv6.disable=1" and then a reboot. a more general purpose suggestion that should work on centos/ubuntu/debian/$linuxes may be found here

  • tried ipvlan(l2,l3)/macvlan(bridge): neither of these work on AWS, or at least I do not possess, nor could I find, the knowledge to finagle either of them to work on AWS. By “work” I mean that a container started attached to an ipvlan or macvlan network was unable to ping the gateway / connect to the internet (yes, I tested the basic idea working in a non-AWS environment). This did in fact seem to resolve the issue at hand, but for our use case we need to be able to connect to the internet; for use cases that do not, this may be a viable option and looks pretty sexy vs. the bridge.
  • tried the following flags passed to dockerd individually, and with certain combinations (since none of them seemed to work, I wasn’t too scientific about trying any and all combinations):
--ipv6=false
--iptables=false
--ip-forward=false
--icc=false
--ip-masq=false
--userland-proxy=false

interestingly, --ipv6=false doesn’t really seem to do anything; this was quite perplexing, as containers still received inet6 addresses with this flag.

--userland-proxy=false sets hairpin mode and wasn’t really expected to work. In conjunction with this I had some hope, but it did not resolve the issue either (setting docker0 to promisc mode). There is a mention of a fix to --userland-proxy=false here, which may land upstream soon and be worth another shot; it would be nice to turn this off for performance regardless of the bug noted in this issue, but unfortunately it has yet another bug at this time.

  • tried disabling ipv6 via sysctl settings as documented here and restarting systemd-networkd after applying the sysctl settings, as well as attempting to disable ipv6 from dhcpcd as documented here; this was not enough to disable ipv6, as it’s still turned on even if no interfaces are using it.
  • as was suggested here we gave this solution a try, only removing one container at a time, and we were still met with this bug.

too long; did read: disable ipv6 in your grub settings. reboot. profit.
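
For the “general purpose” variant on GRUB-based distributions, the usual approach looks roughly like this (a sketch; the grub config command and file paths differ between Debian/Ubuntu and CentOS/RHEL):

# add ipv6.disable=1 to the kernel command line (Debian/Ubuntu style)
sed -i 's/^GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="ipv6.disable=1 /' /etc/default/grub
update-grub        # on CentOS/RHEL: grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
# after the reboot, verify that no interface has an inet6 address
ip -6 addr show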

Is there some official word from Docker 🐳 on when this might be looked at? This is the second most commented open issue; it is very severe (necessitating a host restart); it is reproducible; and I don’t see any real progress toward pinning down the root cause or fixing it 😞.

This seems most likely to be a kernel issue, but the ticket on Bugzilla has been stagnant for months. Would it be helpful to post our test cases there?

Everyone who’s seeing this error on their system is running a package of the Linux kernel on their distribution that’s far too old and lacks the fixes for this particular problem.

If you run into this problem, make sure you run apt-get update && apt-get dist-upgrade -y and reboot your system. If you’re on Digital Ocean, you also need to select the kernel version which was just installed during the update because they don’t use the latest kernel automatically (see https://digitalocean.uservoice.com/forums/136585-digitalocean/suggestions/2814988-give-option-to-use-the-droplet-s-own-bootloader).

CentOS/RHEL/Fedora/Scientific Linux users need to keep their systems updated using yum update and reboot after installing the updates.

When reporting this problem, please make sure your system is fully patched and up to date with the latest stable updates (no manually installed experimental/testing/alpha/beta/rc packages) provided by your distribution’s vendor.

Same issue:

CentOS Linux release 7.5.1804 (Core)
Docker version 18.06.1-ce, build e68fc7a
Kernel Version: 3.10.0-693.el7.x86_64

Everyone please understand this is a common SYMPTOM that has many causes. What has worked for you to avoid this may not work for someone else.

Red Hat claims to have an instance of this bug fixed as of kernel-3.10.0-514.21.1.el7 release. I suppose they will upstream the fix as soon as possible and rebase to 4.12. This package is already available on CentOS 7 as well.

Documentation related to the fix (RHN access needed): https://access.redhat.com/articles/3034221 https://bugzilla.redhat.com/show_bug.cgi?id=1436588

From the article: “In case of a duplicate IPv6 address or an issue with setting an address, a race condition occurred. This race condition sometimes caused address reference counting leak. Consequently, attempts to unregister a network device failed with the following error message: “unregister_netdevice: waiting for to become free. Usage count = 1”. With this update, the underlying source code has been fixed, and network devices now unregister as expected in the described situation.”

I have already deployed this fix on all systems in our PaaS pool, and there have already been 2 days without the bug being hit. Earlier, we had at least one system freezing per day. I will report here if we hit the bug again.
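
A quick way to check whether a CentOS/RHEL host is already on a kernel at or above that fixed release (a rough sketch using version sort):

fixed="3.10.0-514.21.1.el7"
running="$(uname -r | sed 's/\.x86_64$//')"
# sort -V prints the older version first; if that is the fixed version,
# the running kernel is the same or newer
if [ "$(printf '%s\n%s\n' "$fixed" "$running" | sort -V | head -n1)" = "$fixed" ]; then
    echo "kernel $running should include the fix (>= $fixed)"
else
    echo "kernel $running is older than $fixed; update and reboot"
fi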

The last notice (July 2018) advising users to 👍 / subscribe if they don’t have helpful information to contribute regarding crashes is no longer visible.

I think this comment I’ve put together summarizes the issue well enough for anyone who wants to dig through it. I also have the impression that many of the past discussions have resolved the issue experienced by the majority, and that the reproductions shared no longer produce the failure or log message.

I would suggest closing / locking this issue. It has been 9 years to this day since it was opened. It would be better to create a new issue for anyone else still affected to follow, where more recent / relevant information can be tracked.

System details

This is not a kernel crash report, but the result of an attempt to go through over 600 items of this issue, looking for any useful information and reproductions (especially reproductions confirmed by multiple users). Reproduction was not possible.

Info requested:

  • Kernel: 6.2.0
  • Distribution: Ubuntu 23.04 (Vultr)
  • Network: Bridge with IPv4 + IPv6 IP addresses assigned to enp1s0
    • NOTE: enp1s0 is created and configured via cloud-init => netplan => systemd-networkd; each veth Docker creates on container start/stop triggers a cloud-init udev rule that recreates enp1s0 needlessly, which can reset/undo kernel settings like proxy_ndp and break the IPv6 GUA address on other containers. This may have affected some users that reported.
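
A quick way to check whether this interaction is resetting settings on a similar host is to compare the sysctl before and after churning a container (a sketch; enp1s0 is the interface name from this particular setup):

sysctl net.ipv6.conf.enp1s0.proxy_ndp
docker run --rm busybox true    # creates and removes a veth pair, triggering the udev rule
sysctl net.ipv6.conf.enp1s0.proxy_ndp
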
daemon.json
{
  "userland-proxy": false,
  "experimental": true,
  "ipv6": true,
  "fixed-cidr-v6": "fd00:feed:face:f001::/64"
}
docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.4
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.17.3
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 343
  Running: 50
  Paused: 0
  Stopped: 293
 Images: 5
 Server Version: 23.0.5
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 2806fc1057397dbaeefbea0e4e17bddfbd388f38
 runc version: v1.1.5-0-gf19387a
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.2.0-19-generic
 Operating System: Ubuntu 23.04
 OSType: linux
 Architecture: x86_64
 CPUs: 1
 Total Memory: 952.3MiB
 Name: ipv6-test
 ID: 6027f905-4930-4f94-bdad-1587cbcdf0ef
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Reproductions shared

None of these were reproducible for me, with userland-proxy: false and the IPv6 docker bridge enabled.

Presumably the issue has been resolved since (as some hint at, like the docker-stress comments, and kernel commits cited in early 2019).

Cherry-picked comments

Many other comments cited CentOS or similar systems with very dated kernels most of the time, or did not provide much helpful information. Another bulk appeared to be related to IPv6, and some with UDP / conntrack.

Various fixes related to networking (for IPv6 and UDP) have been made in both Docker and the kernel over this duration. Activity within the issue has also decreased significantly, implying the main causes have been resolved.

This issue has been solved by this commit: https://github.com/torvalds/linux/commit/ee60ad219f5c7c4fb2f047f88037770063ef785f, which can be applied to a running kernel using kpatch:

curl -SOL https://raw.githubusercontent.com/Aleishus/kdt/master/kpatchs/route.patch
kpatch-build -t vmlinux route.patch 
mkdir -p /var/lib/kpatch/${UNAME} 
cp -a livepatch-route.ko /var/lib/kpatch/${UNAME}
systemctl restart kpatch
kpatch list
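
Note that the snippet uses ${UNAME} without defining it; presumably it should be set to the running kernel release beforehand, e.g.:

UNAME=$(uname -r)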

This issue is still happening 😦 no updates or ideas on how to fix it?

@FrankYu that’s not helpful. To participate usefully in this thread, please provide an exact way to reproduce this issue, and please test on a modern kernel. 3.10 was released four years ago; we are discussing whether it is fully or partially fixed on a release from four days ago.

I’m running 4.12 (from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12/) and I still hit this, so torvalds/linux@d747a7a must not be the complete fix.

$ uname -r
4.12.0-041200-generic

Since upgrading from kernel 4.4.0 to 4.15.0 and Docker 1.11.2 to 18.09, the issue has disappeared.

In a sizeable fleet of VMs acting as docker hosts we had this issue appearing multiple times a day (with our Docker use-case). 45 days in and we are no longer seeing this.

For posterity, a stack-trace of a hung Docker 1.11.2 w/ printk’s showing unregister_netdevice: waiting for vethXXXXX (similar to what we were always seeing in our fleet, in hundreds of VMs) can be found at http://paste.ubuntu.com/p/6RgkpX352J/ (the interesting Container ref is 0xc820001980)

goroutine 8809 [syscall, 542 minutes, locked to thread]:
syscall.Syscall6(0x2c, 0xd, 0xc822f3d200, 0x20, 0x0, 0xc822f3d1d4, 0xc, 0x20, 0xc82435fda0, 0x10)
	/usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
syscall.sendto(0xd, 0xc822f3d200, 0x20, 0x20, 0x0, 0xc822f3d1d4, 0xc80000000c, 0x0, 0x0)
	/usr/local/go/src/syscall/zsyscall_linux_amd64.go:1729 +0x8c
syscall.Sendto(0xd, 0xc822f3d200, 0x20, 0x20, 0x0, 0x7faba31bded8, 0xc822f3d1c8, 0x0, 0x0)
	/usr/local/go/src/syscall/syscall_unix.go:258 +0xaf
github.com/vishvananda/netlink/nl.(*NetlinkSocket).Send(0xc822f3d1c0, 0xc82435fda0, 0x0, 0x0)
	/usr/src/docker/vendor/src/github.com/vishvananda/netlink/nl/nl_linux.go:333 +0xd4
github.com/vishvananda/netlink/nl.(*NetlinkRequest).Execute(0xc82435fda0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/src/docker/vendor/src/github.com/vishvananda/netlink/nl/nl_linux.go:215 +0x111
github.com/vishvananda/netlink.LinkDel(0x7fab9c2b15d8, 0xc825ef2240, 0x0, 0x0)
	/usr/src/docker/vendor/src/github.com/vishvananda/netlink/link_linux.go:615 +0x16b
github.com/docker/libnetwork/drivers/bridge.(*driver).DeleteEndpoint(0xc8204aac30, 0xc8203ae780, 0x40, 0xc826e7b800, 0x40, 0x0, 0x0)
	/usr/src/docker/vendor/src/github.com/docker/libnetwork/drivers/bridge/bridge.go:1060 +0x5cf
github.com/docker/libnetwork.(*endpoint).deleteEndpoint(0xc822945b00, 0xc82001ac00, 0x0, 0x0)
	/usr/src/docker/vendor/src/github.com/docker/libnetwork/endpoint.go:760 +0x261
github.com/docker/libnetwork.(*endpoint).Delete(0xc822945b00, 0x7fab9c2b0a00, 0x0, 0x0)
	/usr/src/docker/vendor/src/github.com/docker/libnetwork/endpoint.go:735 +0xbcb
github.com/docker/libnetwork.(*sandbox).delete(0xc8226bc780, 0xc8229f0600, 0x0, 0x0)
	/usr/src/docker/vendor/src/github.com/docker/libnetwork/sandbox.go:217 +0xd3f
github.com/docker/libnetwork.(*sandbox).Delete(0xc8226bc780, 0x0, 0x0)
	/usr/src/docker/vendor/src/github.com/docker/libnetwork/sandbox.go:175 +0x32
github.com/docker/docker/daemon.(*Daemon).releaseNetwork(0xc820001980, 0xc820e23a40)
	/usr/src/docker/.gopath/src/github.com/docker/docker/daemon/container_operations.go:732 +0x4f1
github.com/docker/docker/daemon.(*Daemon).Cleanup(0xc820001980, 0xc820e23a40)
	/usr/src/docker/.gopath/src/github.com/docker/docker/daemon/start.go:163 +0x62
github.com/docker/docker/daemon.(*Daemon).StateChanged(0xc820001980, 0xc825f9fac0, 0x40, 0xc824155b50, 0x4, 0x8900000000, 0x0, 0x0, 0x0, 0x0, ...)
	/usr/src/docker/.gopath/src/github.com/docker/docker/daemon/monitor.go:39 +0x60a
github.com/docker/docker/libcontainerd.(*container).handleEvent.func2()
	/usr/src/docker/.gopath/src/github.com/docker/docker/libcontainerd/container_linux.go:177 +0xa5
github.com/docker/docker/libcontainerd.(*queue).append.func1(0xc820073c01, 0xc820f9a2a0, 0xc821f3de20, 0xc822ddf9e0)
	/usr/src/docker/.gopath/src/github.com/docker/docker/libcontainerd/queue_linux.go:26 +0x47
created by github.com/docker/docker/libcontainerd.(*queue).append
	/usr/src/docker/.gopath/src/github.com/docker/docker/libcontainerd/queue_linux.go:28 +0x1da

From that we can observe that it hung in https://github.com/moby/moby/blob/v1.11.2/daemon/container_operations.go#L732

which points us to https://github.com/moby/moby/blob/v1.11.2/vendor/src/github.com/docker/libnetwork/sandbox.go#L175

And https://github.com/moby/moby/blob/v1.11.2/vendor/src/github.com/docker/libnetwork/endpoint.go#L760

Which goes into libnetwork bridge driver (check the awesome description) https://github.com/moby/moby/blob/v1.11.2/vendor/src/github.com/docker/libnetwork/drivers/bridge/bridge.go#L1057-L1061

Moving to netlink https://github.com/moby/moby/blob/v1.11.2/vendor/src/github.com/vishvananda/netlink/link_linux.go#L601-L617 https://github.com/moby/moby/blob/v1.11.2//vendor/src/github.com/vishvananda/netlink/nl/nl_linux.go#L215

And ultimately in that netlink socket, calls https://github.com/moby/moby/blob/v1.11.2/vendor/src/github.com/vishvananda/netlink/nl/nl_linux.go#L333

We feel that the bug generally happens when stopping a container: because SKBs are still referenced in the netns, the veth is not released, and Docker then issues a Kill to that container after 15s. The Docker daemon does not handle this situation gracefully, but ultimately the bug is in the kernel. We believe that https://github.com/torvalds/linux/commit/4ee806d51176ba7b8ff1efd81f271d7252e03a1d (accepted in 4.15 upstream) and the commits linked to it (there are several) act as a mitigation.

In general, that part of the kernel is not a pretty place.
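
A rough way to watch for the symptom described above right after stopping a container (the container name is a placeholder):

docker stop mycontainer
sleep 15    # roughly the forced-kill / teardown window mentioned above
dmesg | grep unregister_netdevice | tail -n 5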

I can confirm it also happens on:

Linux 3.10.0-693.17.1.el7.x86_64 Red Hat Enterprise Linux Server release 7.4 (Maipo)

I can reproduce it by doing “service docker restart” while having a certain amount of load.

Same on CentOS7 kernel 4.16.0-1.el7.elrepo.x86_64 and docker 18.03.0-ce

It worked for weeks before the crash, and when I tried to bring it back up, it got completely stuck.

The problem also happened with kernel 3.10.0-693.21.1.el7

That’s all well and good, but what exactly are the options that help? This problem is causing us issues in production, so I’d like to apply whatever workarounds are necessary to work around the kernel bug.

For any of those interested, we (Travis CI) are rolling out an upgrade to v4.8.7 on Ubuntu 14.04. Our load tests showed no occurrences of the error described here. Previously, we were running linux-image-generic-lts-xenial on Ubuntu 14.04. I’m planning to get a blog post published in the near future describing more of the details.

UPDATE: I should have mentioned that we are running this docker stack:

Client:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        Wed Oct 26 21:44:32 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   6b644ec
 Built:        Wed Oct 26 21:44:32 2016
 OS/Arch:      linux/amd64

UPDATE: We are still seeing this error in production on Ubuntu Trusty + kernel v4.8.7. We don’t yet know why these errors disappeared in staging load tests that previously reproduced the error, yet the error rate in production is effectively the same. Onward and upward. We have disabled “automatic implosion” based on this error given the high rate of instance turnover.

We’ve reproduced the same bug using a diagnostic kernel that had delays artificially inserted to make PMTU discovery exception routes hit this window.

  1. Debugging kernel patch:
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a0163c5..6b9e7ee 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -133,6 +133,8 @@

 static int ip_min_valid_pmtu __read_mostly	= IPV4_MIN_MTU;

+static int ref_leak_test;
+
/*
  *	Interface to generic destination cache.
  */
@@ -1599,6 +1601,9 @@ static void ip_del_fnhe(struct fib_nh *nh, __be32 daddr)
 	fnhe = rcu_dereference_protected(*fnhe_p, lockdep_is_held(&fnhe_lock));
	while (fnhe) {
 		if (fnhe->fnhe_daddr == daddr) {
+			if (ref_leak_test)
+				pr_info("XXX pid: %d, %s: fib_nh:%p, fnhe:%p, daddr:%x\n",
+					current->pid,  __func__, nh, fnhe, daddr);
 			rcu_assign_pointer(*fnhe_p, rcu_dereference_protected(
 				fnhe->fnhe_next, lockdep_is_held(&fnhe_lock)));
 			fnhe_flush_routes(fnhe);
@@ -2145,10 +2150,14 @@ static struct rtable *__mkroute_output(const struct fib_result *res,

		fnhe = find_exception(nh, fl4->daddr);
 		if (fnhe) {
+			if (ref_leak_test)
+				pr_info("XXX pid: %d, found fnhe :%p\n", current->pid, fnhe);
 			prth = &fnhe->fnhe_rth_output;
 			rth = rcu_dereference(*prth);
 			if (rth && rth->dst.expires &&
 			    time_after(jiffies, rth->dst.expires)) {
+				if (ref_leak_test)
+					pr_info("eXX pid: %d, del fnhe :%p\n", current->pid, fnhe);
				ip_del_fnhe(nh, fl4->daddr);
 				fnhe = NULL;
 			} else {
@@ -2204,6 +2213,14 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 #endif
 	}

+	if (fnhe && ref_leak_test) {
+		unsigned long  time_out;
+
+		time_out = jiffies + ref_leak_test;
+		while (time_before(jiffies, time_out))
+			cpu_relax();
+		pr_info("XXX pid: %d, reuse fnhe :%p\n", current->pid, fnhe);
+	}
 	rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
 	if (lwtunnel_output_redirect(rth->dst.lwtstate))
 		rth->dst.output = lwtunnel_output;
@@ -2733,6 +2750,13 @@ static int ipv4_sysctl_rtcache_flush(struct ctl_table *__ctl, int write,
		.proc_handler	= proc_dointvec,
	},
 	{
+		.procname	= "ref_leak_test",
+		.data		= &ref_leak_test,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
		.procname	= "max_size",
		.data		= &ip_rt_max_size,
 		.maxlen		= sizeof(int),
  2. User mode script:

ref_leak_test_begin.sh:

#!/bin/bash

# constructing a basic network with netns
# client <-->gateway <--> server
ip netns add svr
ip netns add gw
ip netns add cli

ip netns exec gw sysctl net.ipv4.ip_forward=1

ip link add svr-veth type veth peer name svrgw-veth
ip link add cli-veth type veth peer name cligw-veth

ip link set svr-veth netns svr
ip link set svrgw-veth netns gw
ip link set cligw-veth netns gw
ip link set cli-veth netns cli

ip netns exec svr ifconfig svr-veth 192.168.123.1
ip netns exec gw ifconfig svrgw-veth 192.168.123.254
ip netns exec gw ifconfig cligw-veth 10.0.123.254
ip netns exec cli ifconfig cli-veth 10.0.123.1

ip netns exec cli route add default gw 10.0.123.254
ip netns exec svr route add default gw 192.168.123.254

# constructing a concurrent-access scenario with netperf
nohup ip netns exec svr  netserver -L 192.168.123.1

nohup ip netns exec cli  netperf -H 192.168.123.1 -l 300 &
nohup ip netns exec cli  netperf -H 192.168.123.1 -l 300 &
nohup ip netns exec cli  netperf -H 192.168.123.1 -l 300 &
nohup ip netns exec cli  netperf -H 192.168.123.1 -l 300 &

# Add delay
echo 3000 > /proc/sys/net/ipv4/route/ref_leak_test

# making PMTU discovery exception routes
echo 1 >  /proc/sys/net/ipv4/route/mtu_expires
for((i=1;i<=60;i++));
do
  for j in 1400  1300 1100 1000
  do
	echo "set mtu to "$j;
	ip netns exec svr ifconfig  svr-veth  mtu $j;
	ip netns exec cli ifconfig  cli-veth  mtu $j;
	ip netns exec gw ifconfig svrgw-veth  mtu $j;
	ip netns exec gw ifconfig cligw-veth  mtu $j;
	sleep 2;
  done
done

ref_leak_test_end.sh:

#!/bin/bash

echo 0 > /proc/sys/net/ipv4/route/ref_leak_test

pkill netserver
pkill netperf

ip netns exec cli ifconfig cli-veth down
ip netns exec gw ifconfig svrgw-veth down
ip netns exec gw ifconfig cligw-veth down
ip netns exec svr ifconfig svr-veth down

ip netns del svr
ip netns del gw
ip netns del cli

The test process:

  • first load the debug kernel,
  • then run ref_leak_test_begin.sh,
  • wait a few seconds, run ref_leak_test_end.sh,
  • and finally you can observe the error.
[root@iZuf6h1kfgutxc3el68z2lZ test]# bash ref_leak_test_begin.sh
net.ipv4.ip_forward = 1
nohup: ignoring input and appending output to ‘nohup.out’
nohup: set mtu to 1400
appending output to ‘nohup.out’
nohup: appending output to ‘nohup.out’
nohup: appending output to ‘nohup.out’
nohup: appending output to ‘nohup.out’
set mtu to 1300
set mtu to 1100
set mtu to 1000
set mtu to 1400
set mtu to 1300
set mtu to 1100
^C
[root@iZuf6h1kfgutxc3el68z2lZ test]# bash ref_leak_test_end.sh
[root@iZuf6h1kfgutxc3el68z2lZ test]#
Message from syslogd@iZuf6h1kfgutxc3el68z2lZ at Nov  4 20:29:43 ...
 kernel:unregister_netdevice: waiting for cli-veth to become free. Usage count = 1

After some testing, torvalds/linux@ee60ad2 can indeed fix this bug.

Everyone, @etlweather is spot-on. Only post a “me too” if you have a reliable way of reproducing the issue. In that case, detail your procedure. A docker and kernel version isn’t enough and we get lots of notifications about it. The simpler your reproduction procedure, the better.

@rneugeba @redbaron Unfortunately the current “repro” I have is very hardware specific (all but proving this is a race condition). I haven’t tried getting a QEMU repro but that’s definitely the next step so multiple people can actually work on this and get the expected result (ideally in 1 CPU core setup). If someone already has one, please shoot me an email (it’s on my profile). I’ll thoroughly test it and post it here.

FYI: we’ve been running Linux 4.8.8 in conjunction with Docker v1.12.3 on a single production host. Uptime is presently at 5.5 days and the machine remains stable.

We occasionally see a handful of unregister_netdevice: waiting for lo to become free. Usage count = 1 messages in syslog, but unlike before, the kernel does not crash and the message goes away. I suspect that one of the other changes introduced either in the kernel or in Docker detects this condition and recovers from it. For us, this makes the message annoying, but no longer a critical bug.

I’m hoping some other folks can confirm the above on their production fleets.

@gtirloni - can you clarify if your 4.8.8/1.12.3 machine crashed or if you just saw the message?

Thank you, in advance, to everyone who has been working on reproducing/providing useful information to triangulate this thing.

@pumba-lt We had this issue about 1.5yrs ago, about 1yr ago I disabled ipv6 at the kernel level (not sysctl) and haven’t had the issue once. Running a cluster of 48 blades.

Normally in: /etc/default/grub GRUB_CMDLINE_LINUX="xxxxx ipv6.disable=1"

However, I use PXE boot, so my PXE config has:

      DEFAULT menu.c32
      prompt 0
      timeout 50
      MENU TITLE PXE Boot
      label coreos
              menu label CoreOS
              kernel mykernel
              append initrd=myimage ipv6.disable=1 elevator=deadline cloud-config-url=myurl

I assure you, you will not see this issue again.
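
For the non-PXE case mentioned above, a minimal sketch of applying and verifying the kernel parameter (assuming a GRUB-based Ubuntu/Debian host; RHEL/CentOS would use grub2-mkconfig instead of update-grub):

# after adding ipv6.disable=1 to GRUB_CMDLINE_LINUX in /etc/default/grub:
update-grub        # or: grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
# after the reboot, confirm the parameter took effect:
grep -o 'ipv6.disable=1' /proc/cmdline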

This information might be relevant.

We are able to reproduce the problem unregister_netdevice: waiting for lo to become free. Usage count = 1 with a 4.14.0-rc3 kernel with CONFIG_PREEMPT_NONE=y, running on only one CPU, with the following kernel boot options:

BOOT_IMAGE=/boot/vmlinuz-4.14.0-rc3 root=/dev/mapper/vg0-root ro quiet vsyscall=emulate nosmp

Once we hit this state, it stays in this state and a reboot is needed. No more containers can be spawned. We reproduce it by running images doing ipsec/openvpn connections plus downloading a small file inside the tunnels. Then the instances exit (usually they run < 10s). We run tens of such containers a minute on one machine. With the above-mentioned settings (only 1 CPU), the machine hits it in ~2 hours.

Another reproducer with the same kernel, but without limiting the number of CPUs, is to just run iperf in UDP mode for 3 seconds inside the container (so there is no TCP communication at all). If we run 10 of such containers in parallel, wait for all of them to finish and do it again, we hit the trouble in less than 10 minutes (on a 40-core machine).

In both of our reproducers, we added “ip route flush table all; ifconfig <iface> down; sleep 10” before exiting from the containers. It does not seem to have any effect.
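
As a rough, unverified sketch of the second (iperf/UDP) reproducer described above (the image name "iperf-img" and the server address are placeholders; any image with iperf installed and a reachable iperf UDP server will do):

#!/bin/sh
IPERF_SERVER=10.0.0.1
while true; do
    # start 10 short-lived UDP clients in parallel, as described above
    for i in $(seq 10); do
        docker run --rm -d iperf-img iperf -u -c "$IPERF_SERVER" -t 3
    done
    sleep 10    # let the 3-second clients finish and their netns be torn down
    dmesg | tail -n 5 | grep unregister_netdevice && break
done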

I’ve got same issues. Docker version 1.13.1, build 092cba3 Linux debian 4.8.6-x86_64-linode78

AFAICT, this is a locking issue in the network namespaces subsystem of Linux kernel. This bug has been reported over a year ago, with no reply: https://bugzilla.kernel.org/show_bug.cgi?id=97811 There has been some work on this (see here: http://www.spinics.net/lists/netdev/msg351337.html) but it seems it’s not a complete fix.

I’ve tried pinging the network subsystem maintainer directly, with no response. FWIW, I can reproduce the issue in a matter of minutes.

We got a huge pain with this issue on our CoreOS cluster. Could anyone tell when it will be finally fixed? We dream about this moment when we can sleep at night.

Just wait for a few months and someone will come up complaining about the 4.19 kernel too. Just history repeating itself.

@thaJeztah, perhaps you should add your comment to the top of the original post, as people are still ignoring it.

Hi,

Just to add to the fire we are also seeing this problem, as requested here are the following…

Kernel Version: Linux exe-v3-worker 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u5 (2017-09-19) x86_64 GNU/Linux

Linux distribution/version: Debian 9.1 (with all packages up to date)

Are you on the latest kernel version of your Linux vendor? Yes

Network setup (bridge, overlay, IPv4, IPv6, etc): IPv4 only, NATed as per default Docker setup

Description of the workload (what type of containers, what type of network load, etc): Very short lived containers (from a few seconds to a few minutes) running scripts before exiting.

And ideally a simple reproduction:

**kernel:[617624.412100] unregister_netdevice: waiting for lo to become free. Usage count = 1

Couldn’t kill old container or start new ones on the nodes affected, had to reboot to restore functionality.**

Hopefully we find a root cause / patch soon.

Best Regards,

robputt796

The netdev@vger thread is here https://www.mail-archive.com/netdev@vger.kernel.org/msg179703.html if anyone wants to follow progress.

@piec yes, that’s expected.

I also hit this on centos 7.3 with host kernel 3.10.0-514.6.1.el7.x86_64 and docker-ce-17.06.0.ce-1.el7.centos.x86_64.

  1. start container (an internal tornado-based web service – I’m trying to extract out a minimal example that still hits this)
  2. make a request to web service running in container
  3. wait for response
  4. kill container

I spent some time stripping the container down and it turns out that the web service had nothing to do with the bug. What seems to trigger this in my case is mounting an NFS share inside a container (running with --privileged).

On my desktop, i can reliably reproduce simply running the following a few times:

$ docker run -it --rm --privileged alpine:latest /bin/mount -o nolock -o async -o vers=3 <my-nfs-server>:/export/foo /mnt

@rn I managed to narrow this down to a specific container in our test suite, and was able to reproduce with the following steps:

  1. start container (an internal tornado-based web service – I’m trying to extract out a minimal example that still hits this)
  2. make a request to web service running in container
  3. wait for response
  4. kill container

After 3 or 4 iterations of this I end up getting waiting for lo to become free, and on the next iteration docker run fails with docker: Error response from daemon: containerd: container did not start before the specified timeout.

do your containers have a lot of network activity? If so, which direction is predominant?

A pretty small amount. In the steps mentioned above, the HTTP request is a small amount of JSON, and the response is a binary blob that’s around 10 MB.

What sort of machine are you running this on (number of cores, is it a VM, etc.)?

This is on a 4-core desktop machine (no VM)

Do you create a lot of containers concurrently?

No, everything is done serially.

Do your containers exit normally or do they crash?

They’re stopped with docker stop

I’ve seen this issue within hours on high load systems with and without --userland-proxy=false

centos 7.2 still has this problem: kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

You can try to upgrade to the latest 4.19+ kernel.

We used kernel version 4.14.62 for a few months, and this issue disappeared.

@rn

If a user of a network device never decrements the reference count, some other part of the kernel will determine that the task waiting for the cleanup is stuck and it will crash. It is only this crash which indicates a kernel bug (some user, via some code path, did not decrement the reference count). There have been several such bugs and they have been fixed in modern kernels (and possibly backported to older ones). I have written quite a few stress tests (and continue writing them) to trigger such crashes, but have not been able to reproduce them on modern kernels (I do, however, still see the above message).

Please only report on this issue if your kernel actually crashes …

We are having a slightly different issue in our environment that I am hoping to get some clarification on (kernel 3.16.0-77-generic, Ubuntu 14.04, docker 1.12.3-0~trusty). We have thousands of hosts running docker, 2-3 containers per host, and we are seeing this on < 1% of the total hosts running docker.

We actually never see the kernel crash, but instead (like the original reporters as far as I can tell) the dockerd process is defunct. Upstart (using the /etc/init/docker.conf job from the upstream package) will not start a new dockerd process because it thinks it is already running (start: Job is already running: docker), and attempting to stop the upstart job also fails (docker start/killed, process <pid of defunct process>).

$ ps -ely
S   UID   PID  PPID  C PRI  NI   RSS    SZ WCHAN  TTY          TIME CMD
...
Z     0 28107     1  0  80   0     0     0 -      ?        00:18:05 dockerd <defunct>

Since we mostly run with bridge networking (on a custom bridge device) in dmesg we see a slightly different message referring to the virtual interface:

[7895942.484851] unregister_netdevice: waiting for vethb40dfbc to become free. Usage count = 1
[7895952.564852] unregister_netdevice: waiting for vethb40dfbc to become free. Usage count = 1
[7895962.656984] unregister_netdevice: waiting for vethb40dfbc to become free. Usage count = 1

Because upstart seems to refuse to restart dockerd or recognize that the previously running process is a zombie, the only solution we have found is to reboot the host.

While our outcome seems different (the kernel does not crash), the root cause sounds the same or similar. Is this not the same issue then? Is there any known workaround or way to have the docker upstart job become runnable again when this occurs?

Does anyone have a suggested workaround for this? I tried enabling --userland-proxy=true and docker still hangs after a while. It appears Kubernetes has a solution, from what @thockin wrote above, but it’s not clear what --hairpin-mode=promiscuous-bridge exactly does and how to configure that on a plain-jane Ubuntu 16.x docker install.

I think I know why some dockerized Node.js apps could cause this issue. Node uses keep-alive connections by default. When server.close() is used, the server stops accepting new connections, but currently active connections such as websockets or HTTP keep-alive connections are still maintained. When the dockerized app is also scaled to n instances, this can result in waiting for lo to become free, because when the app is forcibly terminated, lo is never freed. When Docker redistributes the app to another node, or the app is scaled down, Docker sends a signal to the app telling it to shut down. The app can listen for this signal and react; if the app hasn’t shut down after a few seconds, Docker terminates it without hesitation. I added signal handlers and found out that when using server.close(), the server isn’t fully terminated but “only” stops accepting new connections (see https://github.com/nodejs/node/issues/2642). So we need to make sure that open connections such as websockets or HTTP keep-alive connections are also closed.

How to handle websockets: The Node.js app emits closeSockets to all websockets when a shutdown signal is received. The client listens for this closeSockets event and calls sockets.disconnect() and, shortly after, sockets.connect(). Remember that server.close() was called, so this instance doesn’t accept new requests. When other instances of this dockerized app are running, the load balancer inside Docker will eventually pick an instance which isn’t shutting down, and a successful connection is established. The instance which should shut down won’t have any open websocket connections left.

var gracefulTermination = function() {
    // we don't want to kill everything without telling the clients that this instance stops
    // server.close() sets the server to a state in which it doesn't allow new connections
    // but the old connections (websockets) are still open and can be used
    server.close(function() {
        // this callback is invoked when the server terminates
        console.log('close bknd');
        process.exit();
    });

    // iterate through all open websockets and emit 'closeSockets' to the clients.
    // Clients will then call disconnect() and connect() on their side to establish new connections
    // to other instances of this scaled app
    Object.keys(server.socketIoObj.sockets.sockets).forEach(function(id) {
        console.log("WebSocket ID:", id, " will be closed from the client.");
        server.socketIoObj.to(id).emit('closeSockets');
    });
};

process.on("SIGINT", function() {
    console.log('CLOSING [SIGINT]');
    gracefulTermination();
});
...

How to handle keep-alive HTTP connections: Currently I don’t know how this can be done perfectly. The easiest way is to disable keep-alive.

app.use(function (req, res, next) {
    res.setHeader('Connection', 'close');
    next();
});

Another possibility is to set the keep-alive timeout to a very low number. For example 0.5 seconds.

app.use(function (req, res, next) {
    res.setTimeout(500);
    next();
});

Hope this could help others 😃

I have a bugzilla ticket open with Redhat about this.

Some developments: Red Hat put the IPV6 refcount leak patches from mainline on QA, looks like they’re queued up for RHEL 7.4 and may be backported to 7.3. Should be on CentOS-plus soon too. Note: This patch only fixes issues in SOME cases. If you have a 4.x kernel it’s a moot point since they’re already there.

This is definitely a race condition in the kernel from what I can tell, which makes it really annoying to find. I’ve taken a snapshot of the current mainline kernel and am working on instrumenting the various calls, starting with the IPv6 subsystem. The issue is definitely reproducible now: it looks like all you have to do is create a bunch of containers, push a ton of network traffic from them, crash the program inside the containers, and remove them. Doing this over and over triggers the issue in minutes, tops, on a physical 4-core workstation.

Unfortunately, I don’t have a lot of time to work on this: if there are kernel developers here who are willing to collaborate on instrumenting the necessary pieces I think we can set up a fork and start work on hunting this down step by step.

I have 9 docker hosts all nearly identical, and only experience this on some of them. It may be coincidence, but one thing in common I’ve noticed is that I only seem to have this problem when running containers that do not handle SIGINT. When I docker stop these containers, it hangs for 10s and then kills the container ungracefully.

It takes several days before the issue presents itself, and it seems to show up randomly, not just immediately after running docker stop. This is mostly anecdotal, but maybe it will help someone.

I think my colleague has fixed this recently: http://www.spinics.net/lists/netdev/msg393441.html. We encountered this problem in our environment, found the issue, and with this fix we never hit the problem any more. If you have encountered this problem, could you try this patch and see if it happens again? From our analysis it is related to IPv6, so you can also try disabling IPv6 in Docker with --ipv6=false when starting the Docker daemon.
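
For reference, a sketch of what the daemon-level equivalent could look like via /etc/docker/daemon.json instead of the --ipv6=false flag (the heredoc below would overwrite any existing daemon.json, and IPv6 is already off by default on the standard bridge):

cat <<'EOF' > /etc/docker/daemon.json
{
  "ipv6": false
}
EOF
systemctl restart docker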

…sensitive information.

😮

Same as the above comments; also running on EC2, in this case via Elastic Beanstalk, using 64-bit Amazon Linux 2016.03 v2.1.0 running Docker 1.9.1.

This is still happening for me on kernel Ubuntu 5.8.0-41.46-generic 5.8.18

kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

Any news here?

This patch has fixed this problem:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ee60ad219f5c7c4fb2f047f88037770063ef785f

We have analyzed in the following link, and this problem could also be reproduced: https://github.com/w-simon/kernel_debug_notes/blob/master/reproduce_a_leaked_dst_entry

That message itself is not the bug; it’s the kernel crashing afterwards; https://github.com/moby/moby/issues/5618#issuecomment-407751991

centos 7.2 still has this problem: kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

Same on CentOS7.5 kernel 3.10.0-693.el7.x86_64 and docker 1.13.1

Same on Centos7 kernel 3.10.0-514.21.1.el7.x86_64 and docker 18.03.0-ce

I can confirm our issues were solved after disabling IPv6 at boot (from grub’s config file). We had numerous issues in a 7-node cluster; it runs smoothly now.

I don’t remember where I found the solution, or whether I found it myself; anyway, thanks @qrpike for suggesting this to others 😃 !!

Sorry to spoil the celebration, but we were able to reproduce the issue. We are now working with @ddstreet on it at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/ .

There is actually a much simpler repro of this issue (which, BTW is not the original issue).

This script just starts an SMB server on the host and then creates a network namespace with a veth pair, executes mount; ls; unmount in the network namespace, and then removes the network namespace.

apk add --no-cache iproute2 samba samba-common-tools cifs-utils

# SMB server setup
cat <<EOF > /etc/samba/smb.conf
[global]
    workgroup = WORKGROUP
    netbios name = FOO
    passdb backend = tdbsam
    security = user
    guest account = nobody
    strict locking = no
    min protocol = SMB2
[public]
    path = /share
    browsable = yes
    read only = no
    guest ok = yes
    browseable = yes
    create mask = 777
EOF
adduser -D -G nobody nobody && smbpasswd -a -n nobody
mkdir /share && chmod ugo+rwx /share && touch /share/foo
chown -R nobody.nobody /share

# Bring up a veth pair
ip link add hdev type veth peer name nsdev
ip addr add 10.0.0.1/24 dev hdev
ip link set hdev up

# Start SMB server and sleep for it to serve
smbd -D; sleep 5

# Client setup
ip netns add client-ns
ip link set nsdev netns client-ns
ip netns exec client-ns ip addr add 10.0.0.2/24 dev nsdev
ip netns exec client-ns ip link set lo up
ip netns exec client-ns ip link set nsdev up
sleep 1 # wait for the devices to come up

# Execute (mount, ls, unmount) in the network namespace and a new mount namespace
ip netns exec client-ns unshare --mount \
    /bin/sh -c 'mount.cifs //10.0.0.1/public /mnt -o vers=3.0,guest; ls /mnt; umount /mnt'

# Delete the client network namespace.
ip netns del client-ns

# Create a new network namespace
# This will stall for up to 200s
ip netns add new-netns

Note that adding a simple sleep 1 after the unmount, either when executing in the namespace or before deleting the network namespace, avoids the stall entirely when creating the new namespace. A sleep after the old namespace is deleted does not reduce the stalling.
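
For clarity, the workaround described above amounts to changing the mount/unmount step of the script to something like:

# same as before, but sleep briefly inside the namespace after the unmount
ip netns exec client-ns unshare --mount \
    /bin/sh -c 'mount.cifs //10.0.0.1/public /mnt -o vers=3.0,guest; ls /mnt; umount /mnt; sleep 1'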

@piec I also tested this with your repro and a sleep 1 in the Dockerfile after the unmount and everything works as expected, no stalling, no unregister_netdev messages.

I’ll write this up now and send to netdev@vger

Here is the CentOS 7.3 workaround that fixed it for me:

yum --enablerepo=centosplus install kernel-plus
egrep ^menuentry /etc/grub2.cfg | cut -f 2 -d \'
grub2-set-default 0
reboot

Here is the patch that solves it: https://bugs.centos.org/view.php?id=12711&nbn=1

UPDATE: This turned out not to solve the problem permanently. It showed up again several hours later with the following wall message: kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

Hey all, just to be clear, all the “kubernetes workaround” does is enable promiscuous mode on the underlying bridge. You can achieve the same thing with ip link set <bridgename> promisc on using iproute2. It decreases the probability of running into the bug but may not eliminate it altogether.

Now, in theory this shouldn’t work… but for some reason promiscuous mode seems to make the device teardown just slow enough that you don’t get a race to decrement the ref counter. Perhaps one of the Kubernetes contributors can chime in here if they’re on this thread.

I can verify the workaround (NOT FIX) works using my environment-specific repro. I can’t really verify it helps if you’re using the IPVLAN or MACVLAN drivers (we use macvlan in prod) because it seems very difficult to get those setups to produce this bug. Can anyone else with a repro attempt to verify the workaround?
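
For anyone wanting to try it, a minimal sketch of applying and checking the workaround on the default docker0 bridge (the bridge name may differ in your setup):

ip link set docker0 promisc on
ip link show docker0    # the flags should now include PROMISC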

I saw the same kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1 message on my CentOS 7 machine running a docker image of Jenkins. The CentOS 7 machine I was using was current with all the latest CentOS 7 patches as of approximately 20 Dec 2016.

Since the most recent references here seem to be CentOS based, I’ll switch my execution host to a Ubuntu or a Debian machine.

I am running Docker version 1.12.5, build 7392c3b on that CentOS 7 machine. Docker did not hang, but the Jenkins process I was running in Docker was killed when that message appeared.

Thanks so much for Docker! I use it all the time, and am deeply grateful for your work on it!

I am actually seeing this on Amazon Linux in an ECS cluster - the message occasionally throws but it doesn’t lock up, like reshen’s seeing now. Docker 1.11.2. Uname reports “4.4.14-24.50.amzn1.x86_64” as the version.

It would be really helpful if those of you who can reproduce the issue reliably in an environment where a crashdump is possible (i.e. not EC2) could in fact share the crashdump file. More information about how to enable kdump on Ubuntu Trusty can be found here, and these are the crash options you need to enable once kdump is ready to generate a crashdump:

echo 1 > /proc/sys/kernel/hung_task_panic          # panic when hung task is detected
echo 1 > /proc/sys/kernel/panic_on_io_nmi          # panic on NMIs from I/O
echo 1 > /proc/sys/kernel/panic_on_oops            # panic on oops or kernel bug detection
echo 1 > /proc/sys/kernel/panic_on_unrecovered_nmi # panic on NMIs from memory or unknown
echo 1 > /proc/sys/kernel/softlockup_panic         # panic when soft lockups are detected
echo 1 > /proc/sys/vm/panic_on_oom                 # panic when out-of-memory happens

The crashdump can really help kernel developers find out more about what is causing the reference leak, but keep in mind that a crashdump also includes a memory dump of your host and may contain sensitive information.
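
If it helps, the same settings can presumably be made persistent across reboots with a sysctl.d drop-in, using the equivalent keys for the /proc paths above:

cat <<'EOF' > /etc/sysctl.d/99-panic-on-hang.conf
kernel.hung_task_panic = 1
kernel.panic_on_io_nmi = 1
kernel.panic_on_oops = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.softlockup_panic = 1
vm.panic_on_oom = 1
EOF
sysctl --system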

Yes, this issue happens on bare metal for any kernel >= 4.3. Have seen this on a lot of different machines and hardware configurations. Only solution for us was to use kernel 4.2.

I have a “one liner” that will eventually reproduce this issue for me on an EC2 (m4.large) running CoreOS 1068.3.0 with the 4.6.3 kernel (so very recent). For me, it takes about 300 iterations but YMMV.

Linux ip-172-31-58-11.ec2.internal 4.6.3-coreos #2 SMP Sat Jun 25 00:59:14 UTC 2016 x86_64 Intel® Xeon® CPU E5-2676 v3 @ 2.40GHz GenuineIntel GNU/Linux CoreOS beta (1068.3.0) Docker version 1.10.3, build 3cd164c

A few hundred iterations of the loop here will eventually hang dockerd and the kernel will be emitting error messages like

kernel: unregister_netdevice: waiting for veth8c7d525 to become free. Usage count = 1

The reproducer loop is

i=0; while echo $i && docker run --rm -p 8080 busybox /bin/true && docker ps; do sleep 0.05; ((i+=1)); done

EDITS

  • I’ve only been able to reproduce with this when userland-proxy=false
  • I’ve NOT been able to reproduce using VirtualBox (only ec2 so far) so maybe it’s related to hypervisor too

I’m getting this on Ubuntu 14.10 running a 3.18.1 kernel. The kernel log shows

Dec 21 22:49:31 inotmac kernel: [15225.866600] unregister_netdevice: waiting for lo to become free. Usage count = 2
Dec 21 22:49:40 inotmac kernel: [15235.179263] INFO: task docker:19599 blocked for more than 120 seconds.
Dec 21 22:49:40 inotmac kernel: [15235.179268]       Tainted: G           OE  3.18.1-031801-generic #201412170637
Dec 21 22:49:40 inotmac kernel: [15235.179269] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 21 22:49:40 inotmac kernel: [15235.179271] docker          D 0000000000000001     0 19599      1 0x00000000
Dec 21 22:49:40 inotmac kernel: [15235.179275]  ffff8802082abcc0 0000000000000086 ffff880235c3b700 00000000ffffffff
Dec 21 22:49:40 inotmac kernel: [15235.179277]  ffff8802082abfd8 0000000000013640 ffff8800288f2300 0000000000013640
Dec 21 22:49:40 inotmac kernel: [15235.179280]  ffff880232cf0000 ffff8801a467c600 ffffffff81f9d4b8 ffffffff81cd9c60
Dec 21 22:49:40 inotmac kernel: [15235.179282] Call Trace:
Dec 21 22:49:40 inotmac kernel: [15235.179289]  [<ffffffff817af549>] schedule+0x29/0x70
Dec 21 22:49:40 inotmac kernel: [15235.179292]  [<ffffffff817af88e>] schedule_preempt_disabled+0xe/0x10
Dec 21 22:49:40 inotmac kernel: [15235.179296]  [<ffffffff817b1545>] __mutex_lock_slowpath+0x95/0x100
Dec 21 22:49:40 inotmac kernel: [15235.179299]  [<ffffffff8168d5c9>] ? copy_net_ns+0x69/0x150
Dec 21 22:49:40 inotmac kernel: [15235.179302]  [<ffffffff817b15d3>] mutex_lock+0x23/0x37
Dec 21 22:49:40 inotmac kernel: [15235.179305]  [<ffffffff8168d5f8>] copy_net_ns+0x98/0x150
Dec 21 22:49:40 inotmac kernel: [15235.179308]  [<ffffffff810941f1>] create_new_namespaces+0x101/0x1b0
Dec 21 22:49:40 inotmac kernel: [15235.179311]  [<ffffffff8109432b>] copy_namespaces+0x8b/0xa0
Dec 21 22:49:40 inotmac kernel: [15235.179315]  [<ffffffff81073458>] copy_process.part.28+0x828/0xed0
Dec 21 22:49:40 inotmac kernel: [15235.179318]  [<ffffffff811f157f>] ? get_empty_filp+0xcf/0x1c0
Dec 21 22:49:40 inotmac kernel: [15235.179320]  [<ffffffff81073b80>] copy_process+0x80/0x90
Dec 21 22:49:40 inotmac kernel: [15235.179323]  [<ffffffff81073ca2>] do_fork+0x62/0x280
Dec 21 22:49:40 inotmac kernel: [15235.179326]  [<ffffffff8120cfc0>] ? get_unused_fd_flags+0x30/0x40
Dec 21 22:49:40 inotmac kernel: [15235.179329]  [<ffffffff8120d028>] ? __fd_install+0x58/0x70
Dec 21 22:49:40 inotmac kernel: [15235.179331]  [<ffffffff81073f46>] SyS_clone+0x16/0x20
Dec 21 22:49:40 inotmac kernel: [15235.179334]  [<ffffffff817b3ab9>] stub_clone+0x69/0x90
Dec 21 22:49:40 inotmac kernel: [15235.179336]  [<ffffffff817b376d>] ? system_call_fastpath+0x16/0x1b
Dec 21 22:49:41 inotmac kernel: [15235.950976] unregister_netdevice: waiting for lo to become free. Usage count = 2
Dec 21 22:49:51 inotmac kernel: [15246.059346] unregister_netdevice: waiting for lo to become free. Usage count = 2

I’ll send docker version/info once the system isn’t frozen anymore 😃

I finally found out how to suppress these messages btw. From this question on StackExchange, I commented out this line in /etc/rsyslog.conf:

# Everybody gets emergency messages
#*.emerg                    :omusrmsg:*

Very nuclear option, but at least now my system is usable again!

@Beatlor RHEL did not help us at all. A stable production environment is more important than some worthless support contract. We are still running very stably now on 4.17.11. No big issues anymore.

Yes, I also have not had this problem since upgrading the kernel to 4.17.0-1.el7.elrepo.x86_64. I tried this before (4.4.x, 4.8, 4.14…) and it had failed. It seems that the problem does not occur again on 4.17+ kernels.

For what it’s worth… we upgraded the RHEL Linux kernel from 3.10.0 to 4.17.11 (running a Kubernetes cluster on it). Before upgrading, this bug was occurring several times a day on different servers. We have been running with the upgrade for three weeks now and the bug occurred only once, so roughly speaking it is reduced by 99%.

Are you guys running docker under any limits? Like ulimits, cgroups etc…

Newer systemd has a default limit even if you didn’t set one. I set things to unlimited and the issue hasn’t occurred since (I have been watching for 31 days).

Ok, this bug still occurs, but probability has reduced.

I think if the containers are gracefully stopped (PID 1 exit()s), then this bug will not bother us.

$ docker network ls
NETWORK ID          NAME                     DRIVER              SCOPE
b3fc47abfff2        bridge                   bridge              local
f9474559ede8        dockerfile_cluster_net   bridge              local
ef999de68a96        host                     host                local
e7b41d23674c        none                     null                local
$ docker network rm f9474559ede8 

fixed it.

same issue on CentOS 7

  • kernel: 4.11.1-1.el7.elrepo.x86_64
  • docker: 17.12.0-ce

Centos 7.5 with 4.17.3-1 kernel, still got the issue.

Env : kubernetes 1.10.4 Docker 13.1 with Flannel network plugin.

Log:

[ 89.790907] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 89.798523] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 89.799623] cni0: port 8(vethb8a93c6f) entered blocking state
[ 89.800547] cni0: port 8(vethb8a93c6f) entered disabled state
[ 89.801471] device vethb8a93c6f entered promiscuous mode
[ 89.802323] cni0: port 8(vethb8a93c6f) entered blocking state
[ 89.803200] cni0: port 8(vethb8a93c6f) entered forwarding state

kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

Now: the node IP is reachable, but no network services (like ssh) can be used…

@bcdonadio I can reproduce the bug on my system by running this test script once per hour from cron:

#!/bin/sh

TAG=`date +%F_%H_%M_%S_UTC`
docker pull centos:centos6
docker run --rm adrianotto/centos6 yum check-update -q > package_updates.txt
LINES=`wc -l < package_updates.txt`

if [ $LINES -eq 0 ] ; then
        rm -f package_updates.txt
        echo "No packages need to be updated"
        exit 0
fi

docker run --rm adrianotto/centos6 rpm -a -q > old_packages.txt
docker build -t temp:$TAG .
docker run --rm temp:$TAG rpm -a -q > new_packages.txt
docker rmi temp:$TAG

This script is just producing a package list using an image in the Docker registry, and another using one that’s built locally so I can compare them. The Dockerfile is just this:

FROM centos:centos6
MAINTAINER Adrian Otto
RUN yum clean all && yum update -y && yum clean all

2-4 minutes later syslog gets this message:

Message from syslogd@docker at May 27 16:51:55 ...
 kernel:unregister_netdevice: waiting for lo to become free. Usage count = 0

The last occurrence happened a few minutes after I ran the script manually. My guess is that the error condition is raised after some timeout elapses following the attempted container delete.

I’m certain the error condition is intermittent, because the script above runs as a cron job at :00 past each hour. Here is a sample of the error output that syslog recorded:

May 26 01:02:44 docker kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
May 26 02:02:22 docker kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
May 26 02:02:32 docker kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
May 26 03:02:18 docker kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
May 26 03:02:28 docker kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
May 26 03:02:38 docker kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
May 26 04:03:14 docker kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
May 26 05:02:25 docker kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
May 26 05:02:35 docker kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
May 26 06:03:31 docker kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
May 26 06:03:41 docker kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
May 26 06:03:51 docker kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
May 26 06:04:02 docker kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1
May 26 09:03:04 docker kernel: unregister_netdevice: waiting for lo to become free. Usage count = 1

So it happens somewhere in the range of 2 to 4 minutes after the containers run and exit and are deleted by docker because of the --rm flag. Also notice from the log above that there is not an error for every container that’s run/deleted, but it’s pretty consistent.

This is reproducible on Linux containerhost1 4.9.0-0.bpo.2-amd64 #1 SMP Debian 4.9.18-1~bpo8+1 (2017-04-10) x86_64 GNU/Linux with Docker version 17.04.0-ce, build 4845c56, running in privileged mode when we have cifs mounts open. When the container stops with mounts open, Docker becomes unresponsive and we get the kernel:[ 1129.675495] unregister_netdevice: waiting for lo to become free. Usage count = 1 error.

@stayclassychicago before I tried the 3.10.0-514.10.2.el7.centos.plus.x86_64 kernel I was getting the kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1 very regularly, nearly every time I ran a container with docker run --rm ... when the container exited. After the kernel upgrade and reboot, it completely stopped for many hours, and then came back again. Now half the time I delete containers it works properly, where it used to error every time. I don’t know for sure if the new kernel is helping, but it doesn’t hurt.

If someone from Docker has time to try the Kubernetes workaround, please let me know and we can point you at it. I am unable to extract the changes and patch them into Docker myself, right now.


We’re getting this in GCE pretty frequently. Docker freezes and the machine hangs on reboot.

[782935.982038] unregister_netdevice: waiting for vethecf4912 to become free. Usage count = 17

The container is running a go application, and has hairpin nat configured.

Docker:

matthew@worker-1:~$ docker version
Client:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Tue Jan 10 20:38:45 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Tue Jan 10 20:38:45 2017
 OS/Arch:      linux/amd64

Ubuntu 16.04 LTS,

matthew@worker-1:~$ uname -a
Linux worker-1 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Linux backup 4.6.0-040600-generic #201606100558 SMP Fri Jun 10 10:01:15 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Server Version: 1.13.1

Same issue. I’m using mount in a privileged container. After 4-5 runs it freezes. I also have the same issue with the latest standard kernel for 16.04.

@r-BenDoan if you try to stop a container but it doesn’t respond to SIGINT, docker will wait 10 seconds and then kill the container ungracefully. I encountered that behavior in my nodejs containers until I added signal handling. If you see a container taking 10s to stop, it likely isn’t handling signals and is more likely to trigger this issue.

Make sure your containers can stop gracefully.
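
A quick, rough way to tell whether a container is shutting down gracefully (the container name is a placeholder): if the command below takes about 10 seconds (Docker's default grace period), PID 1 in the container is likely ignoring SIGTERM and getting SIGKILLed.

time docker stop mycontainer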

This might not be a purely docker issue - I’m getting it on a Proxmox server where I’m only running vanilla LXC containers (ubuntu 16.04).

Still happens with 4.9.0-0.bpo.1-amd64 on debian jessie with docker 1.13.1. Is there any kernel - os combination which is stable?

Just an observation - there seem to be different problems at play (as has been said before).

  • Some people note “waiting for lo…”
  • some have noted “waiting for eth0”
  • some have noted “waiting for veth???”
  • On RedHat bug tracking, there is talk about “waiting for ppp0”

Some have noted logs alternating between several of the above, while others see only one of them.

There is also a similar bug logged on Ubuntu. On this one, they seem to find that NFS is the problem.

I have this on two CentOS systems, latest updates on at least one of them.

$ uname -r
3.10.0-514.2.2.el7.x86_64
$ docker -v
Docker version 1.12.6, build 78d1802

@LK4D4 Oh, totally, just want to see those timeouts 😉

@RRAlex Do you know if anyone has reached out to the Ubuntu kernel team regarding a backport? We have a large production Docker deployment on an Ubuntu cluster that’s affected by the bug.

Ubuntu kernel team mailing list: https://lists.ubuntu.com/archives/kernel-team/2016-September/thread.html

Patch for the stable kernel: https://github.com/torvalds/linux/commit/751eb6b6042a596b0080967c1a529a9fe98dac1d

Ubuntu kernel commit log: http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/log/?h=master-next (Patch is not there yet)

coolljt0725’s colleague’s patch has been queued for stable, so hopefully it’ll be backported into distros soon enough. (David Miller’s post: http://www.spinics.net/lists/netdev/msg393688.html )

Not sure where that commit is though and if we should send it to Ubuntu, RH, etc. to help them track & backport it?

Going to show up here at some point I guess: http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/tree/net/ipv6/addrconf.c

EDIT: seems to be present here: https://github.com/torvalds/linux/blob/master/net/ipv6/addrconf.c

@RRAlex This is not specific to any docker version. If you are using --userland-proxy=false in the daemon options… or (from what I understand) you are using Kubernetes, you will likely hit this issue.

The reason is that the --userland-proxy=false option enables hairpin NAT on the bridge interface… this is something that Kubernetes also sets when it sets up the networking for its containers.
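
As a rough way to check whether hairpin mode is actually set on the bridge ports (the path assumes the default docker0 bridge):

# 1 = hairpin enabled on that veth port, 0 = disabled
for p in /sys/class/net/docker0/brif/*/hairpin_mode; do
    echo "$p: $(cat "$p")"
done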

Just hopping on board. I’m seeing the same behavior on the latest Amazon ec2 instance. After some period of time, the container just tips over and becomes unresponsive.

$ docker info
Containers: 2
Images: 31
Server Version: 1.9.1
Storage Driver: devicemapper
 Pool Name: docker-202:1-263705-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 107.4 GB
 Backing Filesystem:
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 1.199 GB
 Data Space Total: 107.4 GB
 Data Space Available: 5.754 GB
 Metadata Space Used: 2.335 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.145 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.4.10-22.54.amzn1.x86_64
Operating System: Amazon Linux AMI 2016.03
CPUs: 1
Total Memory: 995.4 MiB
Name: [redacted]
ID: OB7A:Q6RX:ZRMK:4R5H:ZUQY:BBNK:BJNN:OWKS:FNU4:7NI2:AKRT:5SEP

$ docker version
Client:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5/1.9.1
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.9.1
 API version:  1.21
 Go version:   go1.4.2
 Git commit:   a34a1d5/1.9.1
 Built:
 OS/Arch:      linux/amd64

Thanks for the link, Justin! I’ll troll Linus =)

kind regards. =* ❤️

@LK4D4 I believe it is just a high number of created/destroyed containers, especially containers doing a lot of outbound traffic. We also use LXC instead of Docker, but the bug is exactly the same as the one described here. I can try to reproduce it using your method if it is easy to describe and/or does not involve production load; the idea is to get a crashdump and maybe find more hints about what exactly triggers this bug.

happy birthday, bloody issue =) 6 May 2014

same thing here. Just rebooting. Latest docker version. Ubuntu 14.04.

Sorry to hear. This is a real problem. I guess that many scenarios lead to this issue. On previous versions, the way we managed that was by preventing the namespace deletion.

Hi, this regression was introduced in 5.4.120, and is fixed in 5.4.121.

Hello, do you have any more info on which specific commits introduced and fixed the issue as mentioned above?

See the commits authored by Eric Dumazet: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/?h=v5.4.121

It was specifically about ip6_vti interfaces: 5.4.120 added https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v5.4.121&id=98ebeb87b2cf26663270e8e656fe599a32e4c96d which introduced the regression.

(If I remember right, same issue was seen in some other stable/LTS kernel versions as well.)

I saw this for the first time on a Gentoo system with kernel v5.4.120, just upgraded from kernel v5.4.117. Kernel sources used: sys-kernel/gentoo-sources.

I get

unregister_netdevice: waiting for ip6_vti0 to become free. Usage count = 1

every 10 seconds or so.

Hi, this regression was introduced in 5.4.120, and is fixed in 5.4.121.

Found out my Docker logs are also being spammed. Kernel 5.4.0, Docker 19.03.8:

Mar 21 18:46:14 host.mysite.com dockerd[16544]: time="2020-03-21T18:46:14.127275161Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 21 18:45:13 host.mysite.com dockerd[16544]: time="2020-03-21T18:45:13.642050333Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 21 18:44:13 host.mysite.com dockerd[16544]: time="2020-03-21T18:44:13.161364216Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 21 18:43:12 host.mysite.com dockerd[16544]: time="2020-03-21T18:43:12.714725302Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

This issue has been solved by this commit : torvalds/linux@ee60ad2 Using kpatch

curl -SOL https://raw.githubusercontent.com/Aleishus/kdt/master/kpatchs/route.patch
kpatch-build -t vmlinux route.patch 
mkdir -p /var/lib/kpatch/${UNAME} 
cp -a livepatch-route.ko /var/lib/kpatch/${UNAME}
systemctl restart kpatch
kpatch list

This must be in 4.19.30 onwards.

Friends, I have been using the 4.19 kernel for stable operation for half a year. I hope that you can enjoy stability as well.

@ethercflow @2rs2ts we are also running debian. I have encountered a lot of issues trying to get kpatch-build to work. If I manage to find a workaround I’ll keep you posted. In any case, does anybody have any other solution? Is it kernel version 4.15 or 4.19 that mitigates the problem? I have been trying to find the answer for the past week and still have not managed to.

You may upgrade to 4.19. It’s in the backports.

BTW it’s been a year for us here. 😉

Anyone saw this bug with 4.19?

Are these vanilla kernels or heavily patched by distros with backported fixes?

I’m seeing this issue with one of my machines running Docker on Debian 9 Stretch (4.9.0-8-amd64). I experience this issue with a tunnel created within the Docker container via Docker Gen and it generates a kernel panic:

Message from syslogd@xxxx at Apr 29 15:55:41 ...
 kernel:[719739.507961] unregister_netdevice: waiting for tf-xxxxxxxx to become free. Usage count = 1

Here’s our Docker information:

Client:
 Version:           18.09.3
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        774a1f4
 Built:             Thu Feb 28 06:34:04 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.3
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       774a1f4
  Built:            Thu Feb 28 05:59:55 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Does anybody know if there’s a temporary fix for this that doesn’t involve restarting the entire machine? We’d really prefer not to restart the whole machine when we experience this issue.

Somewhat off-topic: we also cannot suppress the kernel panic messages in the terminal. I’ve tried dmesg -D and dmesg -n 1, but no luck. Is there a way to suppress this type of kernel panic message from within the terminal? It’s annoying trying to type commands and having that message pop up every 10 seconds or so.

Thanks.
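
Regarding suppressing the broadcasts: dmesg -D / dmesg -n only affect what the kernel prints to the console, not the “Message from syslogd” lines, which (at least with a stock rsyslog config) come from the *.emerg rule that writes emergency messages to every logged-in terminal. A possible workaround, assuming that default rule is what is hitting you (the file path may differ per distro):

# comment out rsyslog's "broadcast emergencies to all users" rule
sed -i 's/^\(\*\.emerg\)/#\1/' /etc/rsyslog.conf
systemctl restart rsyslog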

@dElogics What do you mean by “host namespace”? Is it simply --privileged?

No, it means --network=host

@victorgp We still experience the issue with the 4.15 kernel. We will report here when we have tested with 4.16 kernel (hopefully in a few weeks).

Did someone have this issue in a kernel 4.15 or newer?

This fix by Dan Streetman (https://github.com/torvalds/linux/commit/4ee806d51176ba7b8ff1efd81f271d7252e03a1d) is first included in kernel 4.15, and it seems that for at least one person the issue has not happened anymore since they upgraded to 4.16 (https://github.com/kubernetes/kubernetes/issues/64743#issuecomment-436839647)

Did someone try it out?

@dElogics Thanks for letting us know. Could you please show us what commands you ran to set these systemd limits to unlimited? I’d like to try that too.

@hzbd You mean delete the user-defined bridge network? Have you tried to dig further to find out why? Please let me know if you did; I’d really appreciate it.

The docker daemon. The systemd is the one shipped with Debian 9 (232-25).

Not sure about RHEL, but I’ve personally seen this issue on RHEL too. I’d set LimitNOFILE=1048576, LimitNPROC=infinity, LimitCORE=infinity, TasksMax=infinity
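
For reference, a minimal sketch of applying those limits to the docker daemon via a systemd drop-in (the drop-in path is just the conventional location; adjust to taste):

mkdir -p /etc/systemd/system/docker.service.d
cat > /etc/systemd/system/docker.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
EOF
systemctl daemon-reload
systemctl restart docker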

@dElogics What version of systemd is considered “newer”? Is this default limit enabled in the CentOS 7.5 systemd?

Also, when you ask if we’re running docker under any limits, do you mean the docker daemon, or the individual containers?

We have been puzzled by this issue for weeks. Linux 3.10.0-693.17.1.el7.x86_64, CentOS Linux release 7.4.1708 (Core).

Experienced the same issue on CentOS. My kernel is 3.10.0-693.17.1.el7.x86_64, but I didn’t get a similar stack trace in syslog.

@fho thanks. You actually don’t need docker at all to repro; just running the samba client in a network namespace will do the trick, as per https://github.com/moby/moby/issues/5618#issuecomment-318681443
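
A rough sketch of such a Docker-free repro, assuming smbclient is installed; the namespace name, addresses, and target share below are purely illustrative:

# throwaway namespace connected to the host via a veth pair
ip netns add smbtest
ip link add veth-host type veth peer name veth-ns
ip link set veth-ns netns smbtest
ip addr add 10.99.0.1/24 dev veth-host
ip link set veth-host up
ip netns exec smbtest ip addr add 10.99.0.2/24 dev veth-ns
ip netns exec smbtest ip link set veth-ns up
ip netns exec smbtest ip link set lo up
# run the samba client inside the namespace, then tear the namespace down
ip netns exec smbtest smbclient -L //10.99.0.1 -N || true
ip netns del smbtest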

@rn thanks for the investigations!

I’m running on a baremetal desktop PC so I have access to everything. It’s an i7-4790K + 32 GiB. Currently I’m running on an up-to-date Arch Linux + kernel from the testing repo (4.12.3-1-ARCH)

In my case everything behaves as you describe in your Experiment 2 (4.11.12 kernel):

  • After the client-smb container exits, running new containers is impossible for 4+ minutes.
  • I never have kernel crashes
  • The unregister_netdevice: waiting for lo to become free. Usage count = 1 message appears repeatedly if I try to run any new container during that 4+ minute delay after client-smb has exited, and only then. Running a new container after those 4 minutes behaves normally.

So I suppose there’s an issue related to network interfaces somewhere in the cleanup process of the smb-client container.

Hi,

I’ve just created 2 repos https://github.com/piec/docker-samba-loop and https://github.com/piec/docker-nfs-loop that contain the necessary setup in order to reproduce this bug

My results:

  • 4.11.3-1-ARCH (Arch Linux) kernel: I can trigger the bug with docker-samba-loop in a few iterations (<10). I can’t reproduce it with docker-nfs-loop
  • 4.11.9-1-ARCH same results (versions)
  • 4.12.3-1-ARCH (testing repo) same results
  • 4.11.11-coreos: same results for docker-samba-loop, didn’t try docker-nfs-loop

Hope this helps. Cheers

@drweber you will also find this patch in upcoming stable releases (for now 4.11.12, 4.9.39, 4.4.78, 3.18.62)

I have kernel version 3.10.0-514.21.1.el7.x86_64, and I still have the same symptom.

Message from syslogd@docker at May 26 22:02:26 ...
 kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1
# uname -a
Linux docker 3.10.0-514.21.1.el7.x86_64 #1 SMP Thu May 25 17:04:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
# uptime
 22:03:10 up 35 min,  3 users,  load average: 0.16, 0.07, 0.06

I can make this happen reliably when running Proxmox and using containers. Specifically, if I have moved a considerable amount of data, or really any amount of data very recently, shutting down or hard-stopping the container will produce this error. I’ve seen it most often when using containers that mount my NAS inside them, but that might be a coincidence.

# uname -a
Linux proxmox01 4.4.40-1-pve #1 SMP PVE 4.4.40-82 (Thu, 23 Feb 2017 15:14:06 +0100) x86_64 GNU/Linux

# cat /etc/debian_version
8.7

And from within Proxmox:

proxmox-ve: 4.4-82 (running kernel: 4.4.40-1-pve)
pve-manager: 4.4-12 (running version: 4.4-12/e71b7a74)
pve-kernel-4.4.35-1-pve: 4.4.35-77
pve-kernel-4.4.40-1-pve: 4.4.40-82
lvm2: 2.02.116-pve3
corosync-pve: 2.4.2-1
libqb0: 1.0-1
pve-cluster: 4.0-48
qemu-server: 4.0-109
pve-firmware: 1.1-10
libpve-common-perl: 4.0-92
libpve-access-control: 4.0-23
libpve-storage-perl: 4.0-76
pve-libspice-server1: 0.12.8-2
vncterm: 1.3-1
pve-docs: 4.4-3
pve-qemu-kvm: 2.7.1-4
pve-container: 1.0-94
pve-firewall: 2.0-33
pve-ha-manager: 1.0-40
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u3
lxc-pve: 2.0.7-3
lxcfs: 2.0.6-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
smartmontools: 6.5+svn4324-1~pve80
zfsutils: 0.6.5.9-pve15~bpo80

It’s worth noting that Docker is not installed on this system and never has been. I’m happy to provide any data the community needs to troubleshoot this issue, just tell me what commands to run.

FWIW, Kubernetes has correlated this entirely with veth “hairpin mode” and has stopped using that feature completely. We have not experienced this problem at all, across tens of thousands of production machines and vastly more test runs, since changing.

Until this is fixed, abandon ship. Find a different solution 😦

On Wed, Feb 15, 2017 at 10:00 AM, ETL notifications@github.com wrote:

While I’m not the one fixing this issue (not being much into Linux kernel dev), I think I am right in saying that the “me too” comments aren’t that helpful. By that I mean that just saying “I have this problem too, with Kernel vx.x and Docker 1.x” does not bring anything new to the discussion.

However, I would suggest that “me too” comments which describe more the environment and method to reproduce would be of great value.

When reading all the comments, it is clear that there are a few problems - as I posted earlier, some with vethXYZ, some with eth0 and others with lo0. This suggests that they could be caused by different problems. So just saying “me too” without a full description of the error and environment may mislead people.

Also, when describing the environment, giving the Kernel and Docker version is not sufficient. Per the thread, there seem to be a few factors, such as whether ipv6 is enabled or not, or NodeJS not responding to SIGINT (or other containers; not bashing on NodeJS here).

So describing what the workload on the environment is would be useful. Also, this occurs when a container is being shut down, so I would suggest that people experiencing this issue pay attention to which container is being stopped when the problem rears its ugly head.

While it seems the problem is a race condition in the kernel, identifying the trigger will be of tremendous help to those who will fix the issue. It can even give affected users an immediate workaround, such as implementing a signal handler in a NodeJS application (I don’t know myself whether this prevents the issue from triggering, but it seems so per earlier comments from others).


Yes, upgrading the kernel on Proxmox to the latest 4.9.9 didn’t resolve the error. Strange, as it only just appeared after a year without issues.

There might be something in a previous statement further up in the thread about it being linked to NFS or CIFS shares being mounted.


On 14 Feb 2017, at 07:47, Alfonso da Silva notifications@github.com wrote:

I’m getting it running Docker 1.13.1 on a Debian with kernel 4.9.9-040909.


@jsoler Right, I was doing that too, and it still happened. Only once I did it at the PXE level did it stop.

label coreos
        menu label CoreOS
        kernel coreos/coreos_production_pxe.vmlinuz
        append initrd=coreos/coreos_production_pxe_image.cpio.gz ipv6.disable=1 cloud-config-url=http://...

I have/had been dealing with this bug for almost a year now. I use CoreOS with PXE boot; I disabled ipv6 in the pxeboot config and I haven’t seen this issue once since then.
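
For non-PXE installs, the equivalent of that PXE append line is adding the same parameter to the kernel command line via grub; a sketch for Debian/Ubuntu-style systems (paths and tooling differ elsewhere):

# add ipv6.disable=1 to the kernel command line and regenerate the grub config
sed -i 's/^GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="ipv6.disable=1 /' /etc/default/grub
update-grub
reboot
# after the reboot, confirm the parameter is active
cat /proc/cmdline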

Hey, for everyone affected by this issue on RHEL or CentOS, I’ve backported the commit from the mainline kernels (torvalds/linux@751eb6b6042a596b0080967c1a529a9fe98dac1d) that fixes the race condition in the IPV6 IFP refcount to 3.10.x kernels used in enterprise distributions. This should fix this issue.

You can find the bug report with working patch here: If you are interested in testing it and have a RHEL 7 or CentOS 7 system, I have already compiled the latest CentOS 7.3 3.10.0-514.6.1.el7.x86_64 kernel with the patch. Reply to the CentOS bugtracker thread and I can send you a link to the build.

Note: there may be another issue causing a refcount leak but this should fix the error message for many of you.

@stefanlasiewski @henryiii @jsoler

I’ll be trying out a build also adding this fix: http://www.spinics.net/lists/netdev/msg351337.html later tonight.

Did anyone reach a fix procedure for RancherOS?

@cpuguy83 Docker doesn’t hang for me when this error occurs, but the containers get killed, which in my situation is breaking my Jenkins/CI jobs.

Getting this issue on CentOS 7:

kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

Linux foo 3.10.0-514.2.2.el7.x86_64 #1 SMP Tue Dec 6 23:06:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

docker-engine-1.12.5-1.el7.centos.x86_64

This is affecting my CI builds, which run inside Docker containers and appear to be dying suddenly when this console message appears. Is there a fix or a workaround? Thanks!

1.12.4 and 1.13 should, in theory, not freeze up when this kernel issue is hit. The reason the freeze in the docker daemon occurs is because the daemon is waiting for a netlink message back from the kernel (which will never come) while holding the lock on the container object.

1.12.4 and 1.13 set a timeout on this netlink request to at least release the container lock. This does not fix the issue, but at least (hopefully) does not freeze the whole daemon. You will likely not be able to spin up new containers, and similarly probably will not be able to tear them down since it seems like all interactions with netlink stall once this issue is hit.

@reshen I’m going to build 4.8.8 this weekend on my laptop and see if that fixes it for me!

On Thu, Dec 1, 2016 at 10:29 AM, Ernest Mueller notifications@github.com wrote:

I am actually seeing this on Amazon Linux in an ECS cluster - the message occasionally appears, but it doesn’t lock up, like reshen is seeing now. Docker 1.11.2. Uname reports “4.4.14-24.50.amzn1.x86_64” as the version.



Ubuntu 16.04 with 4.4.0-47 was still affected… trying 4.4.0-49 now, will report later.

edit: 2016-11-28: -49 is still showing that log line in dmesg.

@reshen Excellent research. So far I’ve also not been able to reproduce the problem using Linux 4.8.8 on Xubuntu 16.04.

I’ve been using the Ubuntu mainline kernel builds. I do not have a well defined test case, but I could consistently reproduce the problem before by starting and stopping the set of docker containers I work with.

To test Linux 4.8.8, the easiest approach for me was to switch from aufs to overlay2 as the storage driver, since the mainline kernel builds do not include aufs. I don’t think this will influence the test, but it should be noted.

In the past I’ve tested Linux 4.4.4 with the 751eb6b6 patch backported by Dan Streetman; this did not seem to reduce the problem for me. It will be interesting to see whether also backporting the two patches you noted (5086cadf and 6fff1319) gives the same result as 4.8.8.
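
For anyone wanting to try the same aufs-to-overlay2 switch, a sketch of the daemon config change (note: this overwrites /etc/docker/daemon.json, so merge it by hand if you already have one, and existing aufs images/containers will not be visible after the switch):

cat > /etc/docker/daemon.json <<'EOF'
{
  "storage-driver": "overlay2"
}
EOF
systemctl restart docker
docker info | grep -i 'storage driver'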

@reshen I’ll get us updated to 4.8.8 and report back 👍 Thanks much for your research!

I’ve been testing 4.8.8 in a tight loop (see [2] from my earlier comment for the test case) non-stop for the last 4 days. So far, so good.

Facts

  • Patch 751eb6b6 significantly reduces the occurrence of this problem, but does not entirely eliminate it.

Suppositions

@meatballhat pointed out that their production servers experienced the problem while running 4.8.7. This leaves us with two possibilities:

  • My test case is faulty (the more likely possibility)
  • 4.8.8 introduced a fix. Looking at the 4.8.8 changelog, commits 5086cadf and 6fff1319 both mention netdev issues that were resolved. Neither one explicitly calls out the problem here, but both are close enough to be suspicious.

Can we get a few folks to try 4.8.8 to see if they are able to reproduce this problem?

CentOS 7, kernel 4.4.30, again…

How do I fix this bug? I hit it every day.

Same here… Fedora 24. It happens randomly; it can be fine for a week, then I get one every 10 hours: kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

@thaJeztah ah, I see the question you were directing me at now.

So the patch is in the upstream 4.4 stable queue; for 16.04 it’s likely to be included not in the next kernel SRU (which is already in progress) but in the one after that, about 5-6 weeks from now. If it is needed in 14.04 too, please let me know so that it can be backported.

@rdallman Deactivating ipv6 via grub does not prevent unregister_netdevice for me on either Ubuntu 16.04 (kernel 4.4.0-36-generic) or 14.04 (kernel 3.13.0-95-generic), regardless of the --userland-proxy setting (either true or false).

@dadux Thank you for your help. On Kubernetes 1.3.4 CoreOS 1068 Stable, Docker 10.3, Flannel as networking layer, I have fixed the problem by making the following changes in my CoreOS units:

    - name: docker.service
      drop-ins:
        - name: 30-Set-Promiscuous-Mode.conf
          content: |
            [Service]
            ExecStartPost=/usr/bin/sleep 5
            ExecStartPost=/usr/bin/ip link set docker0 promisc on

Added the following to kubelet.service:

        --hairpin-mode=none

If your docker daemon uses the docker0 bridge, setting --hairpin-mode=promiscuous-bridge will have no effect, as the kubelet will try to configure the nonexistent bridge cbr0.

For CoreOS, my workaround to mirror the Kubernetes behaviour while still using flannel:

  • Add a systemd drop-in for docker.service to configure the bridge docker0 interface in promiscuous mode. (Surely there’s a more elegant way to do this?):
- name: docker.service
  command: start
  drop-ins:
   - name: 30-Set-Promiscuous-Mode.conf
     content: |
       [Service]
       ExecStartPost=/usr/bin/sleep 5
       ExecStartPost=/usr/bin/ip link set docker0 promisc on
  • Tell the kubelet not to set hairpin in the docker bridge: kubelet --hairpin-mode=none

You can check if hairpin is enabled for your interfaces with brctl showstp docker0 or for f in /sys/devices/virtual/net/*/brport/hairpin_mode; do cat $f; done
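
Expanding on that check slightly, a sketch for printing each bridge port together with its hairpin flag, and for turning hairpin off on a single port for testing (the veth name is illustrative; the setting is lost when the veth is recreated):

# 1 = hairpin on, 0 = hairpin off
for f in /sys/devices/virtual/net/*/brport/hairpin_mode; do
  echo "$f: $(cat "$f")"
done
# temporarily disable hairpin on one port
echo 0 > /sys/devices/virtual/net/veth1234abc/brport/hairpin_mode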

Happens on CoreOS alpha 1097.0.0 also.

Kernel: 4.6.3 Docker: 1.11.2

I can confirm that after moving away from hairpin-veth and using the cbr0 bridge, I cannot reproduce the problem anymore.

@edevil I use terraform to create routes. You can find it at this repo. I have quickly created this configuration and tested only once. I hope it is enough to provide basic logic behind it.

We have encountered this bug while using Flannel with hairpin-veth enabled on a Kubernetes cluster (using the iptables proxy). This bug was happening only when we ran and stopped too many containers. We switched to using the cbr0 bridge network and promiscuous-bridge hairpin mode and have never seen it again. Actually, it is easy to reproduce this bug if you are using hairpin-veth: just start this job with 100 containers on Kubernetes.

This problem does not correlate with the filesystem in any way. I have seen this problem with zfs, overlayfs, devicemapper, btrfs and aufs, and with or without swap. It is not even limited to Docker; I hit the same bug with lxc too. The only workaround I currently see is not to stop containers concurrently.
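
If serializing the stops helps in your environment too, the simplest form is just stopping containers one at a time instead of passing every ID to a single docker stop call, e.g.:

# docker stop blocks until each container has exited (or is killed after the timeout)
for c in $(docker ps -q); do
  docker stop "$c"
done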

@sirlatrom not fixed. Seeing this again 😭 Required multiple reboots to resolve.

Currently running 3.19.0-18-generic. Will try upgrading to the latest.

I encountered this again on CoreOS 1010.1.0 with kernel 4.5.0 yesterday, it had been after several containers were started and killed in rapid succession.

@joshrendek it’s a kernel bug. Looks like even newly released kernel 4.4 does not fix it, so there is at least one more race condition somewhere 😃