moby: Kernel warning "cache_from_obj: Wrong slab cache." on CentOS 7.4 with overlay2 storage driver

Description

Steps to reproduce the issue:

  1. Set up Docker 17.10.0-ce on CentOS 7.4 with the overlay2 storage driver.
  2. Create and remove many containers (in my attached log, the warning first occurs after 2593 container creations/deletions).
  3. After a few hours, a kernel warning occurs. After the first warning, many more warning messages (around 30 per container) are generated occasionally when creating a container.
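
The create/remove churn in step 2 could be driven by a loop like the one below. This is only a sketch: the report does not say how the containers were created, so the `busybox` image, the `true` command, and the `DOCKER`/`IMAGE` override variables are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of a container churn loop (assumed workload, not the reporter's).
# DOCKER and IMAGE are hypothetical overrides so the loop can be pointed
# at a different client binary or image.
DOCKER="${DOCKER:-docker}"
IMAGE="${IMAGE:-busybox}"

churn_containers() {
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        # --rm removes the container as soon as its command exits,
        # so each iteration is one create plus one delete.
        "$DOCKER" run --rm "$IMAGE" true || return 1
        i=$((i + 1))
    done
    echo "completed $i create/remove cycles"
}
```

Calling `churn_containers 3000` would run a few hundred cycles more than the 2593 seen in the attached log. The function stops at the first failed `docker run`.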

Describe the results you received:

Nov  6 14:52:02 hostname kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(2602:4291dc5a69218ff1387fa53b80a7e89c92c0ad2b8307d78de372cd555269d1af)
Nov  6 14:52:02 hostname kernel: ------------[ cut here ]------------
Nov  6 14:52:02 hostname kernel: WARNING: CPU: 11 PID: 267 at mm/slab.h:249 kmem_cache_free+0x195/0x200
Nov  6 14:52:02 hostname kernel: Modules linked in: xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink xt_addrtype br_netfilter overlay(T) ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw iTCO_wdt gf128mul glue_helper iTCO_vendor_support ablk_helper cryptd ipmi_ssif dcdbas iomemory_vsl(POE)
Nov  6 14:52:02 hostname kernel: pcspkr sg mei_me mei lpc_ich shpchp ipmi_si wmi ipmi_devintf ipmi_msghandler acpi_power_meter tpm_crb ip_tables xfs libcrc32c sd_mod sr_mod cdrom crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crct10dif_pclmul crct10dif_common crc32c_intel megaraid_sas libata tg3 i2c_core ptp pps_core dm_mirror dm_region_hash dm_log dm_mod
Nov  6 14:52:02 hostname kernel: CPU: 11 PID: 267 Comm: kauditd Tainted: P           OE  ------------ T 3.10.0-693.5.2.el7.x86_64 #1
Nov  6 14:52:02 hostname kernel: Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 1.5.4 10/002/2015
Nov  6 14:52:02 hostname kernel: 0000000000000000 00000000f1baaffa ffff88085cf87d48 ffffffff816a3e51
Nov  6 14:52:02 hostname kernel: ffff88085cf87d88 ffffffff810879d8 000000f95cf87d68 ffff8805fdff4000
Nov  6 14:52:02 hostname kernel: ffff88017fc07700 000000000000004d ffff88105d272f70 0000000000000000
Nov  6 14:52:02 hostname kernel: Call Trace:
Nov  6 14:52:02 hostname kernel: [<ffffffff816a3e51>] dump_stack+0x19/0x1b
Nov  6 14:52:02 hostname kernel: [<ffffffff810879d8>] __warn+0xd8/0x100
Nov  6 14:52:02 hostname kernel: [<ffffffff81087b1d>] warn_slowpath_null+0x1d/0x20
Nov  6 14:52:02 hostname kernel: [<ffffffff811de905>] kmem_cache_free+0x195/0x200
Nov  6 14:52:02 hostname kernel: [<ffffffff815710d7>] kfree_skbmem+0x37/0x90
Nov  6 14:52:02 hostname kernel: [<ffffffff81573004>] consume_skb+0x34/0x80
Nov  6 14:52:02 hostname kernel: [<ffffffff81117536>] kauditd_send_skb+0x66/0x140
Nov  6 14:52:02 hostname kernel: [<ffffffff8111773f>] kauditd_thread+0xbf/0x1e0
Nov  6 14:52:02 hostname kernel: [<ffffffff810c4820>] ? wake_up_state+0x20/0x20
Nov  6 14:52:02 hostname kernel: [<ffffffff81117680>] ? audit_printk_skb+0x70/0x70
Nov  6 14:52:02 hostname kernel: [<ffffffff810b099f>] kthread+0xcf/0xe0
Nov  6 14:52:02 hostname kernel: [<ffffffff810b08d0>] ? insert_kthread_work+0x40/0x40
Nov  6 14:52:02 hostname kernel: [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
Nov  6 14:52:02 hostname kernel: [<ffffffff810b08d0>] ? insert_kthread_work+0x40/0x40
Nov  6 14:52:02 hostname kernel: ---[ end trace aa2550a346590dce ]---
Nov  6 14:52:02 hostname kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(2602:4291dc5a69218ff1387fa53b80a7e89c92c0ad2b8307d78de372cd555269d1af)
Nov  6 14:52:02 hostname kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(2602:4291dc5a69218ff1387fa53b80a7e89c92c0ad2b8307d78de372cd555269d1af)
Nov  6 14:52:02 hostname kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(2602:4291dc5a69218ff1387fa53b80a7e89c92c0ad2b8307d78de372cd555269d1af)
Nov  6 14:52:02 hostname kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(2602:4291dc5a69218ff1387fa53b80a7e89c92c0ad2b8307d78de372cd555269d1af)
Nov  6 14:52:02 hostname kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(2602:4291dc5a69218ff1387fa53b80a7e89c92c0ad2b8307d78de372cd555269d1af)
Nov  6 14:52:02 hostname kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(2602:4291dc5a69218ff1387fa53b80a7e89c92c0ad2b8307d78de372cd555269d1af)
Nov  6 14:52:02 hostname kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(2602:4291dc5a69218ff1387fa53b80a7e89c92c0ad2b8307d78de372cd555269d1af)
...
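
To get a sense of how many of these messages a host has logged, the lines can be tallied by the cache name printed first in each message. The function below is a minimal sketch. It assumes the message format shown in the excerpt above and reads log lines from standard input. The function name is made up for this example.

```shell
#!/bin/sh
# Count "Wrong slab cache" messages, grouped by the first cache name
# printed in each message (kmalloc-256 in the excerpt above).
count_slab_warnings() {
    sed -n 's/.*cache_from_obj: Wrong slab cache\. \([^ ]*\) but object is from.*/\1/p' |
        sort |
        uniq -c |
        awk '{print $1, $2}'   # normalize uniq's padded output to "<count> <name>"
}
```

On a CentOS 7 host this could be fed from `/var/log/messages` or from `dmesg`, for example `dmesg | count_slab_warnings`.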

Describe the results you expected:

No kernel warnings.

Additional information you deem important (e.g. issue happens only occasionally):

I did not see this error while I was using the overlay storage driver, so it seems to be caused by the overlay2 storage driver.

The time until the first kernel warning varies, but the issue always reproduces.

Output of docker version:

Client:
 Version:      17.10.0-ce
 API version:  1.33
 Go version:   go1.8.3
 Git commit:   f4ffd25
 Built:        Tue Oct 17 19:04:05 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.10.0-ce
 API version:  1.33 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   f4ffd25
 Built:        Tue Oct 17 19:05:38 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 560
Server Version: 17.10.0-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 0351df1c5a66838d0c392b4ac4cf9450de844e2d
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-693.5.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 62.52GiB
Name: *hostname*
ID: LBBE:G6PD:GBS4:LNDS:S4GD:7NG5:HESD:6GDV:G3JX:FTOF:MMF7:4SRG
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

Physical machine (Dell PowerEdge R630)

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 3
  • Comments: 25 (6 by maintainers)

Most upvoted comments

This is extremely bad news: basically, Docker is useless in production-heavy environments with any kernel below 4.x.

Currently my Docker does not last more than a day; after that it just dies because of this issue.

I confirm that I resolved that problem on all my servers by using kernel 4.x (4.15 in my case).

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm --httpproxy wx1.no.cg.lab.nms.mlb.inet --httpport 3128 -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
yum --enablerepo=elrepo-kernel install kernel-lt

To enable the kernel at boot automatically, list the boot entries:

awk -F\' '/^menuentry/{print $2}' /etc/grub2.cfg

Find the 4.4.xxx entry in the list, then set it as the default:

grub2-set-default "<kernel name you want to boot with>"
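
After rebooting, it may be worth confirming that the node actually came up on the new kernel and not the stock 3.10 one. The check below is a small sketch. The threshold of 4 comes only from the comments in this thread, and both function names are made up for this example.

```shell
#!/bin/sh
# Extract the major version from a kernel release string,
# e.g. "4.4.140-1.el7.elrepo.x86_64" -> "4".
kernel_major() {
    printf '%s\n' "${1%%.*}"
}

# Succeeds when the given release string has a major version of 4 or higher.
is_kernel_4_or_newer() {
    [ "$(kernel_major "$1")" -ge 4 ]
}
```

For example, `is_kernel_4_or_newer "$(uname -r)" || echo "still on an old kernel"` prints the message on 3.10.0-693.5.2.el7.x86_64 and nothing on 4.4.140-1.el7.elrepo.x86_64.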

@robertofabrizi Despite reservations about running it in prod, we’ve been slowly updating nodes to kernel-lt (not kernel-ml) and have not seen the issue since. Currently running: 4.4.140-1.el7.elrepo.x86_64