kubernetes: Test failures caused by kernel NULL pointer dereference on debian-based CVM

I initially observed this in PR tests: https://github.com/kubernetes/kubernetes/pull/44326#issuecomment-299251676. I then checked ci-kubernetes-e2e-gce-etcd3 and found that the suite failed ~3 times a day, and many of those failures were caused by the kernel panic.

A few examples: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-etcd3/9104 https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-etcd3/9104/artifacts/bootstrap-e2e-minion-group-qcn2/serial-1.log

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-etcd3/9078 https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-etcd3/9078/artifacts/bootstrap-e2e-minion-group-wlm4/serial-1.log

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-etcd3/9068 https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-etcd3/9068/artifacts/bootstrap-e2e-minion-group-ml5v/serial-1.log

[  780.070508] aufs au_opts_verify:1570:docker[19344]: dirperm1 breaks the protection by the permission bits on the lower branch
May  4 02:50:24 bootstrap-e2e-minion-group-wlm4 kernel: [  780.070508] aufs au_opts_verify:1570:docker[19344]: dirperm1 breaks the protection by the permission bits on the lower branch
[  780.111216] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
[  780.119476] IP: [<ffffffff810a1100>] check_preempt_wakeup+0xd0/0x1d0
[  780.126101] PGD 214722067 PUD 20b376067 PMD 0 
[  780.131089] Oops: 0000 [#1] SMP 
[  780.134721] Modules linked in: sg nf_conntrack_netlink nfnetlink xt_statistic sch_htb ebt_ip ebtable_filter ebtables veth ipt_REJECT xt_nat xt_recent xt_mark xt_comment xt_tcpudp ipt_MASQUERADE iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype ip_tables xt_conntrack x_tables nf_nat nf_conntrack bridge stp llc aufs(C) nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc crct10dif_pclmul crc32_pclmul crc32c_intel psmouse processor parport_pc i2c_piix4 pvpanic parport thermal_sys pcspkr serio_raw evdev i2c_core aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd virtio_net button ext4 crc16 mbcache jbd2 sd_mod crc_t10dif crct10dif_common virtio_scsi scsi_mod virtio_pci virtio virtio_ring
[  780.210840] CPU: 0 PID: 31766 Comm: exe Tainted: G         C    3.16.0-4-amd64 #1 Debian 3.16.39-1
[  780.219923] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[  780.229261] task: ffff880037bc2b60 ti: ffff880214f5c000 task.ti: ffff880214f5c000
[  780.236875] RIP: 0010:[<ffffffff810a1100>]  [<ffffffff810a1100>] check_preempt_wakeup+0xd0/0x1d0
[  780.245919] RSP: 0018:ffff880214f5fe60  EFLAGS: 00010006
[  780.251375] RAX: 0000000000000001 RBX: ffff88010053c040 RCX: 0000000000000008
[  780.258632] RDX: 0000000000000001 RSI: ffff880212d20050 RDI: ffff88021fc12fb8
[  780.265891] RBP: 0000000000000000 R08: ffffffff81610640 R09: 0000000000000001
[  780.273149] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880037bc2b60
[  780.280403] R13: ffff88021fc12f40 R14: 0000000000000000 R15: 0000000000000000
[  780.287657] FS:  0000000002826880(0063) GS:ffff88021fc00000(0000) knlGS:0000000000000000
[  780.295866] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  780.301732] CR2: 0000000000000078 CR3: 00000001a03f8000 CR4: 00000000001406f0
[  780.308996] Stack:
[  780.311128]  0000000000012f40 ffff88021fc12f40 0000000000012f40 ffff88021fc12f40
[  780.319185]  ffff880212d206d4 0000000000000246 ffff88020a8839c0 ffffffff81095bb5
[  780.327242]  ffff880212d20050 ffffffff8109869a 00007fffffffeffd 0000000000000000
[  780.335351] Call Trace:
[  780.337923]  [<ffffffff81095bb5>] ? check_preempt_curr+0x85/0xa0
[  780.344058]  [<ffffffff8109869a>] ? wake_up_new_task+0xda/0x190
[  780.350105]  [<ffffffff81067a39>] ? do_fork+0x139/0x3d0
[  780.355463]  [<ffffffff8151b139>] ? stub_clone+0x69/0x90
[  780.360910]  [<ffffffff8151adcd>] ? system_call_fast_compare_end+0x10/0x15
[  780.367910] Code: 39 c2 7d 27 0f 1f 80 00 00 00 00 83 e8 01 48 8b 5b 70 39 d0 75 f5 48 8b 7d 78 48 3b 7b 78 74 15 0f 1f 00 48 8b 6d 70 48 8b 5b 70 <48> 8b 7d 78 48 3b 7b 78 75 ee 48 85 ff 74 e9 e8 8c cb ff ff 48 
[  780.395815] RIP  [<ffffffff810a1100>] check_preempt_wakeup+0xd0/0x1d0
[  780.402514]  RSP <ffff880214f5fe60>
[  780.406122] CR2: 0000000000000078
[  780.410418] ---[ end trace 8dfc3fa423bb7378 ]---
[  780.415157] Kernel panic - not syncing: Fatal exception
[  781.471619] Shutting down cpus with NMI
[  781.476527] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[  781.486828] Rebooting in 10 seconds..
[  791.465975] ACPI MEMORY or I/O RESET_REG.

The nodes are Debian-based CVM instances running Docker 1.11.2 on kernel 3.16.0-4-amd64 (Debian 3.16.39-1, per the trace above). I am not sure whether this is a known issue.
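
For anyone who wants to check other serial logs for the same signature, here is a minimal sketch in Python. The function name `find_kernel_oops` and the regular expressions are my own assumptions, derived only from the log lines quoted in this issue; this is not part of any Kubernetes tooling, and other kernels may format the oops differently.

```python
import re

# Patterns modeled on the log lines quoted in this issue (assumption:
# other kernel versions may word these messages differently).
OOPS_RE = re.compile(
    r"BUG: unable to handle kernel NULL pointer dereference at ([0-9a-f]+)"
)
IP_RE = re.compile(r"IP: \[<[0-9a-f]+>\] (\S+)")
PANIC_RE = re.compile(r"Kernel panic - not syncing: (.+)")


def find_kernel_oops(log_text):
    """Return details of the first NULL pointer oops in log_text, or None.

    The result holds the faulting address, the symbol from the first
    "IP:" line after the oops, and the panic message if one follows.
    """
    result = None
    for line in log_text.splitlines():
        if result is None:
            # Still looking for the start of an oops.
            m = OOPS_RE.search(line)
            if m:
                result = {"address": m.group(1), "symbol": None, "panic": None}
            continue
        if result["symbol"] is None:
            m = IP_RE.search(line)
            if m:
                result["symbol"] = m.group(1)
                continue
        m = PANIC_RE.search(line)
        if m:
            result["panic"] = m.group(1).strip()
            break
    return result


# Three lines copied from the trace in this issue.
SAMPLE = """\
[  780.111216] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
[  780.119476] IP: [<ffffffff810a1100>] check_preempt_wakeup+0xd0/0x1d0
[  780.415157] Kernel panic - not syncing: Fatal exception
"""

if __name__ == "__main__":
    print(find_kernel_oops(SAMPLE))
```

To use it on a real log, download one of the serial-1.log artifacts linked above, read it into a string, and pass that string to `find_kernel_oops`. A return value of `None` means that log contains no oops matching these patterns.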

/cc @kubernetes/sig-node-bugs @dchen1107

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 29 (24 by maintainers)

Most upvoted comments

It will be available in the next m59 and m60 releases this week. Stay tuned; the release notes can be found at: https://cloud.google.com/container-optimized-os/docs/release-notes