harvester: [BUG] Frequent kernel panics occurring during operation

Using Harvester 0.3.0-rc1, nodes are randomly rebooting/crashing.

The following trace can be found in the kernel logs shortly before the automatic reboot (due to panic=10) occurs:

[ 8258.424256] ------------[ cut here ]------------
[ 8258.424258] rq->tmp_alone_branch != &rq->leaf_cfs_rq_list
[ 8258.424281] WARNING: CPU: 33 PID: 0 at ../kernel/sched/fair.c:378 enqueue_task_fair+0x353/0x610
[ 8258.424283] Modules linked in: binfmt_misc rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm ib_core ebt_ip ebtable_broute ebtables vhost_net vhost tun tap ipt_rpfilter xt_set xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net ip_set vxlan ip6_udp_tunnel udp_tunnel veth nf_conntrack_netlink nfnetlink xt_addrtype xt_recent xt_statistic xt_nat ipt_REJECT xt_tcpudp iptable_mangle ip6table_mangle ip6table_filter ip6table_nat ip6_tables xt_MASQUERADE xt_conntrack xt_comment iptable_filter xt_mark bpfilter iptable_nat ip_tables nf_nat x_tables nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc af_packet bonding iscsi_ibft rfkill intel_rapl_msr intel_rapl_common ipmi_ssif isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iTCO_wdt mgag200 intel_pmc_bxt iTCO_vendor_support kvm drm_kms_helper dell_smbios dcdbas(X) mei_me cec rc_core ipmi_si syscopyarea sysfillrect sysimgblt irqbypass pcspkr dell_wmi_descriptor wmi_bmof joydev mei
[ 8258.424350]  i2c_i801 lpc_ich fb_sys_fops ipmi_devintf ipmi_msghandler button drm fuse configfs overlay loop hid_generic usbhid ext4 crc16 mbcache jbd2 sd_mod crc32_pclmul crc32c_intel ghash_clmulni_intel xhci_pci xhci_hcd aesni_intel i40e crypto_simd usbcore cryptd ahci glue_helper libahci nvme igb libata nvme_core megaraid_sas t10_pi i2c_algo_bit dca wmi sunrpc dm_mirror dm_region_hash dm_log be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod
[ 8258.424401] Supported: Yes, External
[ 8258.424406] CPU: 33 PID: 0 Comm: swapper/33 Tainted: G          I    X    5.3.18-59.24-default #1 SLE15-SP3
[ 8258.424408] Hardware name: Dell Inc. PowerEdge R740xd/06WXJT, BIOS 2.8.2 08/27/2020
[ 8258.424416] RIP: 0010:enqueue_task_fair+0x353/0x610
[ 8258.424420] Code: 60 09 00 00 0f 84 cc fd ff ff 80 3d 9a b8 51 01 00 0f 85 bf fd ff ff 48 c7 c7 98 56 f3 ad c6 05 86 b8 51 01 01 e8 3d 0c fc ff <0f> 0b e9 a5 fd ff ff 49 63 95 48 0a 00 00 48 c7 c0 40 94 01 00 48
[ 8258.424423] RSP: 0018:ffffb8174037be40 EFLAGS: 00010086
[ 8258.424425] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 8258.424428] RDX: 000000000000002d RSI: ffffffffaeaecd6d RDI: 0000000000000046
[ 8258.424430] RBP: ffffa1677f62cd00 R08: ffffffffaeaecd40 R09: 000000000002c500
[ 8258.424432] R10: ffffb8174037bdc0 R11: 0000000000000000 R12: 0000000000000000
[ 8258.424433] R13: ffffa1677f62cc80 R14: 0000000000000001 R15: 0000000000000000
[ 8258.424436] FS:  0000000000000000(0000) GS:ffffa1677f600000(0000) knlGS:0000000000000000
[ 8258.424438] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8258.424440] CR2: 000000c000feb000 CR3: 000000be081c6002 CR4: 00000000007706e0
[ 8258.424442] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8258.424444] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8258.424446] PKRU: 55555554
[ 8258.424447] Call Trace:
[ 8258.424459]  ttwu_do_activate+0x72/0x170
[ 8258.424464]  sched_ttwu_pending+0xa5/0x110
[ 8258.424470]  do_idle+0x166/0x270
[ 8258.424476]  cpu_startup_entry+0x19/0x20
[ 8258.424481]  start_secondary+0x155/0x1a0
[ 8258.424489]  secondary_startup_64_no_verify+0xc2/0xd0
[ 8258.424494] ---[ end trace 5cd94bd1dde862f3 ]---
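For anyone comparing nodes: the `panic=10` mentioned above is a kernel boot parameter, and a positive value means the kernel reboots that many seconds after a panic. A minimal sketch for checking it, assuming the command line is read from `/proc/cmdline`; the helper names and the sample command line in the usage note are hypothetical, not taken from a Harvester node.

```python
def panic_timeout(cmdline):
    """Return the integer value of the panic= boot parameter, or None if absent.

    If the parameter appears more than once, this sketch keeps the last
    occurrence (an implementation choice here, not verified kernel behaviour).
    """
    value = None
    for token in cmdline.split():
        if token.startswith("panic="):
            try:
                value = int(token[len("panic="):])
            except ValueError:
                pass  # ignore a malformed value rather than guess
    return value


def describe(timeout):
    """Describe what a given panic= value means for the node."""
    if timeout is None:
        return "panic= not set on the command line"
    if timeout > 0:
        return f"reboot {timeout} seconds after a panic"
    if timeout == 0:
        return "no automatic reboot after a panic"
    return "reboot immediately after a panic"


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            print(describe(panic_timeout(f.read())))
    except OSError:
        print("could not read /proc/cmdline")
```

For example, `describe(panic_timeout("root=/dev/sda1 panic=10 quiet"))` (a made-up command line) reports a reboot 10 seconds after a panic, which matches the behaviour described in this report.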

The same or a similar issue has been reported for CoreOS (Fedora kernel).

HW: Dell PowerEdge R740xd

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 49 (21 by maintainers)

Most upvoted comments

@bk201 Could you help build a Harvester ISO today with the kernel from https://github.com/harvester/harvester/issues/1342#issuecomment-933789767 ?

Also, since it’s a kernel debug build, it would be better to include kdump in the build as well. Ref: https://github.com/harvester/harvester/issues/1342#issuecomment-932309680

The master build ISO has the kernel updated, which should address this issue.

I’m getting the same random crashes with the official 0.3.0 release. Running kernel: 5.3.18-59.24-default. Hardware: Dell PowerEdge R740xd. I’ll try updating the kernel later to see if it fixes my issue.
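Two kernel builds are named in this thread, 5.3.18-59.24-default and the 5.3.18-59.27-default hotfix, and crashes are reported on both. When checking which build a node is actually running, a small parser helps avoid comparing release strings as text. This is only a sketch: the regular expression assumes the `X.Y.Z-A.B-flavor` shape seen in the logs here, and `parse_release` is a hypothetical helper name.

```python
import platform
import re

# Matches release strings shaped like "5.3.18-59.24-default".
RELEASE_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)\.(\d+)-(\w+)$")

# Kernel builds on which crashes are reported in this thread.
REPORTED_IN_THREAD = {"5.3.18-59.24-default", "5.3.18-59.27-default"}


def parse_release(release):
    """Split a release string into a numeric tuple and a flavor name."""
    m = RELEASE_RE.match(release)
    if m is None:
        raise ValueError(f"unexpected kernel release format: {release!r}")
    *nums, flavor = m.groups()
    return tuple(int(n) for n in nums), flavor


if __name__ == "__main__":
    running = platform.release()  # same string as `uname -r` on Linux
    print("running kernel:", running)
    if running in REPORTED_IN_THREAD:
        print("crashes were reported on this build in the thread")
    try:
        nums, flavor = parse_release(running)
        baseline, _ = parse_release("5.3.18-59.24-default")
        print("newer than 59.24:", nums > baseline)
    except ValueError as err:
        print(err)
```

Comparing the numeric tuples only orders the builds; it does not say whether a given build contains a fix.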

Got another crash with the hotfix kernel. Uploaded the dump harvester-vmcore-1342.tar to the SUSE upload server.

[ 8995.095798] BUG: kernel NULL pointer dereference, address: 0000000000000080
[ 9016.281685] NMI watchdog: Watchdog detected hard LOCKUP on cpu 21
[ 9016.281686] Modules linked in: ebt_ip ebtable_broute ebtables vhost_net vhost tun tap xt_statistic rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm ib_core xt_recent nf_conntrack_netlink vxlan ip6_udp_tunnel udp_tunnel ipt_rpfilter xt_set xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net ip_set nfnetlink veth xt_addrtype xt_nat ipt_REJECT xt_tcpudp iptable_mangle ip6table_mangle ip6table_filter ip6table_nat ip6_tables xt_MASQUERADE xt_conntrack xt_comment iptable_filter xt_mark bpfilter iptable_nat ip_tables nf_nat x_tables nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc af_packet bonding iscsi_ibft rfkill intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iTCO_wdt intel_pmc_bxt ipmi_ssif iTCO_vendor_support kvm mgag200 drm_kms_helper cec dell_smbios rc_core mei_me dcdbas(X) syscopyarea sysfillrect sysimgblt dell_wmi_descriptor irqbypass wmi_bmof pcspkr mei i2c_i801 fb_sys_fops
[ 9016.281704]  lpc_ich ipmi_si ipmi_devintf ipmi_msghandler button drm fuse configfs overlay loop ext4 crc16 mbcache jbd2 sd_mod crc32_pclmul crc32c_intel ghash_clmulni_intel xhci_pci xhci_hcd aesni_intel i40e crypto_simd cryptd ahci glue_helper nvme libahci igb usbcore nvme_core libata megaraid_sas t10_pi i2c_algo_bit dca wmi sunrpc dm_mirror dm_region_hash dm_log be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod
[ 9016.281715] Supported: No, Unreleased kernel
[ 9016.281716] CPU: 21 PID: 49287 Comm: systemd-udevd Kdump: loaded Tainted: G        W I    X    5.3.18-59.27-default #1 SLE15-SP3 (unreleased)
[ 9016.281717] Hardware name: Dell Inc. PowerEdge R740xd/06WXJT, BIOS 2.8.2 08/27/2020
[ 9016.281717] RIP: 0010:native_queued_spin_lock_slowpath+0x191/0x1e0
[ 9016.281718] Code: c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6 48 05 00 d9 02 00 48 03 04 f5 a0 f9 1b bb 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 32 48 85 f6 74 94 0f 0d 0e eb 8f 8b 07
[ 9016.281718] RSP: 0018:ffffa8508af8f998 EFLAGS: 00000046
[ 9016.281718] RAX: 0000000000000000 RBX: 0000000000000082 RCX: 0000000000580000
[ 9016.281719] RDX: ffff8aeeac8ad900 RSI: 0000000000000005 RDI: ffff8a8ec082cc80
[ 9016.281719] RBP: ffff8a8ec082cc80 R08: 0000000000580000 R09: ffffffffffffffff
[ 9016.281720] R10: 0000000000000008 R11: 0000000000000000 R12: ffffa8508af8fc20
[ 9016.281720] R13: ffff8a8ec082cc80 R14: 0000000000000010 R15: 0000000000000000
[ 9016.281721] FS:  00007f64c4385980(0000) GS:ffff8aeeac880000(0000) knlGS:0000000000000000
[ 9016.281721] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9016.281721] CR2: 00007f64c3deb590 CR3: 000000be5c62c003 CR4: 00000000007706e0
[ 9016.281722] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9016.281722] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9016.281723] PKRU: 55555554
[ 9016.281723] Call Trace:
[ 9016.281723]  _raw_spin_lock_irqsave+0x30/0x40
[ 9016.281723]  update_blocked_averages+0x2d/0x530
[ 9016.281724]  update_nohz_stats+0x42/0x60
[ 9016.281724]  update_sd_lb_stats.constprop.122+0x274/0x890
[ 9016.281724]  find_busiest_group+0x41/0x380
[ 9016.281725]  load_balance+0x15a/0xc60
[ 9016.281725]  newidle_balance+0x2a5/0x3b0
[ 9016.281725]  pick_next_task_fair+0x3e/0x3a0
[ 9016.281725]  __schedule+0x18d/0x760
[ 9016.281726]  schedule+0x2f/0xa0
[ 9016.281726]  schedule_hrtimeout_range_clock+0xee/0x100
[ 9016.281726]  ? sock_write_iter+0x97/0x100
[ 9016.281727]  ? __seccomp_filter+0x7a/0x690
[ 9016.281727]  ep_poll+0x3d4/0x4d0
[ 9016.281727]  ? wait_woken+0x80/0x80
[ 9016.281727]  do_epoll_wait+0xab/0xc0
[ 9016.281728]  __x64_sys_epoll_wait+0x1a/0x20
[ 9016.281728]  do_syscall_64+0x5b/0x1e0
[ 9016.281728]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 9016.281729] RIP: 0033:0x7f64c318aff6
[ 9016.281729] Code: 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 e8 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5a f3 c3 41 55 41 54 41 89 cd 55 53 41 89 d4
[ 9016.281730] RSP: 002b:00007ffcf896e958 EFLAGS: 00000246 ORIG_RAX: 00000000000000e8
[ 9016.281730] RAX: ffffffffffffffda RBX: 000055e713942940 RCX: 00007f64c318aff6
[ 9016.281731] RDX: 0000000000000002 RSI: 000055e7138f06b0 RDI: 0000000000000003
[ 9016.281731] RBP: 0000000000000001 R08: 00007f64c34589e0 R09: 0000000000000000
[ 9016.281731] R10: 00000000ffffffff R11: 0000000000000246 R12: ffffffffffffffff
[ 9016.281732] R13: 0000000000000002 R14: 000055e7139fc720 R15: 000055e713035298
[ 9016.281732] Kernel panic - not syncing: Hard LOCKUP
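To make reports like the one above easier to skim, the key events can be pulled out of a dmesg-style log along with their timestamps. The sketch below assumes the `[ seconds.micros] message` prefix used in these logs; the marker list is an illustrative choice covering the messages that appear in this thread, not an exhaustive set of kernel messages.

```python
import re

# "[ 9016.281685] message" -> timestamp and message text.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")

# Message prefixes seen in this thread (illustrative, not exhaustive).
MARKERS = ("WARNING:", "BUG:", "NMI watchdog:", "Kernel panic")

SAMPLE = """\
[ 8995.095798] BUG: kernel NULL pointer dereference, address: 0000000000000080
[ 9016.281685] NMI watchdog: Watchdog detected hard LOCKUP on cpu 21
[ 9016.281723] Call Trace:
[ 9016.281723]  _raw_spin_lock_irqsave+0x30/0x40
[ 9016.281732] Kernel panic - not syncing: Hard LOCKUP
"""


def scan(log_text):
    """Return (timestamp, message) pairs for lines starting with a marker."""
    events = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines without a timestamp prefix
        msg = m.group(2)
        if msg.startswith(MARKERS):
            events.append((float(m.group(1)), msg))
    return events


if __name__ == "__main__":
    previous = None
    for ts, msg in scan(SAMPLE):
        gap = "" if previous is None else f" (+{ts - previous:.3f}s)"
        print(f"{ts:12.6f}{gap} {msg}")
        previous = ts
```

On the lines quoted from this report, the scan makes it visible that the `BUG:` line precedes the lockup message by roughly 21 seconds, so the two belong to separate events in the log.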

Please ignore the analysis of the first oops; I was reading the wrong register dump. It’s not actually an oops and is probably another lockup report. The lines at the top of that report are from an oops, but that output is truncated.

@alexdepalex we’re working on that now. The enhanced kernel debugging ability will definitely be ready for v1.0 GA (if it misses v0.3).