v86: Kernel hang with CONFIG_DEBUG_NMI_SELFTEST (Alpine Linux)

When booting Alpine Linux in v86 (default kernel compiled with CONFIG_DEBUG_NMI_SELFTEST=y), the kernel softlocks during nmi_selftest:

[    1.350333] smpboot: weird, boot CPU (#0) not listed by the BIOS
[    1.350333] smpboot: SMP disabled
[    1.363333] Performance Events: unsupported Netburst CPU model 6 no PMU driver, software events only.
[    1.365666] rcu: Hierarchical SRCU implementation.
[    1.375666] NMI watchdog: Perf NMI watchdog permanently disabled
[    1.378999] smp: Bringing up secondary CPUs ...
[    1.380333] smp: Brought up 1 node, 1 CPU
[    1.380333] smpboot: Max logical packages: 2
[    1.380333] ----------------
[    1.380333] | NMI testsuite:
[    1.380333] --------------------
[    1.380333]   remote IPI:  ok  |
[    1.380333]    local IPI:
[   29.373333] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:1]
[   29.373666] Modules linked in:
[   29.373666] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.15.86-0-lts #1-Alpine
[   29.373666] EIP: test_nmi_ipi.constprop.0+0x99/0xe0
[   29.373666] Code: ed 87 ff 83 f8 40 75 09 89 d8 35 8e 02 f6 0e eb 1b 8d 83 1b be 94 ec 4e 74 12 b8 c7 10 00 00 81 eb 54 1a 54 50 e8 0c cb 8e ff <eb> c5 35 09 99 49 6c ba 9d 43 96 c6 89 c3 31 c0 e8 17 20 4f ff 8d
[   29.373666] EAX: 00000000 EBX: 83179bd0 ECX: 00000000 EDX: 00000000
[   29.373666] ESI: 000ed549 EDI: c6b90c04 EBP: c10f3f2c ESP: c10f3f24
[   29.373666] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00000246
[   29.373666] CR0: 80050033 CR2: ff9ba000 CR3: 06c0c000 CR4: 00000690
[   29.373666] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[   29.373666] DR6: 00000000 DR7: 00000000
[   29.373666] Call Trace:
[   29.373666]  local_ipi+0x32/0x4f
[   29.373666]  dotest.constprop.0+0x11/0xbb
[   29.373666]  nmi_selftest+0x83/0x1a3
[   29.373666]  native_smp_cpus_done+0x2a/0xc7
[   29.373666]  smp_init+0x7b/0x94
[   29.373666]  kernel_init_freeable+0x13f/0x28e
[   29.373666]  ? rest_init+0xb0/0xb0
[   29.373666]  kernel_init+0x17/0xf0
[   29.373666]  ret_from_fork+0x1c/0x30

The test is supposed to timeout, but occasionally the watchdog shows that it’s hitting udelay:

[  113.373666] EIP: delay_tsc+0x41/0xa0
[  113.373666] Code: 31 bf c6 89 45 f0 0f ae e8 0f 31 89 45 e0 89 55 e4 eb 17 8d b6 00 00 00 00 f3 90 64 8b 1d 3c 31 bf c6 39 5d f0 75 32 89 5d f0 <0f> ae e8 0f 31 8b 4d e8 8b 5d ec 89 45 d8 89 55 dc 2b 45 e0 1b 55
[  113.373666] EAX: ffffffff EBX: 00000000 ECX: 000003e9 EDX: 00000000
[  113.373666] ESI: 000d8da8 EDI: c6b90c04 EBP: c10f3f14 ESP: c10f3eec
[  113.373666] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00000246
[  113.373666] CR0: 80050033 CR2: ff9ba000 CR3: 06c0c000 CR4: 00000690
[  113.373666] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[  113.373666] DR6: 00000000 DR7: 00000000
[  113.373666] Call Trace:
[  113.373666]  __const_udelay+0x31/0x40
[  113.373666]  test_nmi_ipi.constprop.0+0x99/0xe0
[  113.373666]  local_ipi+0x32/0x4f
[  113.373666]  dotest.constprop.0+0x11/0xbb
[  113.373666]  nmi_selftest+0x83/0x1a3

In v86, IOAPIC_DELIVERY_NMI is stubbed:

https://github.com/copy/v86/blob/17a6b3b4e942ed04eb22098023d4b694f97020c0/src/apic.js#L436-L440

So it seems like a layered issue:

  • Kernel probably shouldn’t be calling nmi_selftest when SMP is disabled
  • The test timeout isn’t working, for some reason
  • IOAPIC_DELIVERY_NMI should probably be implemented (at least trivially)

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 19 (8 by maintainers)

Most upvoted comments

  • Compile a kernel without the nmi selftest

I recompiled Alpine Linux edge kernel without CONFIG_DEBUG_NMI_SELFTEST (virt: iso, package, lts: package), and it’s worked in Firefox 122.0.1 (Windows) on non-https copy.sh/v86:

1

Notes:

  • To install kernel to already installed system: apk add --allow-untrusted ./linux-<virt/lts>-6.6.16-r0.apk
  • Networking (ne2k-pci) and sound (sb16) working only on linux-lts, because linux-virt don’t have drivers
  • Instruction for building custom kernel: https://wiki.alpinelinux.org/wiki/Custom_Kernel (compiles lts and virt once together, needed installed Alpine Linux and 20GB+ storage)

It’s used roughly three places:

setTimeout might also be affected. v86 uses setTimeout instead of postMessage when the cpu is executing HLT and there are no timer interrupts coming up. If setTimeout is throttled to 16ms, it may sleep much longer than intended.

To narrow it down, I’d suggest:

  • Replacing microtick with a throttled version (function throttled_microtick() { return 16.67 * Math.round(microtick() / 16.67); }) in selected places
  • Disabling idling by commenting out setTimeout here

You can run this code to determine the resolution of performance.now(): x = new Set(); while(x.size < 5) x.add(performance.now()); Array.from(x).sort((a,b) => a-b).slice(-2).reduce((a,b) => b-a)

On Firefex (developer edition), I get 1ms on http sites and 0.02ms on https (with COOP/COEP).

The NMI test suite calls udelay 1000000 times: https://github.com/torvalds/linux/blob/28b8235/arch/x86/kernel/nmi_selftest.c#L82-L83 udelay on x86 is here: https://github.com/torvalds/linux/blob/28b8235/arch/x86/lib/delay.c#L207 udelay will repeatedly call rdtsc until it changes (some amount that should amount to 1us), but with the reduced resolution this will take (worst case) 16 minutes (1ms * 1000000) on Firefox on http and 20 seconds on https (0.02ms * 1000000). I can experimentally confirm the 20 second delay.

On Chromium, the resolution is higher (0.1ms and 0.005ms respectively). Maybe your Firefox has a different resolution or doesn’t accept the COOP/COEP headers on https://copy.sh?

Now regarding fixes, there are a couple of options:

  • Implement NMIs in the apic (so the test passes instead of timing out). Currently apic support is experimental and doesn’t boot Alpine at all. In any case, it would be nice if the timer issue could be fixed properly, as other OSes might also be affected (I believe KolibriOS has a similar loop in its boot process).
  • Do nothing and wait until Firefox gets a better timer resolution
  • Hack rdtsc to return higher values if called multiple times and the browser time hasn’t changed. There is some prototype that was enabled a long time ago (see https://github.com/copy/v86/blob/bfb59f9/src/rust/cpu/cpu.rs#L3972). However, this would mess with other uses of rdtsc. Maybe behind a flag.
  • Implement tpause (although it’s not entirely clear how, if the timer resolution sucks)
  • Change cpuid flags so that Linux stops using the tsc and uses the default delay_loop instead. In my experiments, setting cpuid_level to 0x14 makes Linux try to calibrate the tsc against the pic and fail (“Marking TSC unstable due to could not calculate TSC khz”), at least on Firefox where the timer resolution is 1ms
  • Compile a kernel without the nmi selftest

Or, since this is not really v86’s fault:

  • Ask Firefox developers to improve their timer resolution (do two browsers really need four different values?)
  • Send a patch to Alpine to disable the NMI selftest
  • Send a patch to Linux to only run the NMI selftest on systems that have an apic

When I create an image using the latest Alpine Virtual 3.19.1 it outputs a message similar to “watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:1]” from the OP over and over with the seconds number growing each time and never gets to anything usable. Maybe I missed something?

@spetterman66 also seemed to confirm that only old versions work in an earlier message…