v86: Kernel hang with CONFIG_DEBUG_NMI_SELFTEST (Alpine Linux)
When booting Alpine Linux in v86 (default kernel compiled with CONFIG_DEBUG_NMI_SELFTEST=y), the kernel softlocks during nmi_selftest:
[ 1.350333] smpboot: weird, boot CPU (#0) not listed by the BIOS
[ 1.350333] smpboot: SMP disabled
[ 1.363333] Performance Events: unsupported Netburst CPU model 6 no PMU driver, software events only.
[ 1.365666] rcu: Hierarchical SRCU implementation.
[ 1.375666] NMI watchdog: Perf NMI watchdog permanently disabled
[ 1.378999] smp: Bringing up secondary CPUs ...
[ 1.380333] smp: Brought up 1 node, 1 CPU
[ 1.380333] smpboot: Max logical packages: 2
[ 1.380333] ----------------
[ 1.380333] | NMI testsuite:
[ 1.380333] --------------------
[ 1.380333] remote IPI: ok |
[ 1.380333] local IPI:
[ 29.373333] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:1]
[ 29.373666] Modules linked in:
[ 29.373666] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.15.86-0-lts #1-Alpine
[ 29.373666] EIP: test_nmi_ipi.constprop.0+0x99/0xe0
[ 29.373666] Code: ed 87 ff 83 f8 40 75 09 89 d8 35 8e 02 f6 0e eb 1b 8d 83 1b be 94 ec 4e 74 12 b8 c7 10 00 00 81 eb 54 1a 54 50 e8 0c cb 8e ff <eb> c5 35 09 99 49 6c ba 9d 43 96 c6 89 c3 31 c0 e8 17 20 4f ff 8d
[ 29.373666] EAX: 00000000 EBX: 83179bd0 ECX: 00000000 EDX: 00000000
[ 29.373666] ESI: 000ed549 EDI: c6b90c04 EBP: c10f3f2c ESP: c10f3f24
[ 29.373666] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00000246
[ 29.373666] CR0: 80050033 CR2: ff9ba000 CR3: 06c0c000 CR4: 00000690
[ 29.373666] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 29.373666] DR6: 00000000 DR7: 00000000
[ 29.373666] Call Trace:
[ 29.373666] local_ipi+0x32/0x4f
[ 29.373666] dotest.constprop.0+0x11/0xbb
[ 29.373666] nmi_selftest+0x83/0x1a3
[ 29.373666] native_smp_cpus_done+0x2a/0xc7
[ 29.373666] smp_init+0x7b/0x94
[ 29.373666] kernel_init_freeable+0x13f/0x28e
[ 29.373666] ? rest_init+0xb0/0xb0
[ 29.373666] kernel_init+0x17/0xf0
[ 29.373666] ret_from_fork+0x1c/0x30
The test is supposed to time out, but occasionally the watchdog shows that it’s hitting udelay:
[ 113.373666] EIP: delay_tsc+0x41/0xa0
[ 113.373666] Code: 31 bf c6 89 45 f0 0f ae e8 0f 31 89 45 e0 89 55 e4 eb 17 8d b6 00 00 00 00 f3 90 64 8b 1d 3c 31 bf c6 39 5d f0 75 32 89 5d f0 <0f> ae e8 0f 31 8b 4d e8 8b 5d ec 89 45 d8 89 55 dc 2b 45 e0 1b 55
[ 113.373666] EAX: ffffffff EBX: 00000000 ECX: 000003e9 EDX: 00000000
[ 113.373666] ESI: 000d8da8 EDI: c6b90c04 EBP: c10f3f14 ESP: c10f3eec
[ 113.373666] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00000246
[ 113.373666] CR0: 80050033 CR2: ff9ba000 CR3: 06c0c000 CR4: 00000690
[ 113.373666] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 113.373666] DR6: 00000000 DR7: 00000000
[ 113.373666] Call Trace:
[ 113.373666] __const_udelay+0x31/0x40
[ 113.373666] test_nmi_ipi.constprop.0+0x99/0xe0
[ 113.373666] local_ipi+0x32/0x4f
[ 113.373666] dotest.constprop.0+0x11/0xbb
[ 113.373666] nmi_selftest+0x83/0x1a3
In v86, IOAPIC_DELIVERY_NMI is stubbed:
https://github.com/copy/v86/blob/17a6b3b4e942ed04eb22098023d4b694f97020c0/src/apic.js#L436-L440
So it seems like a layered issue:
- Kernel probably shouldn’t be calling nmi_selftest when SMP is disabled
- The test timeout isn’t working, for some reason
- IOAPIC_DELIVERY_NMI should probably be implemented (at least trivially)
About this issue
- State: open
- Created a year ago
- Comments: 19 (8 by maintainers)
I recompiled the Alpine Linux edge kernel without CONFIG_DEBUG_NMI_SELFTEST (virt: iso, package, lts: package), and it worked in Firefox 122.0.1 (Windows) on non-https copy.sh/v86.
Notes:
- apk add --allow-untrusted ./linux-<virt/lts>-6.6.16-r0.apk
It’s used in roughly three places:
- [...] TSC_RATE): https://github.com/copy/v86/blob/21cf9ad/src/rust/cpu/cpu.rs#L3966
setTimeout might also be affected. v86 uses setTimeout instead of postMessage when the CPU is executing HLT and there are no timer interrupts coming up. If setTimeout is throttled to 16ms, it may sleep much longer than intended.
To narrow it down, I’d suggest:
- Replacing microtick with a throttled version (function throttled_microtick() { return 16.67 * Math.round(microtick() / 16.67); }) in selected places
- [...] setTimeout here
You can run this code to determine the resolution of performance.now():
x = new Set(); while(x.size < 5) x.add(performance.now()); Array.from(x).sort((a,b) => a-b).slice(-2).reduce((a,b) => b-a)
On Firefox (developer edition), I get 1ms on http sites and 0.02ms on https (with COOP/COEP).
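A minimal sketch of the two ideas above (throttling a clock and measuring its resolution), written so it can be checked without a browser. The names makeThrottledClock, measureResolution and makeFakeClock are hypothetical helpers, not part of v86; the fake clock stands in for microtick or performance.now(), and the 16ms step is just an illustrative value. Unlike the one-liner, this version reports the smallest gap between observed values rather than the last one.

```javascript
// Wrap a clock so it only advances in multiples of stepMs,
// mimicking a browser that coarsens its timers.
function makeThrottledClock(clock, stepMs) {
    return function throttledClock() {
        return stepMs * Math.round(clock() / stepMs);
    };
}

// Estimate a clock's resolution: poll until `samples` distinct values
// have been seen, then return the smallest gap between neighbours.
function measureResolution(clock, samples = 5) {
    const seen = new Set();
    while (seen.size < samples) seen.add(clock());
    const sorted = Array.from(seen).sort((a, b) => a - b);
    let smallest = Infinity;
    for (let i = 1; i < sorted.length; i++) {
        smallest = Math.min(smallest, sorted[i] - sorted[i - 1]);
    }
    return smallest;
}

// Hypothetical deterministic clock for demonstration:
// every call advances time by incrementMs.
function makeFakeClock(incrementMs) {
    let t = 0;
    return () => (t += incrementMs);
}

const fineClock = makeFakeClock(0.25);
const coarseClock = makeThrottledClock(makeFakeClock(0.25), 16);

console.log(measureResolution(fineClock));   // 0.25
console.log(measureResolution(coarseClock)); // 16
```

To measure the real timer instead of the fake one, pass `() => performance.now()` as the clock.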
The NMI test suite calls udelay up to 1000000 times: https://github.com/torvalds/linux/blob/28b8235/arch/x86/kernel/nmi_selftest.c#L82-L83
udelay on x86 is here: https://github.com/torvalds/linux/blob/28b8235/arch/x86/lib/delay.c#L207
udelay will repeatedly call rdtsc until the value has advanced by an amount that should correspond to 1us, but with the reduced timer resolution each call can take a full timer tick, so in the worst case the loop takes about 16 minutes (1ms * 1000000) on Firefox on http and 20 seconds on https (0.02ms * 1000000). I can experimentally confirm the 20 second delay.
On Chromium, the resolution is higher (0.1ms and 0.005ms respectively). Maybe your Firefox has a different resolution or doesn’t accept the COOP/COEP headers on https://copy.sh?
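The arithmetic above can be reproduced with a toy model. This is a simplified sketch, not the kernel's delay_tsc: worstCaseSeconds and simulateDelayLoop are hypothetical names, and the model assumes the guest clock only advances in whole ticks of the host timer resolution, so each short delay spins until the next tick.

```javascript
// Worst case: every udelay(1) call waits one full tick of the host timer.
function worstCaseSeconds(iterations, resolutionMs) {
    return (iterations * resolutionMs) / 1000;
}

// Toy busy-wait loop on a quantized clock. realTime is the "true" time in ms;
// read() is what the guest sees, rounded down to a multiple of resolutionMs.
// pollCostMs is an assumed cost of one clock read.
function simulateDelayLoop(iterations, delayMs, resolutionMs, pollCostMs) {
    let realTime = 0;
    const read = () => Math.floor(realTime / resolutionMs) * resolutionMs;
    for (let i = 0; i < iterations; i++) {
        const start = read();
        // Spin until the visible clock has advanced by at least delayMs.
        while (read() - start < delayMs) realTime += pollCostMs;
    }
    return realTime;
}

// 1000000 iterations at 1ms resolution: about 1000 s, roughly 16.7 minutes.
console.log(worstCaseSeconds(1000000, 1));
// 1000000 iterations at 0.02ms resolution: about 20 s.
console.log(worstCaseSeconds(1000000, 0.02));

// 1000 delays of 1us each should take 1ms in total, but on a clock with
// 1ms ticks every delay lasts a whole tick.
console.log(simulateDelayLoop(1000, 0.001, 1, 0.125)); // 1000
```

In this model the total time scales with the timer resolution rather than with the requested delay, which matches the difference between the http and https figures.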
Now regarding fixes, there are a couple of options:
- [...] cpuid_level to 0x14 makes Linux try to calibrate the tsc against the pic and fail (“Marking TSC unstable due to could not calculate TSC khz”), at least on Firefox where the timer resolution is 1ms
Or, since this is not really v86’s fault:
When I create an image using the latest Alpine Virtual 3.19.1 it outputs a message similar to “watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:1]” from the OP over and over with the seconds number growing each time and never gets to anything usable. Maybe I missed something?
@spetterman66 also seemed to confirm that only old versions work in an earlier message…