zephyr: intermittent SMP crashes on x86_64
I’m seeing some sporadic crashes on x86_64.
These crashes seem to have the following characteristics:
- Instruction pointer (RIP) is NULL
- It seems to happen when main is creating new child threads to run test cases, but I haven’t been able to pinpoint where or get a stack trace
Here’s an example, but I have seen this occur in a lot of tests:
*** Booting Zephyr OS build zephyr-v2.1.0-238-g5abb770487f7 ***
Running test suite test_sprintf
===================================================================
starting test - test_sprintf_double
SKIP - test_sprintf_double
===================================================================
starting test - test_sprintf_integer
E: ***** CPU Page Fault (error code 0x0000000000000010)
E: Supervisor thread executed address 0x0000000000000000
E: PML4E: 0x000000000011a827 Writable, User, Execute Enabled
E: PDPTE: 0x0000000000119827 Writable, User, Execute Enabled
E: PDE: 0x0000000000118827 Writable, User, Execute Enabled
E: PTE: Non-present
E: RAX: 0x0000000000000008 RBX: 0x0000000000000000 RCX: 0x00000000000f4240 RDX: 0x0000000000000000
E: RSI: 0x0000000000127000 RDI: 0x0000000000002710 RBP: 0x0000000000000000 RSP: 0x0000000000126fb0
E: R8: 0x000000000011cd0c R9: 0x0000000000000000 R10: 0x0000000000000000 R11: 0x0000000000000000
E: R12: 0x0000000001000000 R13: 0x0000000000000000 R14: 0x0000000000000000 R15: 0x0000000000000000
E: RSP: 0x0000000000126fb0 RFLAGS: 0x0000000000000202 CS: 0x0018 CR3: 0x000000000010a000
E: call trace:
E: RIP: 0x0000000000000000
E: NULL base ptr
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 1
E: Current thread: 0x000000000011c8a0 (main)
E: Halting system
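For readers decoding the fault above: the error code 0x10 sets only the instruction-fetch bit of the x86 page-fault error code, which fits a supervisor thread jumping to address 0 on a non-present page. Below is a minimal decoding sketch. The bit positions follow the architectural page-fault error code layout, but the `pf_info` struct and `pf_decode()` helper are hypothetical names invented for illustration and are not part of Zephyr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 page-fault error code bits. */
#define PF_P    (1u << 0)  /* 0 = non-present page, 1 = protection violation */
#define PF_WR   (1u << 1)  /* 1 = access was a write */
#define PF_US   (1u << 2)  /* 1 = access came from user mode */
#define PF_RSVD (1u << 3)  /* 1 = reserved bit set in a paging entry */
#define PF_ID   (1u << 4)  /* 1 = access was an instruction fetch */

/* The error code in the log is exactly the instruction-fetch bit. */
static_assert(PF_ID == 0x10, "0x10 is the instruction-fetch bit");

/* Hypothetical helper type: decoded view of the error code. */
struct pf_info {
    bool present;
    bool write;
    bool user;
    bool reserved;
    bool instruction_fetch;
};

/* Hypothetical helper: split the error code into its flag bits. */
static struct pf_info pf_decode(uint64_t err)
{
    struct pf_info info = {
        .present           = (err & PF_P) != 0,
        .write             = (err & PF_WR) != 0,
        .user              = (err & PF_US) != 0,
        .reserved          = (err & PF_RSVD) != 0,
        .instruction_fetch = (err & PF_ID) != 0,
    };
    return info;
}
```

Decoding 0x10 this way yields an instruction fetch from supervisor mode on a non-present page, matching the "Supervisor thread executed address 0x0000000000000000" and "PTE: Non-present" lines in the dump.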
I started noticing this after I enabled boot page tables. It’s unclear whether my work introduced the problem or merely exposed one that was already present, though I’m starting to suspect the latter, since the code I brought in works fine on 32-bit.
Because sanitycheck automatically retries failed test cases (see #14173), this has gone undetected in CI.
About this issue
- State: closed
- Created 5 years ago
- Comments: 15 (14 by maintainers)
Commits related to this issue
- x86: qemu_x86_64: workaround SMP issues in x86 We have some races causing random failures with this platform, set cpu number to one while we investigate and fix the issue. Related to #21317 Signed-... — committed to nashif/zephyr by nashif 4 years ago
- x86: qemu_x86_64: workaround SMP issues in x86 We have some races causing random failures with this platform, set cpu number to one while we investigate and fix the issue. Related to #21317 Signed-... — committed to zephyrproject-rtos/zephyr by nashif 4 years ago
OK, we’re getting farther. There’s another, somewhat similar race, this one involving waiting for a z_swap() to complete. When a swap begins, it does the scheduler work of re-queuing the _current thread (inside the scheduler lock, of course), and then it enters arch_switch() to do the actual context switch. But in the cycles between those two steps, the old (switching-from) thread is already in the queue and can be picked up and run on the other CPU, even though its registers won’t be saved until somewhere in the middle of arch_switch()!
A PoC “fix” for x86_64 appears below. Basically, it stuffs a magic cookie into the last field saved during the switch, and the scheduler spins until it sees that cookie before returning the thread. Applying that results in an almost 100% reliable sanitycheck run (well under one failure per run, anyway, which is around the threshold where measurement gets confounded by the known timing slop).
Now I just need to find a way to do this portably and simply without putting too many weird requirements on the architecture layer…
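For illustration only, here is a small user-space sketch of the cookie-and-spin idea described in this comment. It is not the PoC patch mentioned above and is not Zephyr code: two pthreads stand in for the two CPUs, and `saved_ctx`, `SWITCH_DONE_COOKIE`, `HYPOTHETICAL_RIP`, and all function names are hypothetical. The outgoing side writes the cookie as the very last step of its register save (with release ordering), and the incoming side spins (with acquire ordering) until the cookie is visible before touching the saved state.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define SWITCH_DONE_COOKIE 0xC001D00Du  /* hypothetical magic value */
#define HYPOTHETICAL_RIP   0x1000u      /* made-up saved instruction pointer */

/* Hypothetical stand-in for a thread's saved register context. */
struct saved_ctx {
    uint64_t rsp;
    uint64_t rip;
    _Atomic uint32_t last_saved;  /* last field written by the save path */
};

/* Done when the thread is re-queued, before its registers are saved. */
static void ctx_mark_unsaved(struct saved_ctx *ctx)
{
    atomic_store_explicit(&ctx->last_saved, 0, memory_order_release);
}

/* Outgoing CPU: save the registers, then publish the cookie last. */
static void ctx_save(struct saved_ctx *ctx, uint64_t rsp, uint64_t rip)
{
    ctx->rsp = rsp;
    ctx->rip = rip;
    atomic_store_explicit(&ctx->last_saved, SWITCH_DONE_COOKIE,
                          memory_order_release);
}

/* Incoming CPU: spin until the save is known to be complete. */
static void ctx_wait_saved(struct saved_ctx *ctx)
{
    while (atomic_load_explicit(&ctx->last_saved, memory_order_acquire)
           != SWITCH_DONE_COOKIE) {
        /* busy-wait; the window is expected to be only a few cycles */
    }
}

/* Thread body playing the CPU that is switching away from the thread. */
static void *outgoing_cpu(void *arg)
{
    ctx_save(arg, 0x2000u, HYPOTHETICAL_RIP);
    return NULL;
}

/* Plays the other CPU: picks the thread up, waits, then reads its RIP. */
static uint64_t simulate_handoff(void)
{
    struct saved_ctx ctx = { 0 };
    pthread_t t;

    ctx_mark_unsaved(&ctx);
    int rc = pthread_create(&t, NULL, outgoing_cpu, &ctx);
    assert(rc == 0);
    (void)rc;

    ctx_wait_saved(&ctx);       /* without this, rip could still be 0 */
    uint64_t rip = ctx.rip;

    pthread_join(t, NULL);
    return rip;
}
```

Calling `simulate_handoff()` returns the saved instruction pointer only after the cookie has been observed. Dropping the `ctx_wait_saved()` call would allow the reader to see a zero RIP, which is the kind of NULL-RIP symptom reported at the top of this issue. In a real kernel the spin would sit in the scheduler or arch layer rather than in a helper like this, which is the portability question raised above.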