zephyr: intermittent SMP crashes on x86_64

I’m seeing sporadic crashes on x86_64.

These crashes seem to have the following characteristics:

  1. Instruction pointer (RIP) is NULL
  2. It seems to happen when main is creating new child threads to run test cases, but I haven’t been able to pinpoint where or get a stack trace

Here’s an example, but I have seen this occur in a lot of tests:

*** Booting Zephyr OS build zephyr-v2.1.0-238-g5abb770487f7  ***
Running test suite test_sprintf
===================================================================
starting test - test_sprintf_double
SKIP - test_sprintf_double
===================================================================
starting test - test_sprintf_integer
E: ***** CPU Page Fault (error code 0x0000000000000010)
E: Supervisor thread executed address 0x0000000000000000
E: PML4E: 0x000000000011a827 Writable, User, Execute Enabled
E: PDPTE: 0x0000000000119827 Writable, User, Execute Enabled
E:   PDE: 0x0000000000118827 Writable, User, Execute Enabled
E:   PTE: Non-present
E: RAX: 0x0000000000000008 RBX: 0x0000000000000000 RCX: 0x00000000000f4240 RDX: 0x0000000000000000
E: RSI: 0x0000000000127000 RDI: 0x0000000000002710 RBP: 0x0000000000000000 RSP: 0x0000000000126fb0
E:  R8: 0x000000000011cd0c  R9: 0x0000000000000000 R10: 0x0000000000000000 R11: 0x0000000000000000
E: R12: 0x0000000001000000 R13: 0x0000000000000000 R14: 0x0000000000000000 R15: 0x0000000000000000
E: RSP: 0x0000000000126fb0 RFLAGS: 0x0000000000000202 CS: 0x0018 CR3: 0x000000000010a000
E: call trace:
E: RIP: 0x0000000000000000
E: NULL base ptr
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 1
E: Current thread: 0x000000000011c8a0 (main)
E: Halting system

I started noticing this after I enabled boot page tables. It’s unclear whether my work introduced it or whether it was already present, though I’m starting to suspect the latter, since the code I brought in works fine for 32-bit.

Because sanitycheck automatically retries failed test cases (see #14173), this has gone undetected in CI.

Most upvoted comments

OK, we’re getting farther. There’s another, somewhat similar race involved with waiting for a z_swap() to complete. When swap begins, it does the scheduler work involved with re-queuing the _current thread (inside the scheduler lock, of course), and then it enters arch_switch() to do the actual context switch. But in the cycles between those two steps, the old/switching-from thread is in the queue and able to be run on the other CPU, despite the fact that it won’t have its registers saved until somewhere in the middle of arch_switch!
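
For illustration, here’s a stand-alone schematic of that window on the switching-out CPU. This isn’t Zephyr code; requeue_current() and save_registers() are stand-in names for the two steps described above:

/* CPU 0: thread A calls z_swap() on its way to thread B */
static void requeue_current(void) { /* (a) A goes back on the run queue, under the sched lock */ }
static void save_registers(void)  { /* (c) A's callee-saved registers get spilled to its thread struct */ }

void z_swap_sketch(void)
{
        requeue_current();      /* A is now visible to, and runnable on, the other CPU... */

        /* (b) window: the other CPU's scheduler can dequeue A right here and
         * resume it from a stale saved context, while A's live registers are
         * still sitting in CPU 0's register file.
         */

        save_registers();       /* ...but they are only written out here, inside arch_switch() */
}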

A PoC “fix” for x86_64 appears below. Basically it stuffs a magic cookie into the last field saved in arch_switch(), and the scheduler spins until that cookie has been overwritten before handing the thread back out. Applying that results in an almost 100% reliable sanitycheck run (well under one failure per run, anyway, which is around the threshold where measurement gets confounded by the known timing slop).

Now I just need to find a way to do this portably and simply without putting too many weird requirements on the architecture layer…

diff --git a/arch/x86/core/intel64/locore.S b/arch/x86/core/intel64/locore.S
index e38a0fa993..5140c3a0e8 100644
--- a/arch/x86/core/intel64/locore.S
+++ b/arch/x86/core/intel64/locore.S
@@ -212,7 +212,6 @@ z_x86_switch:
        movq %r12, _thread_offset_to_r12(%rsi)
        movq %r13, _thread_offset_to_r13(%rsi)
        movq %r14, _thread_offset_to_r14(%rsi)
-       movq %r15, _thread_offset_to_r15(%rsi)
 #ifdef CONFIG_USERSPACE
        /* We're always in supervisor mode if we get here, the other case
         * is when __resume is invoked from irq_dispatch
@@ -220,6 +219,8 @@ z_x86_switch:
        movq $X86_KERNEL_CS, _thread_offset_to_cs(%rsi)
        movq $X86_KERNEL_DS, _thread_offset_to_ss(%rsi)
 #endif
+       // HACK: move R15 to the end so it's the last value saved
+       movq %r15, _thread_offset_to_r15(%rsi)
        movq %gs:__x86_tss64_t_ist1_OFFSET, %rsp
 
        /* fall through to __resume */
diff --git a/kernel/include/kswap.h b/kernel/include/kswap.h
index 3537a24d21..e98a6cf609 100644
--- a/kernel/include/kswap.h
+++ b/kernel/include/kswap.h
@@ -57,8 +57,19 @@ static ALWAYS_INLINE unsigned int do_swap(unsigned int key,
                k_spin_release(lock);
        }
 
+       // HACK: poison the last value saved in arch_switch, so we can
+       // tell when it's done (i.e. when this magic number gets
+       // clobbered)
+       u64_t old_r15 = old_thread->callee_saved.r15;
+       old_thread->callee_saved.r15 = 0xff55aa11;
+
        new_thread = z_get_next_ready_thread();
 
+       if (new_thread == old_thread) {
+               // (HACK: not switching, put it back)
+               _current->callee_saved.r15 = old_r15;
+       }
+
        if (new_thread != old_thread) {
                sys_trace_thread_switched_out();
 #ifdef CONFIG_TIMESLICING
diff --git a/kernel/sched.c b/kernel/sched.c
index 2ebc9d4be9..007fc5bab4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -248,6 +248,12 @@ static ALWAYS_INLINE struct k_thread *next_up(void)
        }
        z_mark_thread_as_not_queued(thread);
 
+       // HACK: spin for an incoming switch
+       if (thread != _current) {
+               volatile u64_t *xx = &thread->callee_saved.r15;
+               while (*xx == 0xff55aa11);
+       }
+
        return thread;
 #endif
 }
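
For what it’s worth, the same handshake can be written down generically with C11 atomics, which also makes the required memory ordering explicit. The sketch below is purely illustrative (stand-alone, not Zephyr code, and not a claim about what the eventual fix looks like): the outgoing CPU publishes the saved context with its final store, and a CPU that wants to run the thread spins until it observes that store.

#include <stdatomic.h>
#include <stdint.h>

#define POISON 0xff55aa11u   /* same cookie as the PoC; assumes the real value never equals it */

struct saved_ctx {
        uint64_t regs[14];             /* callee-saved registers except the last one */
        _Atomic uint64_t last_reg;     /* stand-in for the r15 slot used above */
};

/* Outgoing CPU: the equivalent of do_swap() poisoning the slot and
 * arch_switch() saving it last. */
static void save_context(struct saved_ctx *ctx, const uint64_t *regs, uint64_t last)
{
        atomic_store_explicit(&ctx->last_reg, POISON, memory_order_relaxed);
        for (int i = 0; i < 14; i++) {
                ctx->regs[i] = regs[i];
        }
        /* Release store: once another CPU sees a non-POISON value here,
         * every register stored above is guaranteed to be visible too. */
        atomic_store_explicit(&ctx->last_reg, last, memory_order_release);
}

/* Incoming CPU: the equivalent of the spin added to next_up() above. */
static void wait_for_context(struct saved_ctx *ctx)
{
        while (atomic_load_explicit(&ctx->last_reg, memory_order_acquire) == POISON) {
                /* spin: the other CPU hasn't finished saving this thread yet */
        }
}

Whether a wait like this lives in the scheduler or behind a small arch-provided hook is exactly the “portably and simply” question above.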