zephyr: kernel: thread: race condition between create and join

Describe the bug While investigating #56163 and digging through pthreads, testing showed that k_thread_create() and k_thread_join() also exhibit a race condition when the same struct k_thread objects are re-used over and over again.

It’s not something regularly seen in production at the moment, and was only detected by accident. Originally reported here.

On the kernel side, this mainly seems to be an issue with SMP platforms.
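
For illustration, the pattern in question looks roughly like the sketch below (a minimal sketch with made-up names, not the actual test code; the real stress test lives in #58115 and in the reproducer further down):

#include <zephyr/kernel.h>

/* Sketch only: recycle the same struct k_thread and stack in a tight
 * create/join loop. On SMP, k_thread_join() can return as soon as the
 * worker is marked dead while another CPU is still switching away from
 * it, so the immediate re-create races with the teardown.
 */
struct k_thread worker;
K_KERNEL_STACK_DEFINE(worker_stack, 1024);

void worker_fn(void *a, void *b, void *c)
{
	/* return immediately so the thread dies right away */
}

void churn(void)
{
	while (true) {
		k_thread_create(&worker, worker_stack,
				K_KERNEL_STACK_SIZEOF(worker_stack),
				worker_fn, NULL, NULL, NULL,
				K_PRIO_PREEMPT(0), 0, K_NO_WAIT);
		k_thread_join(&worker, K_FOREVER);
	}
}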

Please also mention any information which could help others to understand the problem you’re facing:

  • What target platform are you using? qemu_x86_64, qemu_cortex_a53_smp, qemu_riscv64_smp, qemu_riscv32_smp
  • What have you tried to diagnose or work around this issue? Wrote a test suite (#58115). Note: pthreads are currently disabled in that suite, so all failures are k_thread failures at the moment. It happens on all libc configurations and most SMP platforms.
  • Is this a regression? Probably not, although it’s hard to say.

To Reproduce Steps to reproduce the behavior:

  1. twister -i -T tests/posix/pthread_pressure
  2. see errors

Expected behavior Tests passing ideally 100% of the time on all platforms.

Impact The behavior is the opposite of what several contributors and maintainers expect, and is possibly just a corner case that has not received much attention so far.

Logs and console output E.g.

ERROR   - *** Booting Zephyr OS build zephyr-v3.3.0-4124-gf67ff9c38640 ***
Running TESTSUITE pthread_pressure
===================================================================
START - test_k_thread_create_join
I: NUM_THREADS: 2
I: TEST_NUM_CPUS: 2
I: TEST_DURATION_S: 10
I: TEST_DELAY_US: 0
ASSERTION FAIL [0] @ WEST_TOPDIR/zephyr/kernel/sched.c:1785
	aborted _current back from dead
E:      a0: 0000000000000004    t0: 0000000000000000
E:      a1: 00000000000006f9    t1: 0000000000000009
E:      a2: 0000000080009d98    t2: 0000000000000000
E:      a3: 0000000000000000    t3: 0000000000000001
E:      a4: 0000000000000000    t4: 0000000000000023
E:      a5: 0000000000000001    t5: 000000008000b2b0
E:      a6: 0000000000000001    t6: 0000000080006514
E:      a7: 0000000000000001
E:      ra: 00000000800063c4
E:    mepc: 00000000800017e0
E: mstatus: 0000000a00021880
E: 
E: >>> ZEPHYR FATAL ERROR 4: Kernel panic on CPU 0
E: Current thread: (nil) (unknown)
E: Halting system

Environment (please complete the following information):

  • OS: Linux
  • Toolchain: Zephyr SDK v0.16.1
  • Commit SHA or Version used: d01780fc9403166d250ded3d02372a7e1a9db4fd (main), v2.7.4

Additional context #56163 #57637

About this issue

  • State: closed
  • Created a year ago
  • Comments: 27 (18 by maintainers)

Most upvoted comments

So, I think the problem is that there is a race between z_thread_abort() and k_thread_join().

What happens is:

  1. A thread is marked as dead on one core at: https://github.com/zephyrproject-rtos/zephyr/blob/7457bcf0a2b3d3523b8b3ea971771337f3751f1f/kernel/sched.c#L1782 Note that this happens before any arch_switch() has actually taken place.
  2. As soon as it is marked as such, k_thread_join() is unblocked on a different CPU: https://github.com/zephyrproject-rtos/zephyr/blob/7457bcf0a2b3d3523b8b3ea971771337f3751f1f/kernel/sched.c#L1808-L1809 and a new thread (re-using the same struct k_thread as the dying one) is created and scheduled before the dead one has actually been swapped out.

This patch makes the sample run fine for me:

diff --git a/kernel/sched.c b/kernel/sched.c
index a13f94bf3a..460767438d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1806,6 +1806,7 @@ int z_impl_k_thread_join(struct k_thread *thread, k_timeout_t timeout)
        SYS_PORT_TRACING_OBJ_FUNC_ENTER(k_thread, join, thread, timeout);
 
        if ((thread->base.thread_state & _THREAD_DEAD) != 0U) {
+               wait_for_switch(thread);
                ret = 0;
        } else if (K_TIMEOUT_EQ(timeout, K_NO_WAIT)) {
                ret = -EBUSY;
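
For context, wait_for_switch() is, as I read it, a spin on the outgoing thread’s switch handle, which arch_switch() publishes only once the old context has been fully saved. A paraphrased sketch of the idea (not the literal kernel/sched.c code, which may differ in details):

#include <zephyr/kernel.h>

/* Paraphrase of the concept behind wait_for_switch(). Under
 * CONFIG_USE_SWITCH, arch_switch() writes thread->switch_handle only
 * after the outgoing context has been saved, so spinning until it is
 * non-NULL guarantees the dead thread is really off its CPU before the
 * caller touches the k_thread object again.
 */
static inline void wait_for_switch_sketch(struct k_thread *thread)
{
#ifdef CONFIG_USE_SWITCH
	volatile void **shp = (void *)&thread->switch_handle;

	while (*shp == NULL) {
		/* busy-wait; the switch on the other CPU is imminent */
	}
#endif
}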

But to be clear, I’m convinced this is it. It’s exactly the same race we have in context switch, and frankly your fix is exactly right. If it’s OK with you I’m going to clean it up with some refactoring and clearer docs to make it a more proper-looking scheduler API. But I promise I’ll credit you. 😃 Or you can just submit that as-is and I can follow up with the cleanup later.

FWIW: this is also a good opportunity to add a call to the shiny new barrier API, which this layer needs theoretically (though on qemu it’s likely not emulated correctly and probably not part of the problem here).
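
If it helps, here is a hedged sketch of what I would imagine that could look like (the helper name is made up, and placing the fence after the spin is my reading rather than a confirmed fix); barrier_dmem_fence_full() comes from the new <zephyr/sys/barrier.h> API:

#include <zephyr/sys/barrier.h>

/* Hypothetical placement: once the spin on the dying thread's switch
 * handle completes, a full data-memory fence keeps the joiner's later
 * loads/stores (e.g. re-initializing the k_thread object) from being
 * reordered ahead of it on weakly ordered SMP targets such as arm64.
 */
static inline void join_sync_fence(void)
{
	barrier_dmem_fence_full();
}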

FWIW: I just killed the RISC-V processes I had running from yesterday after 25 hours and no failures.

@povergoing thanks for the confirmation on FVP. And while the data point is only one sample, that’s about the reproduction rate @npitre and I measured. Frustratingly it looks like FVP is running this code about 30x slower than qemu.

Interestingly though, FVP is (I think?) more cycle-exact than qemu and doesn’t have the wild qemu time warp behavior due to scheduling CPUs on host threads. That the behavior is basically the same argues that the problems @cfriedt is seeing with CI/twister in the pthread PR are a different bug.

Been running these all morning collecting statistics. Both qemu_riscv32_smp and qemu_riscv64_smp have been running without trouble for 6.1 hours now, reaching 1.84G and 1.49G joins respectively (qemu runs at different rates for different architectures). To the extent we can measure right now, joins on these architectures are 100% bug free.

As mentioned, arm64 is the problem child. The rig above dies (usually just exiting qemu, sometimes with a deadlock) relatively promptly for me. Over a few dozen runs I measured an average uptime of 6.5M joins and 72 seconds.

But x86_64 has a real, measurable issue too: checking and restarting regularly, I’ve seen a total of seven individual failures with an average uptime of 344M joins in 38 minutes (it turns out that x86 is about 60% faster in qemu, which I guess isn’t surprising given that it matches the host architecture).

Not sure where to go from here though beyond careful reading of the platform interrupt and context switch code…

I can confirm the same experience: reproduced 3 times on ARM64 within 2 minutes (twice within 1 minute). Not reproducible on RISC-V after several hours.

Obviously the reported error dump above comes from RISC-V.

@cfriedt: Can you test with the simplified code from @andyross and confirm whether you still get a failure on RISC-V?

Oddly, qemu just exits and doesn’t print a panic; did we break something with panics in hello_world apps?

By default the hello_world app is configured with CONFIG_LOG and CONFIG_ASSERT unset. You need at least the former to get a printed panic.
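
Concretely, for a hello_world-derived app, something along these lines in prj.conf should be enough to get the assertion and panic output (a sketch; the exact option set may vary):

# CONFIG_ASSERT enables the "aborted _current back from dead" check in
# sched.c; CONFIG_LOG is needed to get the fatal error output printed.
CONFIG_LOG=y
CONFIG_ASSERT=y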

OK, I have this sorta reproduced, but only on ARM. The test code as it is in the current PR doesn’t run long enough for me to see failures, does it for you? If I crank the runtime up to an hour, I can see qemu_cortex_a53_smp panic fairly reliably after 2-10 minutes and ~2M iterations (pretty randomly, with a lot of spread).

I pushed the runtime all the way up to the 6-hour maximum and let it run at bedtime. And I saw one panic on qemu_x86_64[1] after ~5 hours. I haven’t seen a failure on RISC-V at all.

The code below is the hello-world-ized variant I’m using right now; it eliminates the yield/delay steps, the dependence on logging and pthreads, etc. It fails pretty reliably on a53, usually after about a minute. (Oddly, qemu just exits and doesn’t print a panic; did we break something with panics in hello_world apps?) As with the original, it’s so far working reliably on the other SMP platforms.

So… basically my gut says that we have an SMP glitch in the arm64 code, almost certainly in interrupt entry/exit or context switch. It really doesn’t seem to be a race in the kernel per se. One theory, inspired by the recent discovery that x86 qemu correctly implements an obscure 1-instruction interrupt shadow, is that qemu is actually doing pedantic memory reordering and we’re missing some barrier instructions. Of all our platforms, arm64 has by far the trickiest memory model.

Unless there’s an easier way to repro this on other platforms?

[1] The huge difference in scale here points to this being a separate bug, IMHO. But once you get this far down to failures in the one-in-a-billion range, we need to start accepting that bugs in qemu or memory glitches on the host processor are possibilities.

#include <zephyr/kernel.h>

#define NTHREAD 2
#define STACKSZ 4096
#define LOG_INTERVAL_MS 2000

struct k_thread threads[NTHREAD];
K_KERNEL_STACK_ARRAY_DEFINE(stacks, NTHREAD, STACKSZ);

volatile int alive[NTHREAD];

void thread_fn(void *a, void *b, void *c)
{
	alive[(long)a] = true;
}

void spawn(int i)
{
	k_thread_create(&threads[i], stacks[i], STACKSZ,
			thread_fn, (void *)(long)i, NULL, NULL,
			-2, 0, K_NO_WAIT);
}

int main(void)
{
	for (int i = 0; i < NTHREAD; i++) {
		spawn(i);
	}

	uint64_t n = 0, next_log_ms = 0;

	while (true) {
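		/*
		 * Once a worker has run (its alive flag is set), join it and
		 * immediately recycle the same struct k_thread and stack;
		 * this tight create/join churn is what trips the race.
		 */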
		for (int i = 0; i < NTHREAD; i++) {
			if (alive[i]) {
				k_thread_join(&threads[i], K_FOREVER);
				alive[i] = false;
				spawn(i);
				n++;
			}
		}

		uint64_t now = k_uptime_get();

		if (now > next_log_ms) {
			printk("%llu joins in %llu ms\n", n, now);
			next_log_ms += LOG_INTERVAL_MS;
		}
	}

	return 0;
}
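
For anyone wanting to run this: dropping it into a copy of samples/hello_world and building with something like west build -b qemu_cortex_a53_smp, then west build -t run, should work; per the note above, CONFIG_LOG (and ideally CONFIG_ASSERT) needs to be enabled if you want the panic printed rather than a silent qemu exit.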