enkiTS: Deadlock in SemaphoreWait (macOS)

I’ve run into a deadlock in enki::SemaphoreWait (TaskScheduler.cpp).

I call WaitforTask on a reduction operation (very similar to the ParallelSum example), but almost immediately the whole application hangs. When I pause in the debugger, I see the call stack below:

semaphore_wait_trap 0x000000019374fe90
_dispatch_sema4_wait 0x00000001935e099c
_dispatch_semaphore_wait_slow 0x00000001935e1050
[Inlined] enki::SemaphoreWait(enki::semaphoreid_t &) TaskScheduler.cpp:1381
enki::TaskScheduler::WaitForTaskCompletion(const enki::ICompletable *, unsigned int) TaskScheduler.cpp:742
enki::TaskScheduler::WaitforTask(const enki::ICompletable *, enki::TaskPriority) TaskScheduler.cpp:1002
...
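
For context, the calling pattern is roughly the following. This is a simplified sketch rather than my actual code (the real task is a reduction very close to enkiTS’s ParallelSum example), but it shows the shape of the WaitforTask call that hangs:

#include <atomic>
#include <cstdint>
#include "TaskScheduler.h"

enki::TaskScheduler g_TS;

// Simplified stand-in for the real reduction task: each range sums
// locally, then accumulates into a shared atomic.
struct SumTask : enki::ITaskSet
{
    SumTask( uint32_t size_ ) : ITaskSet( size_ ) {}
    std::atomic<uint64_t> m_Sum{ 0 };

    void ExecuteRange( enki::TaskSetPartition range_, uint32_t threadnum_ ) override
    {
        uint64_t localSum = 0;
        for( uint32_t i = range_.start; i < range_.end; ++i )
        {
            localSum += i;
        }
        m_Sum.fetch_add( localSum, std::memory_order_relaxed );
    }
};

int main()
{
    g_TS.Initialize();
    SumTask task( 1000000 );
    g_TS.AddTaskSetToPipe( &task );
    g_TS.WaitforTask( &task );   // <- the application hangs here
    g_TS.WaitforAllAndShutdown();
    return 0;
}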

I did a little poking around and found that I could stop the deadlock from occurring by modifying this line…

(TaskScheduler.cpp, line 630)

bool bCallWakeThreads = bWakeThreads_ && pTask_->m_WaitingForTaskCount.load( std::memory_order_acquire );

to…

bool bCallWakeThreads = bWakeThreads_ && pTask_->m_WaitingForTaskCount.load( std::memory_order_seq_cst );

Notice the change from std::memory_order_acquire to std::memory_order_seq_cst. I’m not sure this is the right fix, but it might give you a clue as to what’s going wrong.

As an experiment, I also tried changing…

dispatch_semaphore_wait( semaphoreid.sem, DISPATCH_TIME_FOREVER );

to…

dispatch_semaphore_wait( semaphoreid.sem, DISPATCH_TIME_NOW );

And with that change alone, I get a hitch, but no deadlock.
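
My understanding of why that changes the behaviour (a sketch only, not the enkiTS source): with DISPATCH_TIME_FOREVER the wait blocks until it is signalled, so a missed wake-up means the thread never resumes, whereas with DISPATCH_TIME_NOW the call returns immediately, so the surrounding wait loop effectively polls instead of sleeping, hiding a missed wake-up at the cost of a hitch.

#include <dispatch/dispatch.h>

// Sketch only - not the enkiTS implementation.
inline void SemaphoreWaitBlocking( dispatch_semaphore_t sem )
{
    // Blocks until another thread calls dispatch_semaphore_signal.
    // If the signal never arrives, the caller never wakes: deadlock.
    dispatch_semaphore_wait( sem, DISPATCH_TIME_FOREVER );
}

inline bool SemaphoreTryWait( dispatch_semaphore_t sem )
{
    // Returns immediately; zero means the semaphore was decremented,
    // non-zero means it was not. Called from a loop this becomes a poll,
    // which trades the deadlock for a hitch.
    return 0 == dispatch_semaphore_wait( sem, DISPATCH_TIME_NOW );
}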

If you have any thoughts on this, I’d be very grateful to hear them. For now, I’m going to make a fork with this change to unblock myself. I’ve tried reproducing it in a simpler application so I could share it, but unfortunately I wasn’t able to (it must be a gnarly timing thing). I’ve also only been able to reproduce this on the setup below; I tried running the same application with the same application state on Windows, but couldn’t reproduce it there.

OS: macOS Ventura 13.4
Hardware: MacBook Pro (M1 Max)

Thanks very much for your time, any questions feel free to give me a shout.

Update: the repro I have happens pretty much 100% of the time when running from CLion on macOS. For some reason it’s much harder to recreate when running from the command line; I’m not entirely sure why 🙈

Most upvoted comments

Excellent. I’m merging this into dev then main at the moment.

Many thanks for running this test!

I am a little puzzled, as memory_order_seq_cst is only sequentially consistent with other memory_order_seq_cst operations on other threads - yet in this case there are no other such operations on other threads except the threadState variable, which is unused by the TaskCompletion thread. So in this case memory_order_seq_cst and memory_order_acquire should be the same (since the operation is a load).

So I think memory_order_seq_cst fixing the problem is likely a side effect of the code the compiler generates for this.

I’ve pushed one further change which alters the fetch_sub read-modify-write operations on m_RunningCount to use memory_order_acq_rel. This should ensure that when m_WaitingForTaskCount is read by the thread completing the task, m_RunningCount has been decremented and is visible in the WaitForTaskCompletion thread in the correct order.
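
In other words, the intended ordering is roughly the following (a simplified sketch with stand-in names, not the actual TaskScheduler.cpp code):

#include <atomic>
#include <cstdint>

// Stand-ins for the members discussed above.
std::atomic<int32_t> runningCount{ 1 };         // m_RunningCount
std::atomic<int32_t> waitingForTaskCount{ 0 };  // m_WaitingForTaskCount

// Thread completing the task:
void OnTaskComplete()
{
    // acq_rel read-modify-write: the acquire half keeps the following load
    // of waitingForTaskCount from being reordered before the decrement.
    runningCount.fetch_sub( 1, std::memory_order_acq_rel );
    if( waitingForTaskCount.load( std::memory_order_acquire ) )
    {
        // WakeThreadsForTaskCompletion(); // signal the semaphore
    }
}

// Thread in WaitForTaskCompletion:
void WaitForCompletion()
{
    waitingForTaskCount.fetch_add( 1, std::memory_order_acq_rel );
    while( runningCount.load( std::memory_order_acquire ) != 0 )
    {
        // SemaphoreWait( ... ); // sleep until woken by OnTaskComplete
    }
    waitingForTaskCount.fetch_sub( 1, std::memory_order_release );
}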

Thanks - it does seem from the above that there isn’t a deadlock in your code.

I have a small example, close to what I was doing in the main app, that I can share if it would be useful (though unfortunately I couldn’t reproduce the problem with it).

It might be useful to see this, even if it can’t reproduce the problem.

One thing to check is that you haven’t deadlocked yourself. There are a number of ways this can occur, but the most common is WaitforTask running a task which itself calls WaitforTask on the parent. This can be avoided by using task priorities (both on the tasks and in the WaitforTask call, to prevent it running lower-priority tasks) or dependencies (dependencies are much safer and generally perform better). You should normally be able to spot this in a debugger by checking the thread callstacks.
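
As a minimal sketch of the pattern to look out for (illustrative only, not code from this issue):

#include "TaskScheduler.h"

enki::TaskScheduler g_TS;

struct ChildTask : enki::ITaskSet
{
    enki::ICompletable* pParent = nullptr;

    void ExecuteRange( enki::TaskSetPartition range_, uint32_t threadnum_ ) override
    {
        // Waiting on our own parent from inside a task. If the thread that
        // is waiting on the child picks this task up to run while it waits,
        // the two waits depend on each other and never return.
        g_TS.WaitforTask( pParent );
    }
};

struct ParentTask : enki::ITaskSet
{
    ChildTask child;

    void ExecuteRange( enki::TaskSetPartition range_, uint32_t threadnum_ ) override
    {
        child.pParent = this;
        g_TS.AddTaskSetToPipe( &child );
        g_TS.WaitforTask( &child );
    }
};

In a shape like this the fix is to restructure so a task never waits on its own parent, or to use the priorities/dependencies approach described above.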

Using std::memory_order_seq_cst might have fixed this by changing the timing in such a way that the deadlock doesn’t happen. However, I’ll also go over the m_WaitingForTaskCount memory-order semantics and check whether acquire/release is sufficient here.