iree: Intermittent failure of iree/task/queue_test on arm64

Seeing this intermittently on arm64 linux runners.

The following tests FAILED: 97 - iree/task/queue_test (Failed)

[ RUN      ] QueueTest.TryStealAll
/work/runtime/src/iree/task/queue_test.cc:310: Failure
Expected equality of these values:
  &task_c
    Which is: 0xffffc6ae6260
  iree_task_queue_try_steal(&source_queue, &target_queue, 1000)
    Which is: NULL

/work/runtime/src/iree/task/queue_test.cc:311: Failure
Expected equality of these values:
  &task_d
    Which is: 0xffffc6ae6[220](https://github.com/openxla/iree/actions/runs/6802769949/job/18496627564?pr=15470#step:6:221)
  iree_task_queue_pop_front(&target_queue)
    Which is: NULL

/work/runtime/src/iree/task/queue_test.cc:316: Failure
Value of: iree_task_queue_is_empty(&source_queue)
  Actual: false
Expected: true

Looks like some memory model, atomic, mumble-mumble thing.

About this issue

  • Original URL
  • State: closed
  • Created 8 months ago
  • Comments: 46 (37 by maintainers)

Commits related to this issue

Most upvoted comments

@bjacob @benvanik Fix seem to work šŸ‘ Nice team effort getting this fixed. I learned a lot debugging this issue!

Yes, let’s make the test loop. If the impl was switched to compare_exchange_strong, that would amount to bringing that loop inside the impl. Instead we’re just adding the loop to the test. I think we can just let them loop indefinitely, just like compare_exchange_strong would. Worst case, there are test timeouts — but that would only happen if there’s a bug. Contention and implementation details would only result in a small number of loop iterations.

Wow, thanks for the debugging!

This has been a compare_exchange_weak ever since this file was first checked-in in 2020: https://github.com/openxla/iree/blob/24afcead3511e144bf4b55654a4c5b6930b75669/iree/base/synchronization.c

I added the comments much later as an explanation of this code. I thought, then, that the compare_exchange_weak was correct here. I will read your explanation and think about it – give me some time to catch up!

No worries! In C it’s always best to not rely on documentation for things like default initialization - that iree_task_list_split memset(0)'s the list does not mean we want to also not memset(0) it in the callers - compilers are better at eliding redundant memset(0)s than humans are at reading documentation and knowing when they need to do it themselves or not 😃

iree_task_queue_try_steal could be reworked to early exit if the try fails, but I wouldn’t change the API requirements of split (as that’s used elsewhere) - I believe iree_task_queue_try_steal did more in the past when the lock would fail (try stealing in a variety of ways, one of which was via the lock). I don’t know if I’d do any of that before figuring out the issue, though, as we have a reproducer of something that should work but isn’t and we don’t want to lose that.

I wasn’t suggesting a change to solve this problem, just flagging a potential improvement for later on. Also regarding the double initialization I was suggesting for it to be documented as part of split, thus removing the need for the initialization in the caller.

Unfortunately, I don’t see the issue - it’s still running 🤷

Two things I can think of doing next 1) look at the disassembly of the working / non-working queue test 2) rule out that the Ubuntu version has nothing to do with it. Will continue on Monday.