iree: Intermittent failure of iree/task/queue_test on arm64
Seeing this intermittently on arm64 linux runners.
The following tests FAILED: 97 - iree/task/queue_test (Failed)
[ RUN ] QueueTest.TryStealAll
/work/runtime/src/iree/task/queue_test.cc:310: Failure
Expected equality of these values:
&task_c
Which is: 0xffffc6ae6260
iree_task_queue_try_steal(&source_queue, &target_queue, 1000)
Which is: NULL
/work/runtime/src/iree/task/queue_test.cc:311: Failure
Expected equality of these values:
&task_d
Which is: 0xffffc6ae6[220](https://github.com/openxla/iree/actions/runs/6802769949/job/18496627564?pr=15470#step:6:221)
iree_task_queue_pop_front(&target_queue)
Which is: NULL
/work/runtime/src/iree/task/queue_test.cc:316: Failure
Value of: iree_task_queue_is_empty(&source_queue)
Actual: false
Expected: true
Looks like some memory model, atomic, mumble-mumble thing.
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Comments: 46 (37 by maintainers)
Commits related to this issue
- last-ditch effort to reproduce #15488 — committed to bjacob/iree by bjacob 8 months ago
- Fix intermittent failure - functions with `_try_` in their name may fail spuriously. (#15636) Fixes #15488. This is the fix discussed there. Hat tip to @RoboTux and @freddan80 for the tenacious d... — committed to iree-org/iree by bjacob 7 months ago
- Fix intermittent failure - functions with `_try_` in their name may fail spuriously. (#15636) Fixes #15488. This is the fix discussed there. Hat tip to @RoboTux and @freddan80 for the tenacious d... — committed to ramiro050/iree by bjacob 7 months ago
@bjacob @benvanik Fix seem to work š Nice team effort getting this fixed. I learned a lot debugging this issue!
Yes, letās make the test loop. If the impl was switched to
compare_exchange_strong
, that would amount to bringing that loop inside the impl. Instead weāre just adding the loop to the test. I think we can just let them loop indefinitely, just likecompare_exchange_strong
would. Worst case, there are test timeouts ā but that would only happen if thereās a bug. Contention and implementation details would only result in a small number of loop iterations.Wow, thanks for the debugging!
This has been a compare_exchange_weak ever since this file was first checked-in in 2020: https://github.com/openxla/iree/blob/24afcead3511e144bf4b55654a4c5b6930b75669/iree/base/synchronization.c
I added the comments much later as an explanation of this code. I thought, then, that the compare_exchange_weak was correct here. I will read your explanation and think about it ā give me some time to catch up!
No worries! In C itās always best to not rely on documentation for things like default initialization - that iree_task_list_split memset(0)'s the list does not mean we want to also not memset(0) it in the callers - compilers are better at eliding redundant memset(0)s than humans are at reading documentation and knowing when they need to do it themselves or not š
I wasnāt suggesting a change to solve this problem, just flagging a potential improvement for later on. Also regarding the double initialization I was suggesting for it to be documented as part of split, thus removing the need for the initialization in the caller.
Unfortunately, I donāt see the issue - itās still running š¤·
Two things I can think of doing next 1) look at the disassembly of the working / non-working queue test 2) rule out that the Ubuntu version has nothing to do with it. Will continue on Monday.