cmssw: testFWCoreUtilities fails in ARM IBs

Log: https://cmssdt.cern.ch/SDT/cgi-bin/logreader/cc8_aarch64_gcc9/CMSSW_12_0_X_2021-07-19-2300/unitTestLogs/FWCore/Utilities#/121-121

===== Test "testFWCoreUtilities" ====
Running ...............................F...

reusableobjectholder_t.cppunit.cpp:240:Assertion
Test name: reusableobjectholder_test::testSimultaneousUse
assertion failed
- Expression: t1ItemsSeen.size() > 0 && t1ItemsSeen.size() < 3

Failures !!!
Run: 34   Failure total: 1   Failures: 1   Errors: 0

---> test testFWCoreUtilities had ERRORS
TestTime:29
^^^^ End Test testFWCoreUtilities ^^^^

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 24 (21 by maintainers)

Most upvoted comments

I spent some time on this issue this morning, and I think I traced it to the oneTBB implementation of concurrent_queue::try_pop() (see include/oneapi/tbb/concurrent_queue.h#L180-L195):

    bool internal_try_pop( void* dst ) {
        ticket_type k;
        do {
            k = my_queue_representation->head_counter.load(std::memory_order_relaxed);
            do {
                if (static_cast<std::ptrdiff_t>(my_queue_representation->tail_counter.load(std::memory_order_relaxed) - k) <= 0) {
                    // Queue is empty
                    return false;
                }

                // Queue had item with ticket k when we looked. Attempt to get that item.
                // Another thread snatched the item, retry.
            } while (!my_queue_representation->head_counter.compare_exchange_strong(k, k + 1));
        } while (!my_queue_representation->choose(k).pop(dst, k, *my_queue_representation, my_allocator));
        return true;
    }

Because of the first std::memory_order_relaxed, try_pop may sometimes fail spuriously on architectures that have a relaxed memory model, like Power and ARM.

Changing it to std::memory_order_acquire and rebuilding the test fixed it for me on a Power 8 machine (the test ran successfully 20 times out of 20).
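To illustrate what an acquire load guarantees compared with a relaxed one, here is a minimal sketch. It is not the oneTBB code and it does not reproduce the failure above; the `Handoff` struct and `run_handoff` function are hypothetical names made up for this example. A producer writes a payload and then publishes it with a release store, and the consumer uses an acquire load on the counter, so that once it observes the new counter value the payload write is guaranteed to be visible.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical single-item handoff, NOT the oneTBB queue: the producer writes
// a payload and then publishes it by updating a counter.
struct Handoff {
    int payload = 0;                   // plain (non-atomic) data
    std::atomic<std::size_t> tail{0};  // number of items published so far
};

// Runs the handoff `rounds` times and returns how many times the consumer
// observed the expected payload after seeing the counter change.
std::size_t run_handoff(std::size_t rounds) {
    std::size_t ok = 0;
    for (std::size_t i = 0; i < rounds; ++i) {
        Handoff h;
        std::thread producer([&h] {
            h.payload = 42;                              // write the data first
            h.tail.store(1, std::memory_order_release);  // then publish it
        });
        // Acquire load: it synchronizes with the release store above, so once
        // tail == 1 is observed, the write to payload happens-before this read.
        // With memory_order_relaxed here, reading payload would be a data race.
        while (h.tail.load(std::memory_order_acquire) == 0) {
            std::this_thread::yield();
        }
        if (h.payload == 42) {
            ++ok;
        }
        producer.join();
    }
    return ok;
}
```

With the acquire/release pair in place, a check such as `assert(run_handoff(200) == 200);` should hold on any architecture, including weakly ordered ones like Power and ARM.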

I’m unable to reproduce the failure locally. It is likely related to the load on the ARM nodes (which isn’t too high at the moment). I’m thinking of enabling the printout at https://github.com/cms-sw/cmssw/blob/23f1aad119f27a6bfe1838e0c6f5c6f9f56f0b66/FWCore/Utilities/test/reusableobjectholder_t.cppunit.cpp#L240-L242 (but placed before the asserts) to see a bit more of what is going on when the asserts fail.
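A rough sketch of what printing before the asserts could look like is below. The helper names `describeItemsSeen` and `sizeInExpectedRange` are hypothetical, and the element type of the container is an assumption; the real `t1ItemsSeen` in the test may be declared differently. The only part taken from the log is the condition itself (`size() > 0 && size() < 3`).

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper (not part of the CMSSW test): formats how many distinct
// items a thread saw, plus their addresses, so the information can be printed
// before the assertion is evaluated. The element type is an assumption.
std::string describeItemsSeen(const std::string& label,
                              const std::set<const int*>& items) {
    std::ostringstream os;
    os << label << " size=" << items.size();
    for (const int* p : items) {
        os << ' ' << static_cast<const void*>(p);
    }
    return os.str();
}

// Mirrors the condition from the failing assertion in the log.
bool sizeInExpectedRange(const std::set<const int*>& items) {
    return items.size() > 0 && items.size() < 3;
}
```

In the test, the string returned by `describeItemsSeen("t1ItemsSeen", t1ItemsSeen)` would be written to `std::cout` just before the assertion, so that the log still shows the observed size when the assertion fails.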