awkward: MacOS tests randomly segfault

Version of Awkward Array

HEAD

Description and code to reproduce

For a while now (weeks?), the MacOS tests/**/*.py have been failing with low probability: there’s about a 1 in 5 chance that one of the test jobs (Python 3.6, 3.7, 3.8, 3.9, and 3.10) dies with a segfault. Here is an example. When that happens, the exit signal may be one of several, such as 6 (SIGABRT), 10 (SIGBUS), or 11 (SIGSEGV). It often fails while running v2 tests, but that might just be because the v2 tests run after the v1 tests. The fact that it doesn’t fail consistently in the same test suggests that the signal is raised during a garbage-collection pass, while the actual memory error happens earlier.

If this really is due to a v1 segfault, it’s rare enough that we could even ignore it—there won’t be many more v1 versions released. If it’s due to something in v2, then we really must take care of it. (And if it’s narrowed down to an easy v1 issue, then we’d also want to fix it in that case.)

If this had been a Linux segfault, I would run the tests locally many times in order to reproduce it (e.g. run them 100 times to virtually guarantee hitting the P ≈ 1/25 case), then I would bisect the tests to find at least one that triggers it. It might be necessary to insert explicit gc.collect() calls to trigger it in the absence of many other tests. However, the bug only manifests itself on MacOS, and I don’t have access to a Mac to do that intensive testing on. (Doing it through CI would be awful because it would add the compilation time to each run.)
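To make the gc.collect() idea concrete, here is a minimal sketch of a hypothetical conftest.py fixture (not something in the repository today; the fixture name is made up) that forces a collection after every test, so a corrupted buffer is more likely to crash near the test that corrupted it rather than many tests later:

    import gc

    import pytest

    @pytest.fixture(autouse=True)
    def collect_after_each_test():
        # run the test, then force a full garbage collection so that any
        # dangling or corrupted buffer is touched immediately afterward
        yield
        gc.collect()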

Does anyone (@ianna, @ioanaif, @swishdiff, @agoose77, …) have a Mac and is willing to try that bisection search? Maybe my first question should be whether any Mac users have encountered this in their local tests (assuming you’ve been running enough tests recently).

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 18 (8 by maintainers)

Most upvoted comments

@jpivarski - as suggested, I put one test in a loop (a sketch of such a loop appears after the reproducer below): test_IndexedArray from

tests/v2/test_0404-array-validity-check.py

Occasional failures confirm that the array data gets corrupted (the Index of an IndexedOptionArray in this case):

>       assert to_list(indexedarray.unique(axis=-1)) == [
            [6.6, 7.7, 8.8, 9.9],
            [5.5],
            [3.3, 4.4],
            [],
            [0.0, 1.1, 2.2],
        ]
E       assert [None, [6.6, ...3.3, 4.4], []] == [[6.6, 7.7, 8....0, 1.1, 2.2]]
E         At index 0 diff: None != [6.6, 7.7, 8.8, 9.9]
E         Full diff:
E         - [[6.6, 7.7, 8.8, 9.9], [5.5], [3.3, 4.4], [], [0.0, 1.1, 2.2]]
E         ?                                             ---------------- -
E         + [None, [6.6, 7.7, 8.8, 9.9], [5.5], [3.3, 4.4], []]
E         ?  ++++++

tests/v2/test_0404-array-validity-check.py:701: AssertionError

Here is a short test that can reproduce it:

    import numpy as np
    import awkward as ak

    # to_list is assumed to live alongside from_iter, as the v2 test files use it
    to_list = ak._v2.operations.convert.to_list

    def orig_test_IndexedArray():
        listoffsetarray = ak._v2.operations.convert.from_iter(
            [[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]], highlevel=False
        )

        index = ak._v2.index.Index64(np.array([4, 3, 2, 1, 0], dtype=np.int64))
        indexedarray = ak._v2.contents.IndexedArray(index, listoffsetarray)
        assert to_list(indexedarray) == [
            [6.6, 7.7, 8.8, 9.9],
            [5.5],
            [3.3, 4.4],
            [],
            [0.0, 1.1, 2.2],
        ]
 
>       assert to_list(indexedarray.unique(axis=-1)) == [
            [6.6, 7.7, 8.8, 9.9],
            [5.5],
            [3.3, 4.4],
            [],
            [0.0, 1.1, 2.2],
        ]
E       assert [None, None, None, None, None] == [[6.6, 7.7, 8....0, 1.1, 2.2]]
E         At index 0 diff: None != [6.6, 7.7, 8.8, 9.9]
E         Full diff:
E         - [[6.6, 7.7, 8.8, 9.9], [5.5], [3.3, 4.4], [], [0.0, 1.1, 2.2]]
E         + [None, None, None, None, None]

tests/v2/test_0404-array-validity-check.py:701: AssertionError
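For anyone trying to reproduce this locally, the “one test in a loop” approach can be as simple as calling that reproducer repeatedly; this is just an illustration, not necessarily the exact loop used:

    for i in range(10_000):
        try:
            orig_test_IndexedArray()
        except AssertionError:
            # the data corruption is intermittent, so report which iteration hit it
            print(f"data corruption observed on iteration {i}")
            raise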

v2 arrays are also immutable. Once constructed, we should never overwrite any of their buffers, and the Python objects are replaced rather than modified in place.

The fact that the segfault is in v2 is, in another way, good news: we can eliminate a variety of ways that v1 can produce errors. v2’s NumPy-based buffer handling leaves significantly fewer avenues for segfaults than v1’s pointer handling. But a few remain:

  • v2 arrays built via from_iter (perhaps indirectly through the highlevel Array constructor) are supposed to be allocated and owned by NumPy (in pybind11 C++ code, calling NumPy to allocate and own the buffers here, exposed to Python here, and assembled into a v2 array here). The fact that we copy from unique pointers now makes this a lot safer than it used to be.
  • v2 arrays built via from_numpy (perhaps indirectly through the highlevel Array constructor) are definitely allocated and owned by NumPy; there’s no C++ involved.
  • v2 arrays are either modified by NumPy (through nplike), which is completely safe, or they’re modified by passing them to kernels via ctypes. That could be a problem if a kernel function oversteps its bounds without checking (a bug in the kernel), though that sort of thing should have been caught by v1. However, 🤷, maybe the bug has always been there and is only getting triggered now, due to slightly different tests or (more likely) slightly different memory layouts (since it’s platform-dependent and doesn’t even happen on all MacOS runs). Another way the ctypes call could go wrong is if NumPy deallocates the buffer while the kernel is still running, but I think Python’s argument refcounting rules prevent that (sys.getrefcount always returns at least 2), and we also follow a convention of only passing arguments into a kernel call as someindex.to(nplike), so the someindex name stays in scope beyond the lifetime of the kernel call anyway. (We never construct objects in the argument list. But like I said, I think Python has friendly semantics for that anyway.) There’s a minimal sketch of this keep-alive argument right after this list.
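To illustrate the keep-alive argument in the last bullet, here is a minimal sketch in plain NumPy + ctypes (a stand-in, not the actual Awkward kernel machinery; fake_kernel and call_kernel are made-up names): while a function is executing, each of its arguments carries at least two references, so the buffer behind the named argument can’t be collected in the middle of the call.

    import ctypes
    import sys

    import numpy as np

    def fake_kernel(ptr, length):
        # stands in for a real kernel reached through ctypes; it only sees a
        # raw pointer, so the NumPy array that owns the buffer must stay alive
        return sum(ptr[i] for i in range(length))

    def call_kernel(index):
        # 'index' is referenced by both the caller and this frame, so
        # sys.getrefcount(index) >= 2 for the duration of the call
        assert sys.getrefcount(index) >= 2
        ptr = index.ctypes.data_as(ctypes.POINTER(ctypes.c_int64))
        return fake_kernel(ptr, len(index))

    index = np.arange(5, dtype=np.int64)
    print(call_kernel(index))  # 10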

In v1, there are other ways to get segfaults, but I’m pretty sure the above is a complete list for v2. If you can narrow down which tests cause it, we have a short list of possible causes.

@ianna I think you said that “merge many” is where it failed before, when it failed. I had been thinking it was in a garbage-collection pass and not actually related to the mergemany method or this test. But if it always fails (when it fails) in test_0449-merge-many-arrays-in-one-pass.py, even though it enters that function with all garbage already collected, that suggests new information: it really is something in test_0449-merge-many-arrays-in-one-pass.py that’s the problem.

If you run just that one test file, test_0449-merge-many-arrays-in-one-pass.py, obscenely many times (10× or 100× your current runs, because it’s just one file), does it ever segfault? If so, it’s bracketed and we can narrow in.
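A hypothetical way to run that single file many times and stop at the first crash (the iteration count is arbitrary; on Unix, pytest exiting with a negative return code means it was killed by a signal):

    import subprocess

    for i in range(1000):
        result = subprocess.run(
            ["python", "-m", "pytest", "-x",
             "tests/v2/test_0449-merge-many-arrays-in-one-pass.py"]
        )
        if result.returncode < 0:
            # killed by a signal, e.g. -11 for SIGSEGV, -10 for SIGBUS
            print(f"crashed with signal {-result.returncode} on iteration {i}")
            break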