sunpy: Intermittent image-rotation test failures on OS X when using conda

We have intermittent failures of our image rotation tests, and these failures appear to be isolated to OS X when using conda(-forge). #4235 added raw output for when these failures occur, and some of the output is truly bizarre. I will add investigative stuff in separate posts. My current conjecture is that there is nothing wrong with SunPy code; rather, a C extension in the numpy/scipy/scikit-image ecosystem is not being compiled for conda(-forge) with the correct compile options for OS X, which leaves an intermittent potential for bad memory access of arrays.


Edit: go down to https://github.com/sunpy/sunpy/issues/4290#issuecomment-676573472 for a summary of the current understanding

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 16 (14 by maintainers)

Most upvoted comments

Latest facts

  • Test failures do not occur at all when using MKL libraries instead of OpenBLAS libraries.
  • Test failures are significantly more likely when using multiple workers in pytest (i.e., pytest-xdist), but they can still occur with only a single worker (specifying -n 1 or -n 0).
  • While the test failures occur in our function calling a scikit-image function, the actual line that can produce incorrect results is a NumPy matrix multiplication of a large matrix, which then calls OpenBLAS.
    • All the mismatches look like indexing problems because this matrix multiplication is the transformation of the indices. If this matrix multiplication is corrupted, the image array ends up being incorrectly accessed.
  • Unlike MKL, OpenBLAS has been known to have thread-safety issues: when called from programs with multiple threads, there can be conflicts because OpenBLAS is typically configured to also use multithreading internally.
    • The commonly suggested workaround is to set the environment variable OMP_NUM_THREADS=1, which configures OpenBLAS to use only a single thread. (Incidentally, the environment variable OPENBLAS_NUM_THREADS=1 would also work if OpenBLAS were compiled to use pthreads, but conda-forge compiles OpenBLAS to use OpenMP threads. For whatever reason, OMP_NUM_THREADS=1 works regardless of which type of threads OpenBLAS has been compiled to use.) Restricting OpenBLAS to a single thread can hurt performance, and because the setting is an environment variable, it can also affect other libraries.
      • Setting this environment variable does in fact result in ZERO test failures for us!
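
To make the "corrupted multiplication looks like an indexing problem" point concrete, here is a minimal sketch. It is not the scikit-image implementation: the helper names (pixel_coords, resample) and the corrupt argument are hypothetical, the lookup is nearest-neighbour only, and the corruption is simulated by zeroing a segment of the product rather than by triggering any real OpenBLAS fault.

```python
import numpy as np


def pixel_coords(shape):
    """Return an N x 3 array of homogeneous (x, y, 1) coordinates, one row per pixel."""
    rows, cols = np.indices(shape)
    return np.column_stack([cols.ravel(), rows.ravel(), np.ones(rows.size)])


def resample(image, matrix, corrupt=None):
    """Nearest-neighbour resampling driven by a 3 x 3 homogeneous transform.

    `corrupt` is a hypothetical knob (a slice of pixel indices) that overwrites
    part of the matrix product to mimic a bad result coming back from BLAS.
    """
    # The (N x 3) @ (3 x 3) product is the step that gets handed to the BLAS library.
    coords = pixel_coords(image.shape) @ matrix.T
    if corrupt is not None:
        coords[corrupt, :2] = 0.0  # simulated corruption of one segment
    x = np.clip(np.rint(coords[:, 0]).astype(int), 0, image.shape[1] - 1)
    y = np.clip(np.rint(coords[:, 1]).astype(int), 0, image.shape[0] - 1)
    # The transformed coordinates are used purely as indices into the image.
    return image[y, x].reshape(image.shape)


image = np.arange(16.0).reshape(4, 4)
identity = np.eye(3)  # identity transform, so the correct output equals the input

good = resample(image, identity)
bad = resample(image, identity, corrupt=slice(4, 8))  # damage the second row's coordinates
```

With the identity transform, good matches image exactly, while bad differs only in the row whose coordinates were damaged: every pixel there is read from the wrong location. The image data itself is never touched, which is why this kind of failure presents as bad array access rather than bad arithmetic on pixel values.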

Conjecture

  • Something particular to the OS X test setup is causing the tests to run with multithreading, which then leads to the bad behavior when OpenBLAS is also multithreading.

For tests

There are two ways to get our OS X conda tests to pass reliably:

  • Add the environment variable OMP_NUM_THREADS=1
  • Force the BLAS libraries to be the MKL libraries
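
The first option can be sketched as a small launcher that runs the test suite in a child process with the variable set. This is only an illustration using the standard library: the helper names single_threaded_env and run_tests are hypothetical, and the default "-n auto" argument assumes pytest-xdist is installed. The variable is set in the child's environment so that it is in place before NumPy and OpenBLAS are loaded there.

```python
import os
import subprocess
import sys


def single_threaded_env(base=None):
    """Return a copy of the environment with OMP_NUM_THREADS forced to 1."""
    env = dict(os.environ if base is None else base)
    env["OMP_NUM_THREADS"] = "1"
    return env


def run_tests(args=("-n", "auto")):
    """Run pytest in a child process that sees OMP_NUM_THREADS=1.

    The default arguments assume pytest-xdist is available; pass args=()
    to run without parallel workers.
    """
    cmd = [sys.executable, "-m", "pytest", *args]
    return subprocess.run(cmd, env=single_threaded_env(), check=False)
```

Calling run_tests() leaves the parent process's environment untouched, so the single-thread restriction stays confined to the test run instead of leaking into other work in the same session.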

For users

I still don’t know if a typical user can ever trigger these errors outside of running tests. But, if it is possible, corrupted matrix multiplications are indisputably bad. I’m inclined to force the MKL libraries as an explicit dependency, because disabling multithreading seems like far too sweeping of a fix, with the potential of doing far more harm than good.

Arrays should be defaulting to row-major and C contiguous, yet these problem segments are both columns. I don’t know what to make of that.