scipy: eigh() tests fail to pass, crash Python with seemingly ramdom pattern

This problem is related to #11601, which has been closed by #11702 ( @ilayn ). However, the crash has not been fixed by the latter PR.

The symptoms remained almost identical to the one described in my comment in https://github.com/scipy/scipy/issues/11601#issuecomment-600153321

In summary, when running the test for eigh(), Python tends to crash with SIGSEGV or SIGABRT. Sometimes this happens during the test_eigh() function, sometimes after it passed with “100%” but before pytest returns.

The test that triggers the crash is the following test function:

https://github.com/scipy/scipy/blob/ae34ce4835949a8310d7c3d7bcb4a55aafd11f4f/scipy/linalg/tests/test_decomp.py#L863-L888

Some patterns from the histories of crashes

I run the test script with runtests.py 100 times and saved the output as text files.

By grepping the output files ./runtests.py, I notice that the last-known position in Python before it crashes could be three lines, namely 873, 876, and 877. L 873 is the actual call to eigh(), while the crash can happen as late as 876 or 877, where the arrays returned from eigh() are accessed.

Only 6 out of 100 runs passed without any problems.

In some cases (35 out of the 100), Python segfaults after nominally completing all the tests in TestEigh::test_eigh.

In the cases where Python was killed with SIGABRT, 36 were at L 873 (call to eigh()), while 9 were at L 876 where output z was used. In many other runs, the test script was not featured in the Python backtrace if any.

The parametrized inputs that triggered the crash were of the form test_eigh[6-D-XXX-YYY-ZZZ-eigvals1]. That is, the crashes happened for dimension 6, dtype double complex, with eigvals= keyword parameter set to the tuple (2, 4). The XXXZZZ parameters are boolean flags for keywords turbo, lower, and overwrite respectively.

An incomplete tally of the parameters (turbo, lower, and overwrite), where Python crashed before finishing all the tests, is as follows:

   5 False-False-False
  11 False-False-True
  13 False-True-False
   6 False-True-True
   7 True-False-False
   4 True-False-True
  15 True-True-True

The combination (turbo=True, lower=True, overwrite=False) is the one missing from the 2^3 = 8 cases yet.

Reproducing code example:

./runtests.py -vt scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh

Scipy/Numpy/Python version information:

Scipy master branch as of ae34ce48, Numpy 1.18.1, Python 3.7.6, conda macos with MKL 2019.4.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 52 (52 by maintainers)

Commits related to this issue

Most upvoted comments

Intel team confirmed the bug and included the fix for the upcoming MKL 2020 update 2.

This may well be more noise, but just in case it may help, I created a stripped down version in pure C that mostly performs the LAPACK operations done by the Python test snippet, in the hope that one may verify whether this is an underlying C-level MKL issue. The program appears to run fine without any problem. I ran the program in loops that repeat 5000 times per loop, and I have yet to run into a crash.


The C program is as follows:

#include <stdio.h>
#include "mkl.h"


#define ARR_SIZE	(36)
#define ARR_DIM		(6)
#define ARR_SUPP	(12)
#define ARR_C		(18)


static void print_matrix_colmajor(const lapack_complex_double * restrict a,
                                  lapack_int nr, lapack_int nc)
{
    int i, j;
    for ( i = 0; i < nr; ++i )
    {
	for ( j = 0; j < nc; ++j )
	{
	    lapack_complex_double ze;
	    int k;
	    k = i + j * nr;
	    ze = a[k];
	    printf(" %# .8f%+.8gj", ze.real, ze.imag);
	}
	printf(",\n");
    }
}


int main(int argc, char * argv[])
{
    /* Initial input matrix data. */
    const double ar[ARR_SIZE] =
    {
	0.57830695, 0.57732626, 0.50339199, 0.49955715, 0.65422828, 0.39991909,
	0.57732626, 0.26035319, 0.98113963, 0.63757905, 0.79092958, 0.72842587,
	0.50339199, 0.98113963, 0.26858524, 0.61785254, 0.62372901, 0.63216722,
	0.49955715, 0.63757905, 0.61785254, 0.78260145, 0.77017289, 0.82255765,
	0.65422828, 0.79092958, 0.62372901, 0.77017289, 0.78081655, 0.50652208,
	0.39991909, 0.72842587, 0.63216722, 0.82255765, 0.50652208, 0.00767085
    };
    /* Complexified input matrix */
    lapack_complex_double a[ARR_SIZE];
    lapack_complex_double ap[ARR_SIZE];
    int i;

    /* Create the input matrix with the correct type that will be fed to
     * zheevr() */
    for ( i = 0; i < ARR_SIZE; ++i )
    {
	a[i].real = ar[i];
	a[i].imag = 0.0;
	ap[i].real = ar[i];
	ap[i].imag = 0.0;
    }

    lapack_int m, info;
    double w[ARR_DIM];
    lapack_complex_double z[ARR_SIZE];
    lapack_int isuppz[ARR_SUPP];

    info = LAPACKE_zheevr(LAPACK_COL_MAJOR, 'V', 'I', 'U', ARR_DIM, a, ARR_DIM,
	    0.0, 0.0, 3, 5, 0.0, &m, w, z, ARR_DIM, isuppz);

    printf("info: %d\n", info);
    printf("m: %d\n", m);
    printf("w: [");
    for ( i = 0; i < m; ++i )
    {
	printf(" %# .8f", w[i]);
    }
    printf("]\n");

    printf("z: [");
    print_matrix_colmajor(z, ARR_DIM, m);
    printf("]\n");

    /* Calculate the vector-matrix dot product z.H @ a */
    lapack_complex_double alpha = {1.0, 0.0};
    lapack_complex_double beta = {0.0, 0.0};
    lapack_complex_double c[ARR_C];

    cblas_zgemm(CblasColMajor, CblasConjTrans, CblasNoTrans,
	    m, ARR_DIM, ARR_DIM,
	    &alpha, z, ARR_DIM, ap, ARR_DIM,
	    &beta, c, m);

    printf("c = z.H @ a: [");
    print_matrix_colmajor(c, m, ARR_DIM);
    printf("]\n");
}

Running the compiled program linked against MKL 2019.4 gives the following output

info: 0
m: 3
w: [ -0.01350094  0.07149590  0.24539661]
z: [ -0.60090663+0j  0.17259041+0j  0.69638956+0j,
  0.11969248+0j -0.46110847+0j  0.02372352+0j,
  0.04234416+0j -0.56691172+0j -0.03759849+0j,
 -0.28507343+0j  0.49136360+0j -0.58940264+0j,
  0.71896113+0j  0.41897405+0j  0.28269445+0j,
 -0.15690741+0j -0.13865491+0j -0.29283701+0j,
]
c = z.H @ a: [  0.00811281+0j -0.00161596+0j -0.00057169+0j  0.00384876+0j -0.00970665+0j  0.00211840+0j,
  0.01233951+0j -0.03296736+0j -0.04053186+0j  0.03513048+0j  0.02995493+0j -0.00991326+0j,
  0.17089164+0j  0.00582167+0j -0.00922654+0j -0.14463741+0j  0.06937226+0j -0.07186121+0j,
]

Not the failures but actually the ones that pass makes me worried more. This signals that I have to dig in to the LAPACK wrappers instead. Because this really doesn’t make any sense.

The returned array sizes are almost random. I don’t see any underflow pattern hence probably they are random memory values which implies f2py is not returning a proper object which in turn our wrappers are not correct.

These tests were converted to pytest decorators very recently and maybe we have discovered a bug that were not tested properly. I’ll check this properly on a Linux box (I’m on Win10) when I can.

Note sure if this is helpful but posting:

(scipy-dev) Lucass-MacBook:scipy rlucas$ ./runtests.py -vt scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh

... elided bulk of passing tests output ...
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-True-True-None] PASSED                                                                                       [ 89%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-True-True-eigvals1] FAILED                                                                                   [ 90%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-True-False-None] PASSED                                                                                      [ 92%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-True-False-eigvals1] PASSED                                                                                  [ 93%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-False-True-None] PASSED                                                                                      [ 95%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-False-True-eigvals1] PASSED                                                                                  [ 96%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-False-False-None] PASSED                                                                                     [ 98%]
scipy/linalg/tests/test_decomp.py::TestEigh::test_eigh[6-D-False-False-False-eigvals1] PASSED                                                                                 [100%]

===================================================================================== FAILURES ======================================================================================
_________________________________________________________________ TestEigh.test_eigh[6-D-False-True-True-eigvals1] __________________________________________________________________
scipy/linalg/tests/test_decomp.py:876: in test_eigh
    diag_ = diag(z.T.conj() @ a @ z).real
E   ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size -2308328180370374638 is different from 6)
======================================================================== 1 failed, 63 passed in 0.99 seconds ========================================================================
(scipy-dev) Lucass-MacBook:scipy rlucas$