tskit: crash inside tskit in SLiM's CI tests

The original issue is here:

https://github.com/MesserLab/SLiM/issues/334

So see that issue for details. The important bit is the crash log on GitHub Actions here:

https://github.com/MesserLab/SLiM/runs/7205270523?check_suite_focus=true

After it dawned on me (with illumination provided by @petrelharp) that the log in GitHub Actions actually provided a full backtrace already (I’m not used to Python-style backtraces :->), it became clear that the crash appears to be in variant stuff inside genotypes.py in tskit. It looks like it is triggered by for var in ts.variants(isolated_as_missing=False): at line 23 in SLiM’s test_consistency.py test script. Line numbers have shifted slightly, but the crash is somewhere vaguely around here: https://github.com/tskit-dev/tskit/blob/c12c1608aac8e2e826e53172c94c398d1972333b/python/tskit/genotypes.py#L237.

I believe @petrelharp is currently looking into this, but since it’s probably an issue with the new variant stuff, probably @benjeffery would be interested?

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 45 (38 by maintainers)

Most upvoted comments

OK, I’ve tracked down the problem. It turns out the types were wrong in a few places here with respect to the user_alleles_mem, but the key one was that we were copying 8 times too much data from the source buffer and that’s where the segfault came from (sizeof(char *) instead of sizeof(char)).
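
To make that concrete, here’s a hedged sketch of the copy pattern being described, using plain malloc/memcpy rather than the tsk_malloc/tsk_memcpy wrappers that show up in the backtrace; the variable names mirror the snippets quoted later in this thread, not the exact tskit source:

    /* Sketch only: flatten num_alleles strings into one character buffer.
     * The bug pattern was multiplying the copy length by sizeof(char *)
     * instead of sizeof(char), so 8x too many bytes were read from the
     * source allele. */
    #include <stdlib.h>
    #include <string.h>

    char *copy_alleles(char **alleles, const size_t *allele_lengths,
            size_t num_alleles)
    {
        size_t j, offset = 0, total_len = 0;
        char *user_alleles_mem;

        for (j = 0; j < num_alleles; j++) {
            total_len += allele_lengths[j];
        }
        /* The buffer holds characters, so the multiplier is sizeof(char). */
        user_alleles_mem = malloc(total_len * sizeof(char));
        if (user_alleles_mem == NULL) {
            return NULL;
        }
        for (j = 0; j < num_alleles; j++) {
            /* The segfault came from using sizeof(char *) here, which copies
             * 8x too many bytes from the source allele on a 64-bit platform. */
            memcpy(user_alleles_mem + offset, alleles[j],
                allele_lengths[j] * sizeof(char));
            offset += allele_lengths[j];
        }
        return user_alleles_mem;
    }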

I got a backtrace! I had to do

lldb /usr/local/miniconda/envs/anaconda-client-env/bin/python3 -- -m pytest test_consistency.py::TestWithMutations::test_ts_slim_consistency\[test_____sexual_nonwf.slim\]

and then, in lldb, run settings set target.disable-aslr false; then I got:

test_consistency.py Process 53035 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x7f9ae4ffd000)
    frame #0: 0x00007fff6eb07939 libsystem_platform.dylib`_platform_memmove$VARIANT$Nehalem + 281
libsystem_platform.dylib`_platform_memmove$VARIANT$Nehalem:
->  0x7fff6eb07939 <+281>: movups 0x20(%rsi,%rdx), %xmm7
    0x7fff6eb0793e <+286>: movups 0x30(%rsi,%rdx), %xmm8
    0x7fff6eb07944 <+292>: movaps %xmm1, (%rdi)
    0x7fff6eb07947 <+295>: movaps %xmm2, 0x10(%rdi)
Target 0: (python3) stopped.
(lldb) bt all
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x7f9ae4ffd000)
  * frame #0: 0x00007fff6eb07939 libsystem_platform.dylib`_platform_memmove$VARIANT$Nehalem + 281
    frame #1: 0x000000010233811e _tskit.cpython-38-darwin.so`tsk_memcpy + 14
    frame #2: 0x00000001023394a3 _tskit.cpython-38-darwin.so`tsk_variant_restricted_copy + 851
    frame #3: 0x00000001023317fc _tskit.cpython-38-darwin.so`Variant_restricted_copy + 76
    frame #4: 0x000000010186530f python3`method_vectorcall_NOARGS + 303
    frame #5: 0x000000010198d8bf python3`call_function + 159
    frame #6: 0x000000010198473a python3`_PyEval_EvalFrameDefault + 21658
    frame #7: 0x0000000101856625 python3`function_code_fastcall + 149
    frame #8: 0x000000010198d8bf python3`call_function + 159
    frame #9: 0x000000010198473a python3`_PyEval_EvalFrameDefault + 21658
    frame #10: 0x000000010186f6f8 python3`gen_send_ex + 168
    frame #11: 0x0000000101980202 python3`_PyEval_EvalFrameDefault + 3938
    frame #12: 0x000000010197d1c6 python3`_PyEval_EvalCodeWithName + 566
    frame #13: 0x0000000101856795 python3`_PyFunction_Vectorcall + 213
    frame #14: 0x000000010185a36a python3`method_vectorcall + 346
    frame #15: 0x0000000101855cf9 python3`PyVectorcall_Call + 121
    frame #16: 0x000000010198460e python3`_PyEval_EvalFrameDefault + 21358
    frame #17: 0x000000010197d1c6 python3`_PyEval_EvalCodeWithName + 566
    frame #18: 0x0000000101856795 python3`_PyFunction_Vectorcall + 213
    frame #19: 0x0000000101855cf9 python3`PyVectorcall_Call + 121
    frame #20: 0x000000010198460e python3`_PyEval_EvalFrameDefault + 21358
    frame #21: 0x000000010197d1c6 python3`_PyEval_EvalCodeWithName + 566
    frame #22: 0x0000000101856795 python3`_PyFunction_Vectorcall + 213
    frame #23: 0x000000010198d8bf python3`call_function + 159
    frame #24: 0x0000000101985b92 python3`_PyEval_EvalFrameDefault + 26866
    frame #25: 0x0000000101856625 python3`function_code_fastcall + 149
    frame #26: 0x000000010185a29d python3`method_vectorcall + 141
    frame #27: 0x000000010198d8bf python3`call_function + 159
    frame #28: 0x0000000101985b92 python3`_PyEval_EvalFrameDefault + 26866
    frame #29: 0x000000010197d1c6 python3`_PyEval_EvalCodeWithName + 566
    frame #30: 0x0000000101856795 python3`_PyFunction_Vectorcall + 213
    frame #31: 0x00000001018554c5 python3`_PyObject_FastCallDict + 229
    frame #32: 0x00000001018d752d python3`slot_tp_call + 173
    frame #33: 0x00000001018556c6 python3`_PyObject_MakeTpCall + 214
    frame #34: 0x000000010198d955 python3`call_function + 309
... <snip> ...
    frame #132: 0x00000001018554c5 python3`_PyObject_FastCallDict + 229
    frame #133: 0x00000001018d752d python3`slot_tp_call + 173
    frame #134: 0x00000001018556c6 python3`_PyObject_MakeTpCall + 214
    frame #135: 0x000000010198d955 python3`call_function + 309
    frame #136: 0x0000000101980d06 python3`_PyEval_EvalFrameDefault + 6758
    frame #137: 0x0000000101856625 python3`function_code_fastcall + 149
    frame #138: 0x000000010198d8bf python3`call_function + 159
    frame #139: 0x0000000101984407 python3`_PyEval_EvalFrameDefault + 20839
    frame #140: 0x0000000101856625 python3`function_code_fastcall + 149
    frame #141: 0x000000010198d8bf python3`call_function + 159
    frame #142: 0x0000000101985b92 python3`_PyEval_EvalFrameDefault + 26866
    frame #143: 0x000000010197d1c6 python3`_PyEval_EvalCodeWithName + 566
    frame #144: 0x0000000101977d62 python3`builtin_exec + 322
    frame #145: 0x00000001018b1274 python3`cfunction_vectorcall_FASTCALL + 84
    frame #146: 0x000000010198d8bf python3`call_function + 159
    frame #147: 0x0000000101984407 python3`_PyEval_EvalFrameDefault + 20839
    frame #148: 0x000000010197d1c6 python3`_PyEval_EvalCodeWithName + 566
    frame #149: 0x0000000101856795 python3`_PyFunction_Vectorcall + 213
    frame #150: 0x000000010198d8bf python3`call_function + 159
    frame #151: 0x0000000101984407 python3`_PyEval_EvalFrameDefault + 20839
    frame #152: 0x000000010197d1c6 python3`_PyEval_EvalCodeWithName + 566
    frame #153: 0x0000000101856795 python3`_PyFunction_Vectorcall + 213
    frame #154: 0x0000000101855cf9 python3`PyVectorcall_Call + 121
    frame #155: 0x0000000101a1665b python3`pymain_run_module + 235
    frame #156: 0x0000000101a1607e python3`pymain_run_python + 238
    frame #157: 0x0000000101a15f45 python3`Py_RunMain + 37
    frame #158: 0x0000000101a175c0 python3`pymain_main + 64
    frame #159: 0x0000000101829b88 python3`main + 56
    frame #160: 0x00007fff6e90acc9 libdyld.dylib`start + 1
  thread #2
    frame #0: 0x00007fff6ea4e872 libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff6eb13425 libsystem_pthread.dylib`_pthread_cond_wait + 698
    frame #2: 0x0000000107031d07 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 471
    frame #3: 0x0000000107012298 libomp.dylib`kmp_flag_64<false, true>::wait(kmp_info*, int, void*) + 1576
    frame #4: 0x000000010700ce1c libomp.dylib`__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) + 188
    frame #5: 0x0000000107011058 libomp.dylib`__kmp_fork_barrier(int, int) + 584
    frame #6: 0x0000000106fe770a libomp.dylib`__kmp_launch_thread + 330
    frame #7: 0x000000010703029a libomp.dylib`__kmp_launch_worker(void*) + 330
    frame #8: 0x00007fff6eb13109 libsystem_pthread.dylib`_pthread_start + 148
    frame #9: 0x00007fff6eb0eb8b libsystem_pthread.dylib`thread_start + 15
  thread #3
    frame #0: 0x00007fff6ea4e872 libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff6eb13425 libsystem_pthread.dylib`_pthread_cond_wait + 698
    frame #2: 0x0000000107031d07 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 471
    frame #3: 0x0000000107012298 libomp.dylib`kmp_flag_64<false, true>::wait(kmp_info*, int, void*) + 1576
    frame #4: 0x000000010700ce1c libomp.dylib`__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) + 188
    frame #5: 0x0000000107011058 libomp.dylib`__kmp_fork_barrier(int, int) + 584
    frame #6: 0x0000000106fe770a libomp.dylib`__kmp_launch_thread + 330
    frame #7: 0x000000010703029a libomp.dylib`__kmp_launch_worker(void*) + 330
    frame #8: 0x00007fff6eb13109 libsystem_pthread.dylib`_pthread_start + 148
    frame #9: 0x00007fff6eb0eb8b libsystem_pthread.dylib`thread_start + 15

I didn’t really do much besides complain to @petrelharp about how much I dislike Python. :->

Fix over in #2437. Thanks for all the hard work tracking this down @petrelharp and @bhaller!

Um, same spot, I think? But I think this will be a lot more obvious to Ben when he looks at this.

(Officially on holiday, but couldn’t help chipping in here as this is all my fault!)

No worries! I get to feel like I can do something in C. =) Enjoy the vacation!

But, for when you’re back, some more info: I tried changing the sizeof( ) in both the malloc( ) and the tsk_memcpy( ), and that didn’t change things (and, looking through it, the code doesn’t seem wrong? although it’s not clear to me why it should be char * and not char). However, if I leave the malloc alone but change the copy to

        tsk_memcpy(other->user_alleles_mem + offset, self->alleles[j],
            other->allele_lengths[j] * sizeof(char));

(i.e. char, not char *, which is 8x as large), then it runs okay. This still doesn’t seem right to me, so I’m not going to file a PR.
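
To put a number on the “8x”: on a typical 64-bit platform a pointer is 8 bytes and a char is 1 byte, so multiplying the copy length by the wrong sizeof over-reads by a factor of 8. A trivial stand-alone illustration (not tskit code):

    #include <stdio.h>

    int main(void)
    {
        size_t len = 12;                       /* e.g. an allele 12 characters long */
        printf("%zu\n", len * sizeof(char));   /* 12 bytes: the right amount to copy */
        printf("%zu\n", len * sizeof(char *)); /* 96 bytes: reads far past the end
                                                  of the source allele */
        return 0;
    }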

However:

You might try changing:

    other->user_alleles_mem = tsk_malloc(total_len * sizeof(char *));

to

    other->user_alleles_mem = tsk_malloc(total_len * sizeof(char *) + 10000);

just to see if that avoids the crash. If it does, the size of the dest is the problem; if it doesn’t, it’s the size of the src.

This still crashes, though? And even changing this to other->user_alleles_mem = tsk_malloc(100 * total_len * sizeof(char *)); doesn’t fix it. I’m confused!

Ah, good point, there is a complication here. At genotypes:81 the strings are null-terminated, since tsk_variant_init takes an array of null-terminated strings, but here I don’t think they are, as they can come from the “non-user” alleles. I put the memcpy here instead because I realised that at the time, but I think I’ve gotten the type wrong in the sizeof.
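
To illustrate the distinction being drawn (a generic sketch, not the tskit source): user-supplied alleles arrive null-terminated, so their length can be recovered with strlen( ), whereas the “non-user” alleles may have no terminator, so the copy has to rely on a stored length, multiplied by sizeof(char):

    #include <string.h>

    /* dest must have room for at least strlen(user) + non_user_len bytes. */
    void copy_alleles_example(char *dest, const char *user,
            const char *non_user, size_t non_user_len)
    {
        size_t user_len = strlen(user);   /* OK: a terminator is present */
        memcpy(dest, user, user_len);
        /* No terminator to search for here, so use the recorded length,
         * and multiply by sizeof(char) (i.e. 1), not sizeof(char *). */
        memcpy(dest + user_len, non_user, non_user_len * sizeof(char));
    }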

Any way that a -1 could end up getting converted to an unsigned that then gets passed to that particular malloc? That would trigger the failure, right?

No, it’s a fixed size (sizeof(tsk_variant_t)), so that couldn’t happen. It’s also a very small malloc, so super unlikely to be one that fails. I’ll commit in the fix anyway, though, as you say.
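
(For reference, the failure mode being asked about above looks like this in isolation; a generic sketch, not tskit code: a negative length silently converted to an unsigned size becomes an enormous allocation request, which malloc then refuses.)

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int length = -1;
        /* (size_t) -1 is SIZE_MAX, so this asks for ~18 exabytes on a 64-bit
         * platform; malloc returns NULL rather than crashing. */
        void *p = malloc((size_t) length);
        printf("%zu -> %p\n", (size_t) length, p);
        free(p);
        return 0;
    }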

And, the mac in question isn’t ARM, it’s x86 (according to uname -a).

this is on a (macos-10.15, 3.8) github actions instance, and uname -a says

just chiming in to say macos 10.15 is pretty darn old at this point…

Yes; but this bug probably exists on other platforms too, it just doesn’t happen to get triggered there. My guess is that it’s probably an uninitialized pointer or value, which often just happens to be zero, but which is sometimes non-zero and then boom. On some platforms that may be less likely to go boom, because of the platform’s kernel memory policies etc., but it’s still a bug that needs to be found and fixed. I would be very surprised if the bug were actually platform-specific, in the sense that there turns out to be no bug in the tskit code, only a bug in macOS 10.15 that tskit collides with.

In any case, we have many users on old versions of macOS; SLiM supports back to macOS 10.13 IIRC, and I still get people complaining that they don’t want to upgrade. :-> People start a research project on a particular software stack, and they don’t want to touch that software stack until their project is completed. (Of course they should then not be updating to the latest tskit either, but… well, users, whaddya gonna do.)

I’ve verified that the segfault happens with the main branch of tskit.

But after the segfault appeared pretty reliably (once every 2 or 3 times I ran that script), now, in the same session, I can’t get it to recur. No idea why. I don’t know where to go from here.

Ok, I can get the segfault to happen while ssh’ed in. If the tree sequence produced by the script test_____sexual_nonwf.slim (attached; not really a .txt file: test.trees.txt) is saved as test.trees and then this script is run:

import tskit

ts = tskit.load("test.trees")

for var in ts.variants(isolated_as_missing=False):
    print(var)
    for k in ts.samples():
        geno = var.genotypes[k]
        msp_genotypes = var.alleles[geno].split(",")
        print("msp:", msp_genotypes)

then we get a segfault a bit less than half of the time. This is on a (macos-10.15, 3.8) github actions instance, and uname -a says

Darwin Mac-1658444984198.local 19.6.0 Darwin Kernel Version 19.6.0: Mon Apr 18 21:50:40 PDT 2022; root:xnu-6153.141.62~1/RELEASE_X86_64 x86_64

I cannot reproduce this on the mac I have access to.

To reproduce this while ssh’ing in, what needs to be done is:

source ~/.profile
set +e
set +o pipefail
cd Release
export PATH=$PWD:$PATH
cd ../treerec/tests
python3 -m pytest test_consistency.py::TestWithMutations::test_ts_slim_consistency[test_____sexual_nonwf.slim]

… repeating that last one until it fails.