tskit: crash inside tskit in SLiM's CI tests
The original issue is here:
https://github.com/MesserLab/SLiM/issues/334
So see that issue for details. The important bit is the crash log on GitHub Actions here:
https://github.com/MesserLab/SLiM/runs/7205270523?check_suite_focus=true
After it dawned on me (with illumination provided by @petrelharp) that the log in GitHub Actions actually provided a full backtrace already (I’m not used to Python-style backtraces :->), it became clear that the crash appears to be in variant stuff inside genotypes.py in tskit. It looks like it is triggered by for var in ts.variants(isolated_as_missing=False): at line 23 in SLiM’s test_consistency.py test script. Line numbers have shifted slightly, but the crash is somewhere vaguely around here: https://github.com/tskit-dev/tskit/blob/c12c1608aac8e2e826e53172c94c398d1972333b/python/tskit/genotypes.py#L237.
I believe @petrelharp is currently looking into this, but since it’s probably an issue with the new variant stuff, probably @benjeffery would be interested?
About this issue
- State: closed
- Created 2 years ago
- Comments: 45 (38 by maintainers)
Commits related to this issue
- Fix segfault in tsk_variant_restricted_copy. Incorrect buffer size calculations caused segfault when long alleles are present. Closes #2429. Committed to jeromekelleher/tskit by jeromekelleher 2 years ago
- Fix segfault in tsk_variant_restricted_copy. Incorrect buffer size calculations caused segfault when long alleles are present. Closes #2429. Committed to tskit-dev/tskit by jeromekelleher 2 years ago
OK, I’ve tracked down the problem. It turns out the types were wrong in a few places here with respect to the user_alleles_mem, but the key one was that we were copying 8 times too much data from the source buffer, and that’s where the segfault came from (sizeof(char *) instead of sizeof(char)).
I got a backtrace! I had to do
and then in lldb do settings set target.disable-aslr false, then I got:
I didn’t really do much besides complain to @petrelharp about how much I dislike Python. :->
Fix over in #2437. Thanks for all the hard work tracking this down @petrelharp and @bhaller!
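To make the sizeof mix-up described above concrete, here is a rough sketch of the arithmetic. It is not the tskit code: copy_sizes and allele_lengths are hypothetical names, and ctypes is used only to ask the current platform how big a char and a char pointer are. On a typical 64-bit build the pointer is 8 bytes, which is where the "8 times too much" figure comes from.

```python
import ctypes


def copy_sizes(allele_lengths):
    """Return (correct_bytes, buggy_bytes, overread_bytes) for one copy.

    allele_lengths is a hypothetical list of allele string lengths; the
    real C code tracks a total length rather than a Python list.
    """
    total_len = sum(allele_lengths)
    # What the copy should use: one byte per character.
    correct = total_len * ctypes.sizeof(ctypes.c_char)
    # What the mistaken version used: one pointer-sized unit per character.
    buggy = total_len * ctypes.sizeof(ctypes.c_char_p)
    return correct, buggy, buggy - correct


# Short alleles over-read only a few bytes; a long allele over-reads far more.
for lengths in ([1, 1], [1, 4000]):
    correct, buggy, overread = copy_sizes(lengths)
    print(f"alleles={lengths}: should copy {correct} bytes, "
          f"copied {buggy}, read {overread} bytes past the source")
```

The over-read grows with the total allele length, which would fit the commit message's note that the segfault shows up when long alleles are present: a small over-read often lands in mapped memory and goes unnoticed, while a large one is more likely to run off the end of it.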
Um, same spot, I think? But I think this will be a lot more obvious to Ben when he looks at this.
No worries! I get to feel like I can do something in C. =) Enjoy the vacation!
But, for when you’re back, some more info: I tried changing the sizeof( ) in both malloc( ) and tsk_memcpy( ) and it didn’t change things (and, looking through, it doesn’t seem wrong, although it’s not clear to me it should be char * and not char). However, if I leave the malloc alone but change the copy to
(not char *, which is 8x as large), then it runs okay. This still doesn’t seem right, so I’m not going to file a PR.
However:
This still crashes, though? And even changing this to other->user_alleles_mem = tsk_malloc(100 * total_len * sizeof(char *)); doesn’t fix it. I’m confused!
Ah, good point, there is a complication here. For genotypes:81 the strings are null-terminated, as tsk_variant_init takes an array of null-terminated strings, but here I don’t think they are, as they can come from the “non-user” alleles. I put the memcpy here instead as I realised that at the time, but I think I’ve gotten the type wrong for sizeof.
No, it’s a fixed size (sizeof(tsk_variant_t)), so that couldn’t happen. It’s also a very small malloc, so super unlikely to be one that fails. I’ll commit in the fix anyway, though, as you say.
And, the mac in question isn’t ARM, it’s x86 (according to uname -a).
Yes; but this bug probably exists on other platforms too, but just doesn’t happen to get triggered. My guess is that it’s probably an uninitialized pointer or value, which often just happens to be zero, but which is sometimes non-zero and then boom. On some platforms that may be less likely to go boom, because of the platform’s kernel memory policies etc., but it’s still a bug that needs to be found and fixed. I would be very surprised if the bug was actually platform-specific, in the sense that there turns out to be no bug in the tskit code, but only a bug in macOS 10.15 that tskit collides with.
In any case, we have many users on old versions of macOS; SLiM supports back to macOS 10.13 IIRC, and I still get people complaining that they don’t want to upgrade. :-> People start a research project on a particular software stack, and they don’t want to touch that software stack until their project is completed. (Of course they should then not be updating to the latest tskit either, but… well, users, whaddya gonna do.)
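On the earlier point about allele strings not being null-terminated: a small sketch of why the copy has to go by explicit lengths rather than by looking for terminators. This assumes alleles are stored back to back in one buffer with a separate table of lengths, which is only my reading of the thread, not a confirmed description of the tskit layout. pack_alleles and unpack_alleles are hypothetical helpers.

```python
def pack_alleles(alleles):
    """Concatenate allele byte strings into one buffer plus a length table.

    No terminator bytes are written, so the lengths are the only record of
    where one allele ends and the next begins.
    """
    lengths = [len(allele) for allele in alleles]
    return b"".join(alleles), lengths


def unpack_alleles(buffer, lengths):
    """Recover the alleles by slicing at offsets derived from the lengths."""
    alleles = []
    offset = 0
    for length in lengths:
        alleles.append(buffer[offset:offset + length])
        offset += length
    return alleles


buffer, lengths = pack_alleles([b"A", b"ACGT", b""])
# The number of bytes to copy is the sum of the lengths, nothing more.
total_len = sum(lengths)
copied = bytes(buffer[:total_len])
print(unpack_alleles(copied, lengths))
```

A strcpy-style copy would have nothing to stop on in a buffer like this, so a length-based copy is the natural choice, and the byte count it needs is just the summed lengths.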
I’ve verified that the segfault happens with the main branch of tskit.
But, after the segfault appeared pretty reliably (once every 2 or 3 times I run that script), now, in the same session, I can’t get it to reoccur. No idea why. I don’t know where to go from here.
Ok, I can get the segfault to happen while ssh’ed in. If the tree sequence produced by the script test_____sexual_nonwf.slim (attached; not really a .txt file: test.trees.txt) is saved as test.trees and then this script is run:
then we get a segfault somewhere not far below 1/2 of the time. This is on a (macos-10.15, 3.8) github actions instance, and uname -a says
I cannot reproduce this on the mac I have access to.
To reproduce this while ssh’ing in, what needs to be done is:
… repeating that last one until it fails.
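Since the crash is intermittent, the "repeat until it fails" step can be automated. The following is a minimal sketch and not the commands used in this thread: run_until_crash is a hypothetical helper, and it relies on the POSIX convention that subprocess reports a child killed by a signal as a negative return code.

```python
import signal
import subprocess
import sys


def run_until_crash(cmd, max_runs=20):
    """Run cmd repeatedly until one run is killed by a signal.

    Returns (run_number, signal_name) for the first such run, or None if
    all max_runs runs exit on their own. POSIX only: a negative returncode
    means the child was terminated by that signal number.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode < 0:
            return attempt, signal.Signals(-result.returncode).name
    return None


# Hypothetical usage, where repro.py stands in for the test script:
#   run_until_crash([sys.executable, "-X", "faulthandler", "repro.py"])
# The -X faulthandler option asks Python to dump a traceback on a fatal signal.
```

Capping the number of runs keeps the loop from spinning forever on a session where the crash refuses to reappear, which is exactly what happened above.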