tenderjit: Test suite occasionally fails, with a segfault at 0x0
While running the test suite many times in sequence, I’ve noticed that TJ sometimes segfaults.
The failure is very nondeterministic. Sometimes it takes a few runs, sometimes it doesn’t happen in a hundred runs. It does not depend on the test seed.
Since my last two PRs were quite sensitive, I’ve checked if the issue was present before they were merged, and I can confirm that the issue was already present.
Some sample failures are below. I think the only pointer (haha) they give is that, since the segfault address is 0, this is either a null pointer dereference or, I think more likely, a misaligned stack.
A very long run on Mac could confirm whether this is Linux-only or cross-platform. Unfortunately, since even 100 runs don’t guarantee a failure, it may be hard to reproduce.
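For anyone trying to reproduce this, the "many sequential runs" approach can be sketched roughly as below. This is only a sketch: the helper name `run_until_failure` is hypothetical, and the command that runs the suite (shown as `bundle exec rake test` in the usage comment) is an assumption, so substitute whatever you use locally.

```ruby
require "open3"

# Hypothetical helper: runs `cmd` up to `max_runs` times and returns
# [run_number, combined_output] for the first failing run, or nil if
# every run passes.
def run_until_failure(cmd, max_runs: 100)
  1.upto(max_runs) do |run|
    # capture2e merges stdout and stderr, so the [BUG] report is kept.
    output, status = Open3.capture2e(*cmd)
    # success? is nil when the process was killed by a signal, so a crash
    # counts as a failure here, just like a non-zero exit code.
    return [run, output] unless status.success?
  end
  nil
end

# Usage (assumed command, adjust to your setup):
#   failure = run_until_failure(%w[bundle exec rake test], max_runs: 100)
#   puts failure ? "Failed on run #{failure[0]}:\n#{failure[1]}" : "No failure"
```

Since the failure does not depend on the seed, the loop keeps no state between runs; it just saves the output of the first crash so the control frame dump is not lost.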
Sample 1:
......S...........SS......SSS.S....S...S.SS....S.S..S../home/saverio/code/fisk-dev/lib/fisk.rb:847: [BUG] Segmentation fault at 0x0000000000000000
ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0040 p:---- s:0223 e:000222 CFUNC :zip
c:0039 p:0023 s:0218 e:000217 BLOCK /home/saverio/code/fisk-dev/lib/fisk.rb:847 [FINISH]
c:0038 p:---- s:0214 e:000213 IFUNC
c:0037 p:---- s:0211 e:000210 CFUNC :each
c:0036 p:---- s:0208 e:000207 CFUNC :find_all
c:0035 p:0007 s:0204 e:000203 METHOD /home/saverio/code/fisk-dev/lib/fisk.rb:845
c:0034 p:0022 s:0192 e:000191 METHOD /home/saverio/code/fisk-dev/lib/fisk/instructions.rb:1400
c:0033 p:0014 s:0187 e:000186 METHOD /home/saverio/code/tenderjit-dev/lib/tenderjit/runtime.rb:198
c:0032 p:0049 s:0181 e:000180 BLOCK /home/saverio/code/tenderjit-dev/lib/tenderjit/iseq_compiler.rb:2312
c:0031 p:0038 s:0177 e:000176 METHOD /home/saverio/code/tenderjit-dev/lib/tenderjit/iseq_compiler.rb:3001
c:0030 p:0011 s:0171 e:000170 METHOD /home/saverio/code/tenderjit-dev/lib/tenderjit/iseq_compiler.rb:2307
c:0029 p:0360 s:0166 e:000165 METHOD /home/saverio/code/tenderjit-dev/lib/tenderjit/iseq_compiler.rb:130
c:0028 p:0137 s:0154 e:000153 METHOD /home/saverio/code/tenderjit-dev/lib/tenderjit/iseq_compiler.rb:84
c:0027 p:0148 s:0149 e:000148 METHOD /home/saverio/code/tenderjit-dev/lib/tenderjit.rb:558
c:0026 p:0009 s:0140 e:000139 METHOD /home/saverio/code/tenderjit-dev/lib/tenderjit/iseq_compiler.rb:557 [FINISH]
Segmentation fault
Sample 2:
.S..S.............S.SS.S.S....SSS...SS...................SS.........SS......./home/saverio/code/tenderjit-dev/lib/tenderjit/iseq_compiler.rb:45: [BUG] Segmentation fault at 0x0000000000000000
ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0029 p:0294 s:0163 e:000162 METHOD /home/saverio/code/tenderjit-dev/lib/tenderjit/iseq_compiler.rb:45 [FINISH]
c:0028 p:---- s:0155 e:000154 CFUNC :new
c:0027 p:0125 s:0149 e:000148 METHOD /home/saverio/code/tenderjit-dev/lib/tenderjit.rb:552
c:0026 p:0009 s:0140 e:000139 METHOD /home/saverio/code/tenderjit-dev/lib/tenderjit/iseq_compiler.rb:557 [FINISH]
Segmentation fault
About this issue
- State: closed
- Created 3 years ago
- Comments: 26 (18 by maintainers)
Commits related to this issue
- Merge pull request #121 from tenderlove/various-bugs Fix various bugs related to #110 — committed to tenderlove/tenderjit by tenderlove 2 years ago
I haven’t seen any segv errors since #121 so I’ll close this. Thanks so much for helping me find the issues! I couldn’t have done it without the GC.start tip 😁
I’m going to be pretty busy for the next week with holiday stuff, but I’ve been thinking about this problem and I wonder if it’s that we’re pushing a block handler on the stack but not executing the write barrier? Just a guess and a thought crossing my mind (I just wanted to write it down so when I come back to this I’ll remember 😅)
It’s possible. But looking at the stack trace it seems like maybe frame pushing is broken. We might be sticking the wrong value in the EP, then GC happens to run when there’s a frame that’s been pushed by the JIT. Also it looks like a problem with frames pushed when entering a block.
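If the crash really depends on when GC happens to run, forcing collections can make it show up sooner. The following is a minimal sketch of that idea, not the exact tip from this thread: `with_gc_stress` is a hypothetical helper, and the block body is a stand-in for whatever JIT-compiled code is under suspicion. `GC.start` and `GC.stress` are standard Ruby APIs.

```ruby
# Hypothetical helper: runs the block with GC.stress enabled, so the GC runs
# at every opportunity, then restores the previous setting.
def with_gc_stress
  previous = GC.stress
  GC.stress = true
  yield
ensure
  GC.stress = previous
end

# Stand-in workload: in practice this would call the code being debugged.
result = with_gc_stress do
  GC.start # force a full collection at a known point
  Array.new(10) { |i| "object #{i}" }.size
end
```

Running under `GC.stress` is very slow, so it is usually better to wrap only the suspicious region rather than the whole suite.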
I’ll try again to reproduce this. Apparently the crash I was seeing is not this crash 😅
Amazing work! 🤩 🤩 🤩 I can’t reproduce the error(s) anymore (both by mass-running the test suite, and by running it with the seed that previously made it fail).
Thanks 🙏
If you don’t experience any errors presumably linked to this, it can be closed 😄
Actually now I’m not sure. It looks like we have a check for the write barrier. Maybe the exit isn’t working or there’s a bug in the predicate method?
@64kramsystem I’ll take a look at this today. I’ve been meaning to make a debug task because getting TJ running under a debugger isn’t really the easiest 😅