tarpaulin: Instrumentation address clash errors and segfaults
as seen in failing builds https://circleci.com/gh/holochain/holochain-rust/1514
using 0.6.11
and nightly-2018-12-26
I’ve tried various flags passed to tarpaulin, but builds always seem to fail with this error
also seeing Error a segfault occured when executing test
potentially related https://github.com/xd009642/tarpaulin/issues/35 as we are using threads
About this issue
- State: closed
- Created 5 years ago
- Comments: 83 (42 by maintainers)
Commits related to this issue
- temporarily disable code coverage on travis rust-lang/rust#52478 xd009642/tarpaulin#161 xd009642/tarpaulin#190 — committed to brunocodutra/reducer by brunocodutra 5 years ago
- travis: disable coverage on tests with threads Spawning threads in tests hits a tarpaulin bug preventing collection of coverage, so disable them for now. cf https://github.com/xd009642/tarpaulin/iss... — committed to Smithay/wayland-rs by elinorbgr 5 years ago
- chore(ci): don't run coverage tests for now They fail due to https://github.com/xd009642/tarpaulin/issues/190 — committed to Cogitri/tmplgen by deleted user 5 years ago
- travis: run tarpaulin on stable tarpaulin can now again be compiled on stable, and when run on stable it does not yet show the multithreading bug. see https://github.com/xd009642/tarpaulin/issues/19... — committed to Smithay/wayland-rs by elinorbgr 5 years ago
- Working on #190 and #207 The two issues seem related so grouping all work here. From what I've found there's some issues with bad jumps or addresses during the run which can cause these errors either... — committed to xd009642/tarpaulin by xd009642 5 years ago
- Merge branch 'event_queue' into develop This fixes #190 and potentially some other things. — committed to xd009642/tarpaulin by xd009642 5 years ago
- Travis CI: Prevent acceptor tests from being run by tarpaulin The tests require creating threadpools, which will not work until https://github.com/xd009642/tarpaulin/issues/190 is fixed. — committed to str4d/ire by str4d 5 years ago
- Ignore coverage for now It keeps failing due to xd009642/tarpaulin#190. — committed to jonhoo/inferno by jonhoo 5 years ago
- Disable coverage for now It keeps failing due to xd009642/tarpaulin#190 — committed to jonhoo/left-right by jonhoo 5 years ago
- Don't run tarpaulin with '--test-threads=1' See https://github.com/xd009642/tarpaulin/issues/190 and https://xd009642.github.io/2019/10/02/Tarpaulin-and-the-futures.html — committed to nbigaouette/hygeia by nbigaouette 5 years ago
And it passed! I’ll keep it in develop until the 3rd and then I’ll merge it, just to give anyone a chance to find issues. I’m very happy with this currently though I’m seeing tarpaulin do things it’s never done before 😄
Nice! So as this issue ended up resolving two different issues I’m closing it and if there’s another segfault or sigill a new issue should be opened 🎉
Can’t wait for the fix! In the meantime, I found a workaround by forcing execution to a single core using `taskset -c 0 cargo tarpaulin ...` or forcing Docker to a single core using `docker run --cpuset-cpus="0" ...`
It’s much slower obviously, but still better than having no coverage report.
Yeah that was it. And I’m doing a run of 100. With latest master tarpaulin it segfaulted every time I tried. So far this fix has worked every time (but that’s just 3 times). I’ll let it get through all 100 runs but if it passes that I’m merging and might do a release tonight 👀
Edit: Release tonight is coming, >1000 runs not a single segfault or sigill!
I know but unfortunately I’m still stuck on how to actually solve this, plus big job change and move has kept me busy irl 😢 I’ll dedicate serious time to this this weekend and see if I can make any progress past my multiple not working experiments
So today I think I came up with a solution to this problem and potentially #207 and #35. But this is going to be a reasonable change in the state machine so may not drop this week.
Essentially, instead of getting the latest signal, handling it, collecting coverage then resuming I’m going to get all available signals and do two passes of the list.
More technical explanation for the interested below
Essentially, if we have an instruction whose encoding is more than 1 byte, e.g. `MOV`, and instrument it, we write the interrupt instruction to the start of the word. Therefore `MOV r/m8,r8`, which is `88 /r`, could become `CC /r`.
Then when we get the interrupt the program counter is at the `/r`, so we replace `CC` with the original instruction, move the program counter back 1 and then execute the original instruction with either a `ptrace` continue or a step.
However, there’s a problem with this. If the interrupt is hit by two threads simultaneously then the old way I’d re-enable the instruction, move the first thread back and then use `ptrace` to continue. But `ptrace` continues and steps are applied to all threads in the process. So now we have a second thread that’s in the middle of an instruction, which then tries to execute an invalid instruction.
By calling waitpid until it stops giving me signals, however, I can build up a view of all the threads that have been trapped and when those traps would cause this issue, and solve it. In theory.
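To make the race concrete, here is a toy simulation of it. It does not call ptrace or waitpid at all: `Tracee`, `stranded`, `handle_first_only` and `handle_all_pending` are hypothetical names invented for this sketch, the ModR/M byte `C1` is just an example, and the "pending" list stands in for whatever draining waitpid would return. It only models the byte patching and the per-thread program counters described above.

```rust
/// x86 `int3` opcode, the one-byte breakpoint written over the instruction.
const INT3: u8 = 0xCC;

/// Toy model of a traced process: shared code bytes plus one program
/// counter per thread. Nothing here touches real ptrace.
struct Tracee {
    code: Vec<u8>,
    pcs: Vec<usize>,
}

impl Tracee {
    /// Overwrite the byte at `addr` with `int3` and return the original byte.
    fn instrument(&mut self, addr: usize) -> u8 {
        std::mem::replace(&mut self.code[addr], INT3)
    }

    /// A thread executes the trap: its program counter moves past the `int3`.
    fn hit_trap(&mut self, thread: usize) {
        assert_eq!(self.code[self.pcs[thread]], INT3);
        self.pcs[thread] += 1;
    }

    /// Put the original byte back at `addr`.
    fn restore(&mut self, addr: usize, original: u8) {
        self.code[addr] = original;
    }

    /// Move a thread's program counter back by one byte.
    fn rewind(&mut self, thread: usize) {
        self.pcs[thread] -= 1;
    }

    /// Threads whose program counter sits strictly inside the instruction
    /// occupying `addr..addr + len`, i.e. threads that would decode garbage.
    fn stranded(&self, addr: usize, len: usize) -> Vec<usize> {
        (0..self.pcs.len())
            .filter(|&i| self.pcs[i] > addr && self.pcs[i] < addr + len)
            .collect()
    }
}

/// Two threads hit the same instrumented two-byte instruction at address 0.
fn setup() -> (Tracee, u8) {
    // `88 C1` is one possible encoding of `MOV r/m8,r8` (opcode + ModR/M).
    let mut tracee = Tracee { code: vec![0x88, 0xC1], pcs: vec![0, 0] };
    let original = tracee.instrument(0);
    tracee.hit_trap(0);
    tracee.hit_trap(1);
    (tracee, original)
}

/// Old approach: fix up only the first reported thread, then resume everything.
fn handle_first_only() -> Vec<usize> {
    let (mut tracee, original) = setup();
    tracee.restore(0, original);
    tracee.rewind(0);
    tracee.stranded(0, 2)
}

/// Proposed approach: learn about every trapped thread before resuming.
fn handle_all_pending() -> Vec<usize> {
    let (mut tracee, original) = setup();
    let pending = vec![0, 1]; // assumed result of draining the wait queue
    tracee.restore(0, original);
    for thread in pending {
        tracee.rewind(thread);
    }
    tracee.stranded(0, 2)
}

fn main() {
    // prints "first-only: stranded threads [1]"
    println!("first-only: stranded threads {:?}", handle_first_only());
    // prints "all-pending: stranded threads []"
    println!("all-pending: stranded threads {:?}", handle_all_pending());
}
```

In the first case thread 1 is left pointing at the ModR/M byte, which is the "middle of an instruction" situation; in the second case both threads are back at the start of the restored instruction before anything is resumed.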
This is still just a working theory that I’m in the process of coding up to test. If anyone sees any mistakes in my reasoning or has any additional ideas feel free to lend your 2 cents 👍
https://xd009642.github.io/2019/10/02/Tarpaulin-and-the-futures.html
So I thought one of the patches to get coz-rs working might solve this (increasing the signal stack, as overflowing it causes segfaults). Didn’t work, but the process of trying it led me to get some stats on how often this happens. I’m observing it in 11% of runs with the minimum example from before:
Hopefully now with a test for this in one of my local branches and vague statistics on how often it occurs I’ll be able to find when I’m on the right track easier! Current work is on the branch segfault-investigation
Ok so ran on rust-evmap as @jonhoo mentioned it previously. With the change in the branch `affinity` did 100 runs with no SIGILL or SIGSEGV. Running on the current latest tarpaulin on crates.io and I can already see some failures before finishing the 100 runs.
So yeah if anyone wants to try it out and report back that would be appreciated!
Edit: results for the latest release. Failed 11/100 times!
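For anyone wanting to collect the same kind of failure rate on their own project, a minimal sketch of a repeat-run harness follows. It is not the script used for the numbers above; `count_failures` and `command_succeeds` are hypothetical helpers, and the command to repeat is taken from the harness's own arguments rather than hard-coded.

```rust
use std::process::{Command, Stdio};

/// Call `run_once` `runs` times and count how often it reports failure.
/// `run_once` returns true when the run succeeded.
fn count_failures<F: FnMut() -> bool>(runs: usize, mut run_once: F) -> usize {
    (0..runs).filter(|_| !run_once()).count()
}

/// Run a command with its output discarded. A non-zero exit code, death by
/// signal (e.g. SIGSEGV), or failure to start all count as "not succeeded".
fn command_succeeds(program: &str, args: &[String]) -> bool {
    Command::new(program)
        .args(args)
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

fn main() {
    // Everything after the harness's own name is the command to repeat.
    let args: Vec<String> = std::env::args().skip(1).collect();
    match args.split_first() {
        Some((program, rest)) => {
            let runs = 100;
            let failures = count_failures(runs, || command_succeeds(program, rest));
            println!("failed {}/{} runs", failures, runs);
        }
        None => eprintln!("usage: pass the command to repeat, e.g. cargo tarpaulin"),
    }
}
```

Keeping the counting separate from the process spawning means the same loop can wrap any check, not only an exit status.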
While there might be more issues with this ^, I just wanted to say thanks a lot for all the hard work on this release. This fixed all the issues I had with tarpaulin on a mid-sized project that had heavy use of rayon. It still crashes with --test-threads 1, but removing that flag actually worked 👍
New version 0.9.0 is being released as we speak and the docker image will be updated as part of that process once the really long travis build finishes!
It’s been “fun” but it’s now time to close this issue for good!
@vberger I’ll do a more thorough write up and consider posting it as a blog or just put it here. But this fix is two parts, the first is the queue of pending wait signals that I did a few months ago which fixed a big tarpaulin issue with multi-threading and is what allowed me to remove the test_threads=1 limitation. The test_threads 1 is based on how libtest has changed its test runner and I’ll also go into details on that 😄
I’m now running tarpaulin on a project with 9 tests using thread pools 100 times. Once this is done if there are no failures I’ll merge the branch into develop
@xd009642 It’s okay, life happens, and your well-being is the most important! I’m sure if you gave some instructions for how others can help track this down, there are some people here who’d love to help 😃
@xd009642 I think this is a pretty high priority issue. It is the thing that’s preventing me from using tarpaulin in nearly all my crates at the moment.
I just felt it was necessary to comment on this 😃
I’m sure this is being worked on as best as possible in the limited time everyone participating in this FOSS project has, so sit tight and wait for a fix to arrive please.
Also, next time you report a bug, please include more info: What code is this being run on? What’s the complete log of tarpaulin in debug mode? What version of tarpaulin are you using, and so on.
I’ll try that tonight. And have fun at the conference 👍
So there’s a chance the SIGILL and SIGSEGV might be the same issue just manifesting in slightly different ways. I’ve got an experiment currently running to see if I’ve made any progress or not. If I have I’ll push the branch and ask people here to test it out on their own projects 👀
Although this is being reopened, my build now works beautifully with the new change and my project makes some pretty extensive use of the futures ecosystem. Amazing work
https://dev.azure.com/toshi-search/toshi-search/_build/results?buildId=354
It’s on the develop tag, master coming in tonight since there’s been no reported regressions
So I just got output that passes with no segfaults for the first time! Still analysing results and figuring it out but here’s a picture for interested parties of the run
It didn’t make an improvement at all… To try and aid in debugging for me and anyone who wants to take a crack at it I’ve gutted tarpaulin, keeping only the ptrace and statemachine stuff, and added some code to push events into a timeline of what happens and generate a timeline with gnuplot: https://github.com/xd009642/minitarp. It’s still very much a WIP but here’s some example output with the futures project in `tests/data/futures` in the branch `segfault-investigation`.
Currently, the most interesting thing to me is why some threads appear but I don’t appear to get a ptrace clone event 😕
So you can’t debug tarpaulin with gdb because ptrace can’t trace ptrace. Using the --debug flag on tarpaulin and the dwarf dumps/disassembly from objdump on the test binary, try to reconstruct the behaviour of the program and work out where there could be essentially threading conflicts (because the binary is multithreaded but ptrace gives you a single threaded view into it).
Another route is working out how kcov and tarpaulin differ and seeing if the differences could be the root of the problem. The developer wiki ptrace section may be of help, as may all the previous comments in this thread.
So as a slight update I’ve added extra debug logging to help diagnose internal issues in tarpaulin, activated via the `--debug` flag (develop branch only for now). It can spew a ton of info so anyone who wants to add anything to the issue, attach a file if it’s long or post a link to a gist.
For all the examples on this issue which recreated the issue for me, occasionally I had a segfault and other times tarpaulin just ended up hanging and had to be killed. I’ve figured out what was causing it to hang! I was assuming anything that wasn’t a trap or a segfault was an ignorable/forwardable signal. The times tarpaulin was hanging was because of a SIGILL - illegal instruction.
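The hang described here comes down to how a stop signal gets classified. The sketch below is not tarpaulin's actual state machine: the `Action` enum and `classify` function are hypothetical, and the signal numbers are the Linux x86-64 values, hard-coded to keep the example dependency-free. It only illustrates treating SIGILL as fatal alongside SIGSEGV instead of letting it fall through to the "forward it" arm.

```rust
// Signal numbers as defined on Linux x86-64 (assumed platform).
const SIGILL: i32 = 4;
const SIGTRAP: i32 = 5;
const SIGSEGV: i32 = 11;

/// What the tracer should do with a signal that stopped the test binary.
#[derive(Debug, PartialEq)]
enum Action {
    /// A breakpoint was hit: record coverage for that address.
    Trap,
    /// The test binary crashed: report it and stop waiting.
    Fatal(&'static str),
    /// Anything else is handed back to the test binary untouched.
    Forward(i32),
}

fn classify(signal: i32) -> Action {
    match signal {
        SIGTRAP => Action::Trap,
        SIGSEGV => Action::Fatal("SIGSEGV"),
        // Without this arm an illegal instruction lands in the catch-all
        // below and is treated as forwardable, the assumption described above.
        SIGILL => Action::Fatal("SIGILL"),
        other => Action::Forward(other),
    }
}

fn main() {
    for signal in [SIGTRAP, SIGSEGV, SIGILL, 10] {
        println!("signal {} -> {:?}", signal, classify(signal));
    }
}
```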
Hopefully, now I’ve got an area to focus in on it shouldn’t take me too long to resolve this!