tarpaulin: Instrumentation address clash errors and segfaults

As seen in failing builds, e.g. https://circleci.com/gh/holochain/holochain-rust/1514

Using tarpaulin 0.6.11 and nightly-2018-12-26.

I’ve tried passing various flags to tarpaulin, but builds always seem to fail with this error.

Also seeing “Error a segfault occured when executing test”.

Potentially related to https://github.com/xd009642/tarpaulin/issues/35, as we are using threads.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 83 (42 by maintainers)

Most upvoted comments

And it passed! I’ll keep it in develop until the 3rd and then I’ll merge it, just to give anyone a chance to find issues. I’m very happy with this currently, though; I’m seeing tarpaulin do things it’s never done before 😄

Nice! So, as this issue ended up resolving two different issues, I’m closing it; if there’s another segfault or SIGILL, a new issue should be opened 🎉

Can’t wait for the fix! In the meantime, I found a workaround: force execution onto a single core using taskset -c 0 cargo tarpaulin ... or force Docker to a single core using docker run --cpuset-cpus="0" ...

It’s much slower obviously, but still better than having no coverage report.

Yeah, that was it. And I’m doing a run of 100. With the latest master tarpaulin it segfaulted every time I tried. So far this fix has worked every time (but that’s just 3 times). I’ll let it get through all 100 runs, but if it passes them all I’m merging and might do a release tonight 👀

Edit: the release tonight is coming; >1000 runs and not a single segfault or SIGILL!

I know, but unfortunately I’m still stuck on how to actually solve this, plus a big job change and move have kept me busy IRL 😢 I’ll dedicate serious time to this over the weekend and see if I can make any progress past my multiple non-working experiments.

So today I think I came up with a solution to this problem, and potentially #207 and #35. But it’s going to be a sizeable change to the state machine, so it may not drop this week.

Essentially, instead of getting the latest signal, handling it, collecting coverage, and then resuming, I’m going to get all available signals and do two passes over the list.

More technical explanation for the interested below

Essentially, if we instrument an instruction that is more than 1 byte long, e.g. a MOV, we write the interrupt instruction over the start of it. So MOV r/m8,r8, which is encoded as 88 /r, could become CC /r.

Then, when we get the interrupt, the program counter is at the /r, so we replace the CC with the original opcode byte, move the program counter back 1, and then execute the original instruction with either a ptrace continue or a step.
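
In rough code the idea looks something like this (just a sketch of the mechanism described above, not tarpaulin’s actual implementation; peek_word, poke_word, set_pc and step are hypothetical helpers standing in for the underlying ptrace PEEKDATA/POKEDATA/SETREGS/SINGLESTEP calls):

// Sketch only: illustrates the breakpoint mechanism described above.
const INT3: u64 = 0xCC;

/// Overwrite the first byte of the instruction at `addr` with 0xCC (int3),
/// remembering the original byte so it can be restored later.
fn insert_breakpoint(pid: i32, addr: u64) -> u8 {
    let word = peek_word(pid, addr);
    let original = (word & 0xFF) as u8;
    poke_word(pid, addr, (word & !0xFF) | INT3);
    original
}

/// After the trap fires: restore the original opcode byte, move the program
/// counter back over the int3, and single-step the real instruction.
fn handle_trap(pid: i32, addr: u64, original: u8) {
    let word = peek_word(pid, addr);
    poke_word(pid, addr, (word & !0xFF) | original as u64);
    set_pc(pid, addr); // pc is currently one past the int3 byte
    step(pid);         // execute the original instruction
    // ...the 0xCC can then be written back if the line should stay instrumented.
}

// Hypothetical ptrace wrappers (PEEKDATA/POKEDATA/SETREGS/SINGLESTEP); elided.
fn peek_word(_pid: i32, _addr: u64) -> u64 { unimplemented!() }
fn poke_word(_pid: i32, _addr: u64, _word: u64) { unimplemented!() }
fn set_pc(_pid: i32, _addr: u64) { unimplemented!() }
fn step(_pid: i32) { unimplemented!() }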

However, there’s a problem with this. If the interrupt is hit by two threads simultaneously, then under the old approach I’d re-enable the instruction, move the first thread back, and then use ptrace to continue. But ptrace continues and steps are applied to all threads in the process, so now we have a second thread that’s in the middle of an instruction and then tries to execute an invalid one.

By calling waitpid until it stops giving me signals, however, I can build up a view of all the threads that have been trapped and of when those traps would cause this issue, and solve it. In theory.
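
Something along these lines (a rough sketch using the nix crate’s waitpid; the real state-machine handling is more involved than shown here):

use nix::sys::wait::{waitpid, WaitPidFlag, WaitStatus};
use nix::unistd::Pid;

/// Drain every pending stop before handling any of them, instead of acting
/// on the first signal that comes in. Rough sketch of the idea only.
fn collect_pending_stops() -> nix::Result<Vec<WaitStatus>> {
    let mut stops = Vec::new();
    loop {
        // -1 = any child/thread; __WALL so clone()d threads are seen too,
        // WNOHANG so we stop looping once nothing else is queued.
        match waitpid(
            Pid::from_raw(-1),
            Some(WaitPidFlag::WNOHANG | WaitPidFlag::__WALL),
        )? {
            WaitStatus::StillAlive => break,
            status => stops.push(status),
        }
    }
    // First pass: restore any instrumented instructions these threads are
    // sitting on. Second pass: record coverage and resume each thread.
    Ok(stops)
}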

This is still just a working theory that I’m in the process of coding up to test. If anyone sees any mistakes in my reasoning or has any additional ideas, feel free to lend your 2 cents 👍

So I thought one of the patches to get coz-rs working might solve this (increasing the signal stack, as overflowing it causes segfaults). It didn’t work, but the process of trying it led me to gather some stats on how often this happens. I’m observing it in 11% of runs with the minimal example from before:

// Minimal repro: two #[test] functions that each spin up a futures ThreadPool;
// the feature gates match the pinned nightly mentioned above.
#![feature(async_await, await_macro, futures_api)]

#[test]
pub fn a() {
    futures::executor::ThreadPool::new();
}

#[test]
pub fn b() {
    futures::executor::ThreadPool::new();
}

Hopefully, now that I have a test for this in one of my local branches and rough statistics on how often it occurs, it’ll be easier to tell when I’m on the right track! Current work is on the segfault-investigation branch.

OK, so I ran on rust-evmap, as @jonhoo mentioned it previously. With the change in the affinity branch I did 100 runs with no SIGILL or SIGSEGV. Running the current latest tarpaulin from crates.io, I can already see some failures before finishing the 100 runs.

So yeah if anyone wants to try it out and report back that would be appreciated!

Edit: results for the latest release: it failed 11/100 times!

While there might be more issues with this ^, I just wanted to say thanks a lot for all the hard work on this release. It fixed all the issues I had with tarpaulin on a mid-sized project that makes heavy use of rayon. It still crashes with --test-threads 1, but removing that flag actually worked 👍

New version 0.9.0 is being released as we speak, and the Docker image will be updated as part of that process once the really long Travis build finishes!

It’s been “fun” but it’s now time to close this issue for good!

@vberger I’ll do a more thorough write-up and consider posting it as a blog, or just put it here. But this fix has two parts: the first is the queue of pending wait signals that I did a few months ago, which fixed a big tarpaulin issue with multi-threading and is what allowed me to remove the test_threads=1 limitation. The test_threads=1 part is based on how libtest has changed its test runner, and I’ll also go into detail on that 😄

I’m now running tarpaulin 100 times on a project with 9 tests that use thread pools. Once this is done, if there are no failures, I’ll merge the branch into develop.

@xd009642 It’s okay, life happens, and your well-being is the most important! I’m sure if you gave some instructions for how others can help track this down, there are some people here who’d love to help 😃

@xd009642 I think this is a pretty high priority issue. It is the thing that’s preventing me from using tarpaulin in nearly all my crates at the moment.

I just felt it was necessary to comment on this 😃

Please help me out ASAP so that I can continue my work.

I’m sure this is being worked on as well as possible in the limited time everyone participating in this FOSS project has, so please sit tight and wait for a fix to arrive.

Also, next time you report a bug, please include more info: What code is this being run on? What’s the complete log of tarpaulin in debug mode? What version of tarpaulin are you using, and so on.

I’ll try that tonight. And have fun at the conference 👍

So there’s a chance the SIGILL and SIGSEGV might be the same issue just manifesting in slightly different ways. I’ve got an experiment currently running to see if I’ve made any progress or not. If I have I’ll push the branch and ask people here to test it out on their own projects 👀

Although this is being reopened, my build now works beautifully with the new change and my project makes some pretty extensive use of the futures ecosystem. Amazing work

https://dev.azure.com/toshi-search/toshi-search/_build/results?buildId=354

It’s on the develop tag; master is coming tonight since there have been no reported regressions.

So I just got output that passes with no segfaults for the first time! Still analysing the results and figuring it out, but here’s a picture of the run for interested parties:

[image: output_pass]

It didn’t make an improvement at all… To aid in debugging for me and anyone who wants to take a crack at it, I’ve gutted tarpaulin, keeping only the ptrace and state-machine stuff, and added some code to push events into a timeline of what happens and plot it with gnuplot: https://github.com/xd009642/minitarp. It’s still very much a WIP, but here’s some example output with the futures project in tests/data/futures in the segfault-investigation branch.

Currently, the most interesting thing to me is why some threads appear for which I don’t appear to get a ptrace clone event 😕

[image: output]

So you can’t debug tarpaulin with gdb, because ptrace can’t trace ptrace. Instead, using the --debug flag on tarpaulin and the DWARF dumps/disassembly from objdump on the test binary, try to reconstruct the behaviour of the program and work out where there could be, essentially, threading conflicts (because the binary is multithreaded but ptrace gives you a single-threaded view into it).

Another route is working out how kcov and tarpaulin differ and seeing if the differences could be the root of the problem. The ptrace section of the developer wiki may be of help, as may all the previous comments in this thread.

So, as a slight update, I’ve added extra debug logging to help diagnose internal issues in tarpaulin, activated via the --debug flag (develop branch only for now). It can spew a ton of info, so anyone who wants to add anything to the issue should attach a file if it’s long, or post a link to a gist.

For all the examples on this issue that recreated it for me, I occasionally got a segfault and other times tarpaulin just ended up hanging and had to be killed. I’ve figured out what was causing it to hang! I was assuming anything that wasn’t a trap or a segfault was an ignorable/forwardable signal. The times tarpaulin was hanging were because of a SIGILL (illegal instruction).
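
To show the shape of the bug (a simplified sketch, not tarpaulin’s actual state machine), the signal dispatch was effectively a match where only traps and segfaults got special handling, so a SIGILL fell into the catch-all and the run never terminated:

use nix::sys::signal::Signal;
use nix::sys::wait::WaitStatus;

// Simplified illustration of the dispatch described above.
fn handle(status: WaitStatus) {
    match status {
        WaitStatus::Stopped(_, Signal::SIGTRAP) => {
            // breakpoint trap: collect coverage, restore the instruction, resume
        }
        WaitStatus::Stopped(_, Signal::SIGSEGV) => {
            // report the segfault and stop the run
        }
        WaitStatus::Stopped(_, Signal::SIGILL) => {
            // the case that was effectively missing: an illegal instruction
            // also needs to end the run instead of being forwarded forever
        }
        _ => {
            // everything else was forwarded to the test and waiting resumed,
            // which is where a SIGILL used to end up, hence the hang
        }
    }
}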

Hopefully, now that I’ve got an area to focus in on, it shouldn’t take me too long to resolve this!