tarpaulin: Instrumentation address clash errors and segfaults

As seen in failing builds, e.g. https://circleci.com/gh/holochain/holochain-rust/1514

Using tarpaulin 0.6.11 and nightly-2018-12-26.

I’ve tried passing various flags to tarpaulin, but builds always seem to fail with this error.

Also seeing “Error a segfault occured when executing test”.

Potentially related to https://github.com/xd009642/tarpaulin/issues/35, as we are using threads.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 83 (42 by maintainers)

Most upvoted comments

And it passed! I’ll keep it in develop until the 3rd and then I’ll merge it, just to give anyone a chance to find issues. I’m very happy with this currently, though; I’m seeing tarpaulin do things it’s never done before 😄

Nice! So, as this issue ended up resolving two different issues, I’m closing it; if there’s another segfault or SIGILL, a new issue should be opened 🎉

Can’t wait for the fix! In the meantime, I found a workaround: force execution onto a single core using taskset -c 0 cargo tarpaulin ... or force Docker to a single core using docker run --cpuset-cpus="0" ...

It’s much slower obviously, but still better than having no coverage report.

Yeah, that was it. And I’m doing a run of 100. With the latest master tarpaulin it segfaulted every time I tried. So far this fix has worked every time (but that’s just 3 times). I’ll let it get through all 100 runs, but if it passes them all I’m merging and might do a release tonight 👀

Edit: the release tonight is coming; >1000 runs and not a single segfault or SIGILL!

I know, but unfortunately I’m still stuck on how to actually solve this, plus a big job change and move have kept me busy IRL 😢 I’ll dedicate serious time to this over the weekend and see if I can make any progress past my multiple non-working experiments.

So today I think I came up with a solution to this problem, and potentially #207 and #35. But it’s going to be a sizeable change to the state machine, so it may not drop this week.

Essentially, instead of getting the latest signal, handling it, collecting coverage, and then resuming, I’m going to get all available signals and do two passes over the list.

More technical explanation for the interested below

Essentially, if we instrument an instruction that is more than 1 byte long, e.g. a MOV, we write the interrupt instruction over the start of it. So MOV r/m8,r8, which is encoded as 88 /r, could become CC /r.

Then, when we get the interrupt, the program counter is at the /r, so we replace the CC with the original opcode byte, move the program counter back 1, and then execute the original instruction with either a ptrace continue or a step.
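
In rough code the idea looks something like this (just a sketch of the mechanism described above, not tarpaulin’s actual implementation; peek_word, poke_word, set_pc and step are hypothetical helpers standing in for the underlying ptrace PEEKDATA/POKEDATA/SETREGS/SINGLESTEP calls):

// Sketch only: illustrates the breakpoint mechanism described above.
const INT3: u64 = 0xCC;

/// Overwrite the first byte of the instruction at `addr` with 0xCC (int3),
/// remembering the original byte so it can be restored later.
fn insert_breakpoint(pid: i32, addr: u64) -> u8 {
    let word = peek_word(pid, addr);
    let original = (word & 0xFF) as u8;
    poke_word(pid, addr, (word & !0xFF) | INT3);
    original
}

/// After the trap fires: restore the original opcode byte, move the program
/// counter back over the int3, and single-step the real instruction.
fn handle_trap(pid: i32, addr: u64, original: u8) {
    let word = peek_word(pid, addr);
    poke_word(pid, addr, (word & !0xFF) | original as u64);
    set_pc(pid, addr); // pc is currently one past the int3 byte
    step(pid);         // execute the original instruction
    // ...the 0xCC can then be written back if the line should stay instrumented.
}

// Hypothetical ptrace wrappers (PEEKDATA/POKEDATA/SETREGS/SINGLESTEP); elided.
fn peek_word(_pid: i32, _addr: u64) -> u64 { unimplemented!() }
fn poke_word(_pid: i32, _addr: u64, _word: u64) { unimplemented!() }
fn set_pc(_pid: i32, _addr: u64) { unimplemented!() }
fn step(_pid: i32) { unimplemented!() }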

However, there’s a problem with this. If the interrupt is hit by two threads simultaneously, then under the old approach I’d re-enable the instruction, move the first thread back, and then use ptrace to continue. But ptrace continues and steps are applied to all threads in the process, so now we have a second thread that’s in the middle of an instruction and then tries to execute an invalid one.

By calling waitpid until it stops giving me signals, however, I can build up a view of all the threads that have been trapped and of when those traps would cause this issue, and solve it. In theory.
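
Something along these lines (a rough sketch using the nix crate’s waitpid; the real state-machine handling is more involved than shown here):

use nix::sys::wait::{waitpid, WaitPidFlag, WaitStatus};
use nix::unistd::Pid;

/// Drain every pending stop before handling any of them, instead of acting
/// on the first signal that comes in. Rough sketch of the idea only.
fn collect_pending_stops() -> nix::Result<Vec<WaitStatus>> {
    let mut stops = Vec::new();
    loop {
        // -1 = any child/thread; __WALL so clone()d threads are seen too,
        // WNOHANG so we stop looping once nothing else is queued.
        match waitpid(
            Pid::from_raw(-1),
            Some(WaitPidFlag::WNOHANG | WaitPidFlag::__WALL),
        )? {
            WaitStatus::StillAlive => break,
            status => stops.push(status),
        }
    }
    // First pass: restore any instrumented instructions these threads are
    // sitting on. Second pass: record coverage and resume each thread.
    Ok(stops)
}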

This is still just a working theory that I’m in the process of coding up to test. If anyone sees any mistakes in my reasoning or has any additional ideas, feel free to lend your 2 cents 👍

So I thought one of the patches to get coz-rs working might solve this (increasing the signal stack, as overflowing it causes segfaults). It didn’t work, but the process of trying it led me to gather some stats on how often this happens. I’m observing it in 11% of runs with the minimal example from before:

// Minimal repro: two #[test] functions that each spin up a futures ThreadPool;
// the feature gates match the pinned nightly mentioned above.
#![feature(async_await, await_macro, futures_api)]

#[test]
pub fn a() {
    futures::executor::ThreadPool::new();
}

#[test]
pub fn b() {
    futures::executor::ThreadPool::new();
}

Hopefully, now that I have a test for this in one of my local branches and rough statistics on how often it occurs, it’ll be easier to tell when I’m on the right track! Current work is on the segfault-investigation branch.

OK, so I ran on rust-evmap, as @jonhoo mentioned it previously. With the change in the affinity branch I did 100 runs with no SIGILL or SIGSEGV. Running the current latest tarpaulin from crates.io, I can already see some failures before finishing the 100 runs.

So yeah if anyone wants to try it out and report back that would be appreciated!

Edit: results for the latest release: it failed 11/100 times!

While there might be more issues with this ^, I just wanted to say thanks a lot for all the hard work on this release. It fixed all the issues I had with tarpaulin on a mid-sized project that makes heavy use of rayon. It still crashes with --test-threads 1, but removing that flag actually worked 👍

New version 0.9.0 is being released as we speak, and the Docker image will be updated as part of that process once the really long Travis build finishes!

It’s been “fun” but it’s now time to close this issue for good!

@vberger I’ll do a more thorough write-up and consider posting it as a blog, or just put it here. But this fix has two parts: the first is the queue of pending wait signals that I did a few months ago, which fixed a big tarpaulin issue with multi-threading and is what allowed me to remove the test_threads=1 limitation. The test_threads=1 part is based on how libtest has changed its test runner, and I’ll also go into detail on that 😄

I’m now running tarpaulin 100 times on a project with 9 tests that use thread pools. Once this is done, if there are no failures, I’ll merge the branch into develop.

@xd009642 It’s okay, life happens, and your well-being is the most important! I’m sure if you gave some instructions for how others can help track this down, there are some people here who’d love to help 😃

@xd009642 I think this is a pretty high priority issue. It is the thing that’s preventing me from using tarpaulin in nearly all my crates at the moment.

I just felt it was necessary to comment on this 😃

Please help me out ASAP so that I can continue my work.

I’m sure this is being worked on as well as possible in the limited time everyone participating in this FOSS project has, so please sit tight and wait for a fix to arrive.

Also, next time you report a bug, please include more info: What code is this being run on? What’s the complete log of tarpaulin in debug mode? What version of tarpaulin are you using, and so on.

I’ll try that tonight. And have fun at the conference 👍

So there’s a chance the SIGILL and SIGSEGV might be the same issue just manifesting in slightly different ways. I’ve got an experiment currently running to see if I’ve made any progress or not. If I have I’ll push the branch and ask people here to test it out on their own projects 👀

Although this is being reopened, my build now works beautifully with the new change and my project makes some pretty extensive use of the futures ecosystem. Amazing work

https://dev.azure.com/toshi-search/toshi-search/_build/results?buildId=354

It’s on the develop tag; master is coming tonight since there have been no reported regressions.

So I just got output that passes with no segfaults for the first time! Still analysing the results and figuring it out, but here’s a picture of the run for interested parties:

[image: output_pass]

It didn’t make an improvement at all… To aid in debugging for me and anyone who wants to take a crack at it, I’ve gutted tarpaulin, keeping only the ptrace and state-machine stuff, and added some code to push events into a timeline of what happens and plot it with gnuplot: https://github.com/xd009642/minitarp. It’s still very much a WIP, but here’s some example output with the futures project in tests/data/futures in the segfault-investigation branch.

Currently, the most interesting thing to me is why some threads appear for which I don’t appear to get a ptrace clone event 😕

[image: output]

So you can’t debug tarpaulin with gdb, because ptrace can’t trace ptrace. Instead, using the --debug flag on tarpaulin and the DWARF dumps/disassembly from objdump on the test binary, try to reconstruct the behaviour of the program and work out where there could be, essentially, threading conflicts (because the binary is multithreaded but ptrace gives you a single-threaded view into it).

Another route is working out how kcov and tarpaulin differ and seeing if the differences could be the root of the problem. The ptrace section of the developer wiki may be of help, as may all the previous comments in this thread.

So, as a slight update, I’ve added extra debug logging to help diagnose internal issues in tarpaulin, activated via the --debug flag (develop branch only for now). It can spew a ton of info, so anyone who wants to add anything to the issue should attach a file if it’s long, or post a link to a gist.

For all the examples on this issue that recreated it for me, I occasionally got a segfault and other times tarpaulin just ended up hanging and had to be killed. I’ve figured out what was causing it to hang! I was assuming anything that wasn’t a trap or a segfault was an ignorable/forwardable signal. The times tarpaulin was hanging were because of a SIGILL (illegal instruction).
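
To show the shape of the bug (a simplified sketch, not tarpaulin’s actual state machine), the signal dispatch was effectively a match where only traps and segfaults got special handling, so a SIGILL fell into the catch-all and the run never terminated:

use nix::sys::signal::Signal;
use nix::sys::wait::WaitStatus;

// Simplified illustration of the dispatch described above.
fn handle(status: WaitStatus) {
    match status {
        WaitStatus::Stopped(_, Signal::SIGTRAP) => {
            // breakpoint trap: collect coverage, restore the instruction, resume
        }
        WaitStatus::Stopped(_, Signal::SIGSEGV) => {
            // report the segfault and stop the run
        }
        WaitStatus::Stopped(_, Signal::SIGILL) => {
            // the case that was effectively missing: an illegal instruction
            // also needs to end the run instead of being forwarded forever
        }
        _ => {
            // everything else was forwarded to the test and waiting resumed,
            // which is where a SIGILL used to end up, hence the hang
        }
    }
}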

Hopefully, now that I’ve got an area to focus in on, it shouldn’t take me too long to resolve this!