simplecov: I am experiencing race condition while using SimpleCov with Parallel_tests
Repro:
I’m using simplecov (0.13.0), ruby "2.3.3", parallel_tests (2.13.0)
# In spec_helper.rb:
if ENV['COVERAGE']
require 'simplecov'
process_name = "process##{Process.pid.to_s}:#{ENV['TEST_ENV_NUMBER'] || '1'}"
puts "Starting Code Coverage in #{process_name}"
SimpleCov.command_name process_name
SimpleCov.merge_timeout 120 # Expire parallel process coverage cache after x seconds
SimpleCov.start 'rails'
SimpleCov.refuse_coverage_drop
SimpleCov.maximum_coverage_drop 1
SimpleCov.minimum_coverage 100
end
Ran:
COVERAGE=true RAILS_ENV=test bundle exec rake parallel:spec
Expected: The last of the 8 processes reports 100% coverage, which should make SimpleCov pass.
Actual: The result of the 3rd-last process was picked as the merged result!
...
Coverage report generated for process#1306:3, process#1311:7, process#1313:8, process#1317:4, process#1318:5, process#1319:6 to /Users/amin/api/coverage. 1713 / 1801 LOC (95.11%) covered.
Coverage report generated for process#1306:3, process#1309:, process#1311:7, process#1313:8, process#1315:2, process#1317:4, process#1318:5, process#1319:6 to /Users/amin/api/coverage. 1754 / 1801 LOC (97.39%) covered.
Coverage report generated for process#1306:3, process#1309:, process#1311:7, process#1313:8, process#1315:2, process#1317:4, process#1318:5, process#1319:6 to /Users/amin/api/coverage. 1801 / 1801 LOC (100.0%) covered.
Coverage (95.11%) is below the expected minimum coverage (100.00%).
287 examples, 0 failures
Took 35 seconds
Tests Failed
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 24
SOLVED!
The
SimpleCovrace condition is definitely a bug. After 16 hours wasted, I came up with my own UNIX solution, which I will describe to anybody stuck with aparallel_test+SimpleCovmysterious failure problem:Note, the
minimum_coverageof0is important to trick SimpleCov to calculate coverage. But the our constantthresholdis getting compared toactualonly when the last process completes, and doesn’t rely in any way on SimpleCov buggy logic.Result for 110% threshold, as expected:
Result for 100% threshold, as expected:
You can take that to the bank! 😃
@PragTob closing this out as I’ve found the issue. The issue does not appear to be one with
SimpleCov— at least, not directly. We finally tracked the issue down to theSimpleCov::ResultsMerger.resultsmethod — which iterates through the stored results and drops out those that don’t meet the merge timeout threshold.What we saw was that the final process was dropping all but the most recent results from the merged results, and tracked that down to
Time.now - result.created_atreturning a wildly different value in the final process. In the end, the culprit wasTimecop, and a test where the time-travel function was never reset. When running in our CI, that test suite was isolated from others, which meant for the final process,Time.nowwas travelled 2 weeks into the future.In order to see this behavior we had to have several minor coincidences — we had an error in
Timecopusage, that test got chunked into a parallel process that had no otherTimecoptests (avoiding something else resetting it), and that chunk was reliably the last one to finish.I don’t think there’s any change needed to
SimpleCovto address this, except maybe a note in theREADMEthat any code that affects the value ofTime.now(i.e.Timecop) will potentially cause incorrect coverage reports.tl;dr messing with time causes unforeseen issues
In the last year since 2017, I’ve written many more tests, causing my processes to get loads more unevenly. My solution above does fail from time to time. I have found multiple underlying causes. For example, SimpleCov is like Schroeder’s cat in that if you look at “SimpleCov.result.covered_percent” at the wrong time, you might alter the outcome.
Also it seems like ParallelTests puts the most load on the last process. I’m now using this heuristic to avoid the race condition by having the last starting process be the one waiting for others. The following solution seems to run into LESS RACE CONDITIONS than my previous (upvoted) solution above, and it’s also less verbose and more elegant, though I won’t guarantee that it completely avoids race conditions:
@wkirby thank you so much for looking into it so much 💚
That… does sound quite gruel overall as in hard to track down. I’d wish there was something I know we could do, maybe I’ll come up with something down the line but right now I can think of nothing off the top of my head.
Or well… maybe I do. We maybe use monotonic time. Hm.
I should leave one more comment for anybody looking at the history of PRs: using none of the past versions of SimpleCov solved this problem for me, whereas my own workaround did. So it’s not a regression bug; it’s a real bug that has never been fixed, at least for someone running a Macbook Pro, Rails 5, SimpleCov on Parallel_tests.
Double confirmation: My Travis Continuous Integration (running Unix) also is passing now, using the workaround I provided above.
Build use to take 10 minutes, now it takes 2 minutes!