simplecov: I am experiencing race condition while using SimpleCov with Parallel_tests

Repro: I’m using simplecov (0.13.0), ruby "2.3.3", parallel_tests (2.13.0)

# In spec_helper.rb:
if ENV['COVERAGE']
  require 'simplecov'
  process_name = "process##{Process.pid.to_s}:#{ENV['TEST_ENV_NUMBER'] || '1'}"
  puts "Starting Code Coverage in #{process_name}"
  SimpleCov.command_name process_name
  SimpleCov.merge_timeout 120 # Expire parallel process coverage cache after x seconds
  SimpleCov.start 'rails'
  SimpleCov.refuse_coverage_drop
  SimpleCov.maximum_coverage_drop 1
  SimpleCov.minimum_coverage 100
end

Ran:

COVERAGE=true RAILS_ENV=test bundle exec rake parallel:spec

Expected: The last of the 8 processes reports 100% coverage, which should make SimpleCov pass.

Actual: The result of the 3rd-last process was picked as the merged result!

...
Coverage report generated for process#1306:3, process#1311:7, process#1313:8, process#1317:4, process#1318:5, process#1319:6 to /Users/amin/api/coverage. 1713 / 1801 LOC (95.11%) covered.
Coverage report generated for process#1306:3, process#1309:, process#1311:7, process#1313:8, process#1315:2, process#1317:4, process#1318:5, process#1319:6 to /Users/amin/api/coverage. 1754 / 1801 LOC (97.39%) covered.
Coverage report generated for process#1306:3, process#1309:, process#1311:7, process#1313:8, process#1315:2, process#1317:4, process#1318:5, process#1319:6 to /Users/amin/api/coverage. 1801 / 1801 LOC (100.0%) covered.
Coverage (95.11%) is below the expected minimum coverage (100.00%).

287 examples, 0 failures

Took 35 seconds
Tests Failed

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 24

Most upvoted comments

SOLVED!

The SimpleCov race condition is definitely a bug. After 16 hours wasted, I came up with my own UNIX solution, which I will describe to anybody stuck with a parallel_test + SimpleCov mysterious failure problem:

  1. Your testing script should wipe previous coverage like:
rm -rf ./coverage
COVERAGE=true RAILS_ENV=test bundle exec rake parallel:spec
  1. Use the thread-safety of appending to Unix files to (a) count the number of processes starting, (b) count how many processes ended, and © whether your current ending process is the last process. Then if you’re the last process, see if you have reached the threshold:
# In spec_helper.rb
if ENV['COVERAGE']
  require 'simplecov'
  process_name = "process_#{ENV['TEST_ENV_NUMBER']}##{Process.pid.to_s}"
  SimpleCov.command_name process_name
  threshold = 100
  SimpleCov.minimum_coverage 0
  SimpleCov.start 'rails' do
    %x( mkdir -p ./coverage/ )
    %x( echo "#{process_name}" >> ./coverage/.started )
  end
  SimpleCov.at_exit do
    SimpleCov.result.format!
    actual = SimpleCov.result.covered_percent.to_i
    success = (actual >= threshold)
    status = success ? "DONE" : "..."
    %x( echo "#{process_name}: #{actual} of #{threshold} #{status}" >> ./coverage/.done )
    processes_started = %x[ cat ./coverage/.started | wc -l ].to_i
    processes_done = %x[ cat ./coverage/.done | wc -l ].to_i
    if (processes_started == processes_done)
      print "TOTAL COVERAGE: #{actual} of #{threshold} -"
      if success
        puts "\e[0;32m SUCCESS! \e[0m\n"
        exit 0
      else
        puts "\e[0;31m FAIL! \e[0m\n"
        exit 1
      end
    end
  end
end

Note, the minimum_coverage of 0 is important to trick SimpleCov to calculate coverage. But the our constant threshold is getting compared to actual only when the last process completes, and doesn’t rely in any way on SimpleCov buggy logic.

Result for 110% threshold, as expected:

TOTAL COVERAGE: 100 of 110 - FAIL! 

287 examples, 0 failures
Tests Failed

Result for 100% threshold, as expected:

TOTAL COVERAGE: 100 of 100 - SUCCESS! 

287 examples, 0 failures

You can take that to the bank! 😃

@PragTob closing this out as I’ve found the issue. The issue does not appear to be one with SimpleCov — at least, not directly. We finally tracked the issue down to the SimpleCov::ResultsMerger.results method — which iterates through the stored results and drops out those that don’t meet the merge timeout threshold.

What we saw was that the final process was dropping all but the most recent results from the merged results, and tracked that down to Time.now - result.created_at returning a wildly different value in the final process. In the end, the culprit was Timecop, and a test where the time-travel function was never reset. When running in our CI, that test suite was isolated from others, which meant for the final process, Time.now was travelled 2 weeks into the future.

In order to see this behavior we had to have several minor coincidences — we had an error in Timecop usage, that test got chunked into a parallel process that had no other Timecop tests (avoiding something else resetting it), and that chunk was reliably the last one to finish.

I don’t think there’s any change needed to SimpleCov to address this, except maybe a note in the README that any code that affects the value of Time.now (i.e. Timecop) will potentially cause incorrect coverage reports.

tl;dr messing with time causes unforeseen issues

In the last year since 2017, I’ve written many more tests, causing my processes to get loads more unevenly. My solution above does fail from time to time. I have found multiple underlying causes. For example, SimpleCov is like Schroeder’s cat in that if you look at “SimpleCov.result.covered_percent” at the wrong time, you might alter the outcome.

Also it seems like ParallelTests puts the most load on the last process. I’m now using this heuristic to avoid the race condition by having the last starting process be the one waiting for others. The following solution seems to run into LESS RACE CONDITIONS than my previous (upvoted) solution above, and it’s also less verbose and more elegant, though I won’t guarantee that it completely avoids race conditions:

if ENV['COVERAGE']
  require 'simplecov'
  SimpleCov.command_name "process_#{ENV['TEST_ENV_NUMBER']}##{Process.pid.to_s}"
  SimpleCov.start 'rails'
  SimpleCov.at_exit do
    if ParallelTests.last_process?
      ParallelTests.wait_for_other_processes_to_finish
      SimpleCov.result.format!
      actual = SimpleCov.result.covered_percent.round(2)
      print "TOTAL COVERAGE: #{actual}% -"
      if (actual == 100)
        puts "\e[0;32m SUCCESS! \e[0m\n"
      else
        puts "\e[0;31m FAIL! \e[0m\n"
        exit 1
      end
    end
  end
end

@wkirby thank you so much for looking into it so much 💚

That… does sound quite gruel overall as in hard to track down. I’d wish there was something I know we could do, maybe I’ll come up with something down the line but right now I can think of nothing off the top of my head.

Or well… maybe I do. We maybe use monotonic time. Hm.

I should leave one more comment for anybody looking at the history of PRs: using none of the past versions of SimpleCov solved this problem for me, whereas my own workaround did. So it’s not a regression bug; it’s a real bug that has never been fixed, at least for someone running a Macbook Pro, Rails 5, SimpleCov on Parallel_tests.

Double confirmation: My Travis Continuous Integration (running Unix) also is passing now, using the workaround I provided above.

Build use to take 10 minutes, now it takes 2 minutes!