coveragepy: Process hang with Coverage 6.3

Describe the bug

We’ve been having issues with our CI in GitHub Actions for the last few hours, and think it might be because of Coverage 6.3 - it’s the one thing that’s changed, and freezing it at 6.2 seems to allow runs to complete successfully.

To Reproduce

  1. What version of Python are you using? 3.7 seems to be the buggy one. We also run 3.8 - 3.10 in CI and have seen no issues with those.
  2. What version of coverage.py shows the problem? 6.2.3
  3. What versions of what packages do you have installed? See https://gist.github.com/tunetheweb/4d288ea4467ba74a66b3a0e2e8d5e4ea
  4. What code shows the problem? This is tricky, as we run a lot of commands in CI, but checking out https://github.com/sqlfluff/sqlfluff/ and running tox -e py37 -- -n 2 test should reproduce it. I'm having problems setting up a 3.7 environment but will try to get a better test case. We do use a multiprocessing-based process and noticed some changes in that area.
  5. What commands did you run?

Expected behavior

Additional context

Will try to get a better repro.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 24
  • Comments: 74 (22 by maintainers)

Most upvoted comments

This is now released as part of coverage 6.4.

I’ve made the SIGTERM handler opt-in, so these issues should now be fixed. Committed in 803a5494ef23187e920eeb4b42e922b87cda5966
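In released versions the opt-in is exposed as a setting in the [run] section of the coverage configuration (per the coverage.py configuration docs for 6.4+; check the docs for your exact version). A minimal sketch of re-enabling the handler:

```ini
# .coveragerc -- opt back in to writing coverage data on SIGTERM (coverage 6.4+)
[run]
sigterm = True
```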

Ironically, the merge request hung. But the rerun completed OK. So I concur with the above: it’s better, but not solved.

This is caused by running code coverage on a function that uses multiprocessing.Pool to fork worker processes.

The test suite hangs on this line of coverage, specifically:

".../env/lib/python3.8/site-packages/coverage/sqldata.py", line 234 in _wrapped

Child processes are unable to acquire the mutex: when they are forked, they inherit a copy of the already-held mutex, which is therefore never released (some info about why). Probably related to this change: “Feature: coverage measurement data will now be written when a SIGTERM signal is received by the process.”

Fixed in my case by any of: switching multiprocessing to the “spawn” start method instead of fork, reverting to version 6.2.0, or turning off code coverage.

Edit: Full stack trace of stalled thread

Current thread 0x00007f5675b14740 (most recent call first):
  File ".../env/lib/python3.8/site-packages/coverage/sqldata.py", line 234 in _wrapped
  File ".../env/lib/python3.8/site-packages/coverage/collector.py", line 473 in flush_data
  File ".../env/lib/python3.8/site-packages/coverage/control.py", line 766 in get_data
  File ".../env/lib/python3.8/site-packages/coverage/control.py", line 697 in save
  File ".../env/lib/python3.8/site-packages/coverage/control.py", line 601 in _atexit
  File ".../env/lib/python3.8/site-packages/coverage/control.py", line 605 in _on_sigterm
  File ".../env/lib/python3.8/site-packages/coverage/sqldata.py", line 1156 in executescript
  File ".../env/lib/python3.8/site-packages/coverage/sqldata.py", line 299 in _init_db
  File ".../env/lib/python3.8/site-packages/coverage/sqldata.py", line 273 in _read_db
  File ".../env/lib/python3.8/site-packages/coverage/sqldata.py", line 264 in _open_db
  File ".../env/lib/python3.8/site-packages/coverage/sqldata.py", line 313 in _connect
  File ".../env/lib/python3.8/site-packages/coverage/sqldata.py", line 518 in _choose_lines_or_arcs
  File ".../env/lib/python3.8/site-packages/coverage/sqldata.py", line 459 in add_lines
  File ".../env/lib/python3.8/site-packages/coverage/sqldata.py", line 236 in _wrapped
  File ".../env/lib/python3.8/site-packages/coverage/collector.py", line 473 in flush_data
  File ".../env/lib/python3.8/site-packages/coverage/control.py", line 766 in get_data
  File ".../env/lib/python3.8/site-packages/coverage/control.py", line 697 in save
  File ".../env/lib/python3.8/site-packages/pytest_cov/embed.py", line 85 in _cleanup
  File ".../env/lib/python3.8/site-packages/pytest_cov/embed.py", line 99 in cleanup
  File "/usr/lib/python3.8/multiprocessing/util.py", line 224 in __call__
  File "/usr/lib/python3.8/multiprocessing/util.py", line 300 in _run_finalizers
  File "/usr/lib/python3.8/multiprocessing/util.py", line 334 in _exit_function
  File "/usr/lib/python3.8/multiprocessing/process.py", line 318 in _bootstrap
  File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 75 in _launch
  File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 19 in __init__
  File "/usr/lib/python3.8/multiprocessing/context.py", line 277 in _Popen
  File "/usr/lib/python3.8/multiprocessing/process.py", line 121 in start
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 326 in _repopulate_pool_static
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 303 in _repopulate_pool
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 212 in __init__
  File "/usr/lib/python3.8/multiprocessing/context.py", line 119 in Pool
  File ".../delphi_safegraph_patterns/run.py", line 108 in run_module
  File ".../tests/test_run.py", line 36 in test_output_files
  File "/usr/lib/python3.8/unittest/mock.py", line 1325 in patched
  File ".../env/lib/python3.8/site-packages/_pytest/python.py", line 183 in pytest_pyfunc_call
  File ".../env/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File ".../env/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File ".../env/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File ".../env/lib/python3.8/site-packages/_pytest/python.py", line 1641 in runtest
  File ".../env/lib/python3.8/site-packages/_pytest/runner.py", line 162 in pytest_runtest_call
  File ".../env/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File ".../env/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File ".../env/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File ".../env/lib/python3.8/site-packages/_pytest/runner.py", line 255 in <lambda>
  File ".../env/lib/python3.8/site-packages/_pytest/runner.py", line 311 in from_call
  File ".../env/lib/python3.8/site-packages/_pytest/runner.py", line 254 in call_runtest_hook
  File ".../env/lib/python3.8/site-packages/_pytest/runner.py", line 215 in call_and_report
  File ".../env/lib/python3.8/site-packages/_pytest/runner.py", line 126 in runtestprotocol
  File ".../env/lib/python3.8/site-packages/_pytest/runner.py", line 109 in pytest_runtest_protocol
  File ".../env/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File ".../env/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File ".../env/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File ".../env/lib/python3.8/site-packages/_pytest/main.py", line 348 in pytest_runtestloop
  File ".../env/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File ".../env/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File ".../env/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File ".../env/lib/python3.8/site-packages/_pytest/main.py", line 323 in _main
  File ".../env/lib/python3.8/site-packages/_pytest/main.py", line 269 in wrap_session
  File ".../env/lib/python3.8/site-packages/_pytest/main.py", line 316 in pytest_cmdline_main
  File ".../env/lib/python3.8/site-packages/pluggy/_callers.py", line 39 in _multicall
  File ".../env/lib/python3.8/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File ".../env/lib/python3.8/site-packages/pluggy/_hooks.py", line 265 in __call__
  File ".../env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 162 in main
  File ".../env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 185 in console_main
  File "../env/bin/pytest", line 8 in <module>

Thanks for the fix! Which release will this be in? 🙏

This will become a 6.4 release, though I’m not sure when. It would be great if people could do a test with the commit from GitHub:

pip install git+https://github.com/nedbat/coveragepy.git@803a5494ef23#egg=coverage

(This will claim a version of 6.3.4a0, which is fine.) If something still seems amiss, please open a new issue.

If it helps, it seems that this “patch” fixes the “bug”, or at least I can’t reproduce the deadlock after applying it:

$ diff -u venv/lib/python3.10/site-packages/coverage/control.py.old venv/lib/python3.10/site-packages/coverage/control.py
--- venv/lib/python3.10/site-packages/coverage/control.py.old	2022-02-24 16:23:00.923351828 -0500
+++ venv/lib/python3.10/site-packages/coverage/control.py	2022-02-24 16:23:13.347352118 -0500
@@ -537,7 +537,7 @@
         # Register our clean-up handlers.
         atexit.register(self._atexit)
         is_main = (threading.current_thread() == threading.main_thread())
-        if is_main and not env.WINDOWS:
+        if False and is_main and not env.WINDOWS:
             # The Python docs seem to imply that SIGTERM works uniformly even
             # on Windows, but that's not my experience, and this agrees:
             # https://stackoverflow.com/questions/35772001/x/35792192#35792192

Seems to work for my simple test case. Thanks for being so responsive on this!

Yes, I second that; this type of problem is probably one of the worst things to debug. Thanks again!

I’ve released coverage 6.3.1 with the RLock fix. I’m not sure what to do with the rest of this issue…
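The RLock fix addresses the re-entrancy half of the problem: the stack trace above shows the SIGTERM handler re-entering _wrapped while the same thread already holds the lock. A plain threading.Lock deadlocks when the holding thread tries to acquire it again, while an RLock permits re-entry by its owner. A small illustration (a timeout stands in for the deadlock so the example terminates):

```python
import threading

plain = threading.Lock()
plain.acquire()
# A second acquire of a plain Lock by the same thread would block forever;
# the timeout lets us observe the failure instead of hanging.
reacquired = plain.acquire(timeout=0.1)
print(reacquired)  # False
plain.release()

reentrant = threading.RLock()
reentrant.acquire()
# An RLock tracks its owning thread, so the same thread may acquire it again.
nested = reentrant.acquire(blocking=False)
print(nested)  # True
reentrant.release()
reentrant.release()
```

That is why 6.3.1 helped the signal-handler-re-enters-locked-code case but could not fix the fork-inherits-a-held-lock case, where the blocked acquirer is a different process entirely.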

Here’s a pretty minimal test case that I cobbled together:

process.py:

#!/usr/bin/env python3

import multiprocessing

def calculate(num):
    # CPU-bound busy work: repeatedly double the number to keep each worker occupied.
    for _ in range(100000):
        num += num
    return num

def process(jobs=16, numbers=64):
    with multiprocessing.Pool(jobs) as pool:
        orig_numbers = range(numbers)
        new_numbers = pool.map(calculate, orig_numbers)
    return new_numbers

if __name__ == '__main__':
    numbers = process()
    print(numbers)

test_process.py:

import pytest
import process

def test_process():
    numbers = process.process()
    assert len(numbers) == 64

Install the dependencies (in a venv): pip install pytest pytest-cov coverage==6.3

Run the test with:

pytest --cov=./ test_process.py

On a 2-core machine (Ubuntu 20.04 amd64, Python 3.8.10) this hangs maybe 50% of the time; when it doesn’t hang, it finishes in under 10 seconds. With coverage 6.2 it works every time.

@osma yes, a local reproduction case would be very helpful.

The SQLFluff build runs many Python versions concurrently, as well as running the dbt tests concurrently with the core tests. I don’t know whether these all run on the same server or on separate ones, but perhaps this contributes to the problem as well.

They are on separate machines. Also I did a run with the others turned off and it still hung.

I’m trying some proposed fixes. They reduce the likelihood of hangs, but it’s not 100%: https://github.com/nedbat/sqlfluff/actions

With SQLFluff, a notable difference I see between the local example and the CI config is the inclusion of pytest-xdist via the -n 2 flag.

Yes, I appreciate this is a very poor issue to raise. Sorry about that, but I thought I’d give you a heads-up, as I saw it right after the release, and it stopped happening right after I pinned to the old version.

Will try to give a more meaningful reproducible use case if I can narrow it down.

The issue is jobs hanging (I cancelled them after an hour, when they normally take 5 mins; one job I left running took 3 hours and counting). That was on Python 3.7. I also saw a lot of slowdown: 5-minute jobs were more often than not taking 20 mins when they did complete. That was on Python 3.8+.

Initially I thought GitHub Actions was on the blink, but as soon as I pinned the old version of coverage it all worked. Unpinning it breaks it again. Nothing else changes between the runs.

But as to narrowing it down why, at the moment I don’t have more info to help explain that. But will keep digging…

I think the error message is a red herring. It was the last thing printed before the freeze. I do see that message elsewhere, though interestingly only at the end of good runs, whereas here it appeared when our CI job was only at 90% and then hung.

Sorry again for such a poor report. You can close it if you want until I can get you more info, as I appreciate it’s difficult to do anything with what I’ve written here. As I say, this was more of a heads-up, and a vague hope that someone would have an idea what it might be 😞