coveragepy: Process hang with Coverage 6.3
Describe the bug
We’ve been having issues with our CI in GitHub Actions for the last few hours, and think it might be because of Coverage 6.3 - it’s the one thing that’s changed, and freezing it at 6.2 seems to allow runs to complete successfully.
To Reproduce
- What version of Python are you using? 3.7 seems to be a bit buggy with this; we also run 3.8–3.10 in CI and haven’t seen the issue there.
- What version of coverage.py shows the problem? The output of `coverage debug sys` is helpful. 6.3
- What versions of what packages do you have installed? The output of `pip freeze` is helpful. See https://gist.github.com/tunetheweb/4d288ea4467ba74a66b3a0e2e8d5e4ea
- What code shows the problem? Give us a specific commit of a specific repo that we can check out. If you’ve already worked around the problem, please provide a commit before that fix. This is tricky. We run a lot of commands in CI, but checking out https://github.com/sqlfluff/sqlfluff/ and running `tox -e py37 -- -n 2 test` should reproduce it. I’m having problems setting up a 3.7 environment but will try to get a better test case. We do use a multithreaded process and noticed some changes to that.
- What commands did you run?
Expected behavior
Additional context
Will try to get a better repro.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 24
- Comments: 74 (22 by maintainers)
Links to this issue
Commits related to this issue
- Pin coverage at 6.2 for the moment https://github.com/nedbat/coveragepy/issues/1310 — committed to vertexproject/synapse by vEpiphyte 2 years ago
- Pin coverage==6.2 to fix tests See https://github.com/nedbat/coveragepy/issues/1310 and https://github.com/nedbat/coveragepy/issues/1312. — committed to quiltdata/quilt by sir-sigurd 2 years ago
- Pin coverage==6.2 to fix tests (#2634) See https://github.com/nedbat/coveragepy/issues/1310 and https://github.com/nedbat/coveragepy/issues/1312. — committed to quiltdata/quilt by sir-sigurd 2 years ago
- Test a coverage fix proposed from https://github.com/nedbat/coveragepy/issues/1310 https://github.com/nedbat/coveragepy/issues/1312 — committed to vertexproject/synapse by vEpiphyte 2 years ago
- pin coverage to 6.2 Fixes https://github.com/dask/distributed/issues/5712 see https://github.com/nedbat/coveragepy/issues/1307 and https://github.com/nedbat/coveragepy/issues/1310 — committed to graingert/distributed by graingert 2 years ago
- Pin coverage to version 6.2 to fix CI hanging See https://github.com/nedbat/coveragepy/issues/1310 — committed to maxnoe/gammapy by maxnoe 2 years ago
- Pin coverage to 6.2 in CI due to nedbat/coveragepy#1310 — committed to cta-observatory/cta-lstchain by maxnoe 2 years ago
- Exclude coverage 6.3 in CI due to nedbat/coveragepy#1310 — committed to cta-observatory/ctapipe by maxnoe 2 years ago
- Exclude coverage 6.3 in CI due to nedbat/coveragepy#1310 — committed to cta-observatory/cta-lstchain by maxnoe 2 years ago
- Merge pull request #889 from cta-observatory/fix_ci_hangs_coverage Pin coverage to 6.2 in CI due to nedbat/coveragepy#1310 — committed to cta-observatory/cta-lstchain by rlopezcoto 2 years ago
- reset CoverageData._lock at fork might help with #1310 — committed to graingert/coveragepy by graingert 2 years ago
- reset CoverageData._lock at fork might help with #1310 — committed to graingert/coveragepy by graingert 2 years ago
- Merge pull request #1830 from cta-observatory/fix_ci_hangs_coverage Exclude coverage 6.3 in CI due to nedbat/coveragepy#1310 — committed to cta-observatory/ctapipe by maxnoe 2 years ago
- Fix the CI builds Force coverage to < 6.3 to prevent https://github.com/nedbat/coveragepy/issues/1310 — committed to rosswhitfield/IPS-framework by rosswhitfield 2 years ago
- Exclude "Coverage" 6.3. References nedbat/coveragepy#1310 — committed to colour-science/colour by KelSolaar 2 years ago
- fix: use a re-entrant lock to avoid self-deadlocking #1310 — committed to nedbat/coveragepy by nedbat 2 years ago
- fix: use a re-entrant lock to avoid self-deadlocking #1310 — committed to nedbat/coveragepy by nedbat 2 years ago
- reset CoverageData._lock at fork might help with #1310 (cherry picked from commit 4a42e487bb91fb20711c2eec7bce1b17b81995da) — committed to nedbat/coveragepy by graingert 2 years ago
- Temporarily pin `coverage` https://github.com/nedbat/coveragepy/issues/1310 — committed to apriha/snps by apriha 2 years ago
- Skip coverage 6.3.0 since it blocks tests See https://github.com/pytest-dev/pytest-cov/issues/520 and https://github.com/nedbat/coveragepy/issues/1310 — committed to BoboTiG/ebook-reader-dict by BoboTiG 2 years ago
This is now released as part of coverage 6.4.
I’ve made the SIGTERM handler opt-in, so these issues should now be fixed. Committed in 803a5494ef23187e920eeb4b42e922b87cda5966
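For reference, a minimal sketch of what the opt-in looks like in configuration, assuming the `sigterm` option under `[run]` described for the 6.4 release (check the coverage.py docs for the exact spelling and default):

```ini
# .coveragerc (sketch, assuming coverage.py 6.4's opt-in SIGTERM setting)
[run]
# The SIGTERM data-dump handler is no longer installed by default;
# enable it explicitly only if you need data written when the process is terminated.
sigterm = true
```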
@benmwebb @anishathalye @BoboTiG @JannisNe @pietrodn @osma @apriha @pllim @KelSolaar @rosswhitfield @JoanFM @erykoff @haampie @Bultako @maxnoe @rlopezcoto @glemaitre @nmdefries @QuLogic @akihironitta @sir-sigurd @cliffckerr Would you try coverage 6.3.1 to see if it fixes your problems?
Ironically, the merge request hung, but the rerun completed OK. So I concur with the above that it’s better, but not solved.
This is caused by running code coverage on a function that uses `multiprocessing.Pool` to fork worker processes. The test suite hangs on this line of `coverage`, specifically: the forked children are unable to acquire an available mutex; when they are created, they are given a copy of an (unavailable) mutex whose state is never updated (some info about why). Probably related to this change: “Feature: coverage measurement data will now be written when a SIGTERM signal is received by the process.”
Fixed in my case by any of: using `multiprocessing` with the “spawn” start method instead of forking, reverting to version 6.2.0, or turning off code coverage.
Edit: full stack trace of the stalled thread
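As a rough illustration of the failure mode and the “spawn” workaround described above (this is a hand-written sketch, not coverage.py’s actual locking code; `child_work` and the pool size are made up for the example):

```python
# Sketch of the fork/lock interaction described above, not coverage.py's code.
import multiprocessing
import threading

lock = threading.Lock()

def child_work(x):
    # In a fork-started child, this module-level lock is a copy of the parent's
    # lock. If the parent held it at fork time, the copy stays "locked" forever
    # and this acquire never returns -- the hang described in this issue.
    with lock:
        return x * x

if __name__ == "__main__":
    # Workaround reported above: the "spawn" start method gives children a
    # fresh interpreter (and a fresh, unlocked lock) instead of a forked copy.
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(2) as pool:
        print(pool.map(child_work, range(4)))
```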
Thanks for the fix! Which release will this be in? 🙏
This will become a 6.4 release, though I’m not sure when. It would be great if people could do a test with the commit from GitHub:
(this will claim a version of 6.3.4a0, which is fine.) If something still seems amiss, please open a new issue.
If it helps: this “patch” seems to fix the “bug”, or at least I can’t reproduce the deadlock after applying it.
Yes, I second that; this type of problem is probably one of the worst things to debug. Thanks again!
Seems to work for my simple test case. Thanks for being so responsive on this!
I’ve released coverage 6.3.1 with the RLock fix. I’m not sure what to do with the rest of this issue…
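To illustrate why a re-entrant lock matters here: the 6.3 SIGTERM handler could fire while the same thread already held coverage’s data lock, and a plain `Lock` cannot be re-acquired by its owner. A minimal, POSIX-only sketch, assuming that shape of self-deadlock (the function names are invented for illustration):

```python
# Self-deadlock sketch: a signal handler re-entering a lock held by the same
# thread. With threading.Lock() this hangs; with threading.RLock() it completes.
import os
import signal
import threading
import time

data_lock = threading.RLock()  # swap in threading.Lock() to see the hang

def flush_data():
    with data_lock:
        print("flushing measurement data")

def handle_sigterm(signum, frame):
    # Python runs this in the main thread, possibly while that same thread
    # is already inside a `with data_lock:` block.
    flush_data()

if __name__ == "__main__":
    signal.signal(signal.SIGTERM, handle_sigterm)
    with data_lock:
        os.kill(os.getpid(), signal.SIGTERM)
        time.sleep(0.5)  # let the handler run while the lock is still held
    print("done")
```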
Here’s a pretty minimal test case that I cobbled together:
process.py:
test_process.py:
Install the dependencies (in a venv): `pip install pytest pytest-cov coverage==6.3`
Run the test with:
On a 2-core machine (Ubuntu 20.04 amd64, Python 3.8.10) this hangs maybe 50% of the time. If it doesn’t, it will finish in less than 10 seconds. With coverage 6.2, it works every time.
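The attached `process.py` and `test_process.py` aren’t reproduced above. Based on the description in this thread (a `multiprocessing.Pool` exercised from a pytest test under coverage), a stand-in of roughly this shape could be used; treat the contents and the run command as illustrative guesses, not the original files:

```python
# process.py (illustrative stand-in, not the original attachment)
import multiprocessing

def square(x):
    return x * x

def run_pool():
    # A fork-started Pool running under coverage is where the hang was reported.
    with multiprocessing.Pool(2) as pool:
        return pool.map(square, range(100))
```

```python
# test_process.py (illustrative stand-in, not the original attachment)
from process import run_pool

def test_run_pool():
    assert run_pool()[3] == 9
```

Running it with something like `pytest --cov=process test_process.py` (the pytest-cov flag) would exercise the pool under coverage.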
@osma yes, a local reproduction case would be very helpful.
They are on separate machines. Also I did a run with the others turned off and it still hung.
I’m trying some proposed fixes. They reduce the likelihood of hangs, but it’s not 100%: https://github.com/nedbat/sqlfluff/actions
With sqlfluff, a notable difference I see between the local example and the CI config is the inclusion of pytest-xdist via the `-n 2` flag.
Yes, I appreciate this is a very poor issue to raise. Sorry about that, but I thought I’d give you a heads-up as I saw it right after the release, and it stopped happening right after I pinned to the old version.
Will try to give a more meaningful reproducible use case if I can narrow it down.
The issue is jobs hanging (I cancelled them after an hour, when they normally take 5 minutes; one job I left running took 3 hours and counting). That was for Python 3.7. I also saw a lot of slowdown in jobs (5-minute jobs were more often than not taking 20 minutes when they did complete). That was for Python 3.8+.
Initially I thought GitHub Actions were on the blink, but as soon as I pinned the old version of coverage it all worked. Unpinning it again breaks it again. Nothing else changes between the runs.
But as to narrowing down why, I don’t have more info at the moment to help explain it. I will keep digging…
I think the error message is a red herring. It was the last thing printed before it froze, but I do see it at the end of runs elsewhere (though interestingly I only see it at the end of good runs, whereas here it appeared when our CI job was only at 90% and then hung).
Sorry again for such a poor report. You can close it if you want until I can get you more info, as I appreciate it’s difficult to do anything with what I’ve written here; but as I say, this was more of a heads-up, and a vague hope someone would have an idea what it might be 😞