actions-runner-controller: Intermittent permissions errors from multiple GHA runs on self-hosted runners

Describe the bug

Across multiple GHA workflow runs, we occasionally see permission errors when the worker attempts to clean up its /runner/_work directory.

2021-08-09T22:15:11.2384401Z [command]/usr/bin/git config --local --get remote.origin.url
2021-08-09T22:15:11.2413148Z https://github.com/org/repo
2021-08-09T22:15:11.2427272Z ##[group]Removing previously created refs, to avoid conflicts
2021-08-09T22:15:11.2431002Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD
2021-08-09T22:15:11.2462004Z HEAD
2021-08-09T22:15:11.2472613Z [command]/usr/bin/git rev-parse --symbolic-full-name --branches
2021-08-09T22:15:11.2503865Z ##[endgroup]
2021-08-09T22:15:11.2505886Z ##[group]Cleaning the repository
2021-08-09T22:15:11.2507411Z [command]/usr/bin/git clean -ffdx
2021-08-09T22:15:11.2604862Z warning: failed to remove .pytest_cache/v/cache/stepwise: Permission denied
2021-08-09T22:15:11.2606855Z warning: failed to remove .pytest_cache/v/cache/nodeids: Permission denied
2021-08-09T22:15:11.2608515Z warning: failed to remove .pytest_cache/README.md: Permission denied
2021-08-09T22:15:11.2609874Z warning: failed to remove .pytest_cache/.gitignore: Permission denied
...
2021-08-09T22:15:11.4287979Z ##[endgroup]
2021-08-09T22:15:11.4295434Z ##[warning]Unable to clean or reset the repository. The repository will be recreated instead.
2021-08-09T22:15:11.4310445Z Deleting the contents of '/runner/_work/org/repo'
2021-08-09T22:15:11.4321152Z ##[error]Command failed: rm -rf "/runner/_work/org/repo/.pytest_cache"
rm: cannot remove '/runner/_work/org/repo/.pytest_cache/v/cache/stepwise': Permission denied
rm: cannot remove '/runner/_work/org/repo/.pytest_cache/v/cache/nodeids': Permission denied
...

Checks

  • [x] My actions-runner-controller version (v0.x.y) does support the feature
  • I’m using an unreleased version of the controller I built from HEAD of the default branch

To Reproduce

Steps to reproduce the behavior:

  1. Create a self-hosted runner for a repo, preferably one with several jobs
  2. Trigger multiple jobs so they queue up
  3. At some point, at least for us, one of the jobs will fail with a permissions error, usually in the first step or two

Expected behavior

I would expect to be able to queue up these jobs and have them clean up after themselves without permission errors.

Environment (please complete the following information):

  • Controller Version = 0.18.2
  • Deployment Method = Kustomize

The example I was able to reproduce here shows a failure to clean up the pytest cache (.pytest_cache), but I’ve seen this error with a couple of other tools, namely the asdf plugin, so we don’t think it has anything to do with the pytest cache in particular. We’ve also seen the error in several of our repos, so it’s not isolated to the repo that produced these logs either.

It seems like the runners are, at least some of the time, not restarting with a clean slate. I’m having a heck of a time getting any more information that might be useful, but I will update this if I find out anything else. Please let me know if there’s anything you would like me to add to this issue that might help you track down the problem.
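
One way to gather more information when a runner hits this state is to list the leftover entries that the runner user probably cannot delete. The following is only a diagnostic sketch, not part of actions-runner-controller: the function name `find_unremovable` is hypothetical, and the default path is simply the one from the logs above. It relies on the fact that deleting an entry requires write and search permission on its parent directory, and it also flags entries owned by a different user as worth a look.

```python
import os


def find_unremovable(root):
    """Return (path, owner_uid, mode) for entries the current user may be unable to delete."""
    uid = os.geteuid()
    suspects = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Removing an entry needs write + search permission on its parent directory.
        parent_ok = os.access(dirpath, os.W_OK | os.X_OK)
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                # The entry cannot even be inspected; report it with unknown details.
                suspects.append((path, None, None))
                continue
            # Flag entries in unwritable directories and entries owned by someone else.
            if not parent_ok or st.st_uid != uid:
                suspects.append((path, st.st_uid, oct(st.st_mode & 0o777)))
    return suspects


if __name__ == "__main__":
    # Path taken from the logs above; adjust for your runner image.
    for path, owner, mode in find_unremovable("/runner/_work"):
        print(f"{path} owner_uid={owner} mode={mode}")
```

If the output shows entries owned by a different uid (for example 0), that would suggest an earlier job wrote into the work directory as another user, which is one plausible explanation for the `Permission denied` warnings.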

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 17 (4 by maintainers)

Most upvoted comments

Do you have any idea when we should expect this fix to be in place? I was thinking of putting a couple of initial steps in my workflows that would remove anything left over in /runner/_work and /home/runner/ before running the job, but if the race condition will be fixed soon enough I’ll just wait. Thanks again!
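
The pre-job cleanup idea in this comment could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a recommended fix: the helper name `remove_leftovers` is hypothetical, it only handles files the runner user owns (it restores owner write permission on directories first), and anything owned by another user is reported back rather than removed.

```python
import os
import stat


def remove_leftovers(root):
    """Delete everything inside root (keeping root itself); return paths that could not be removed."""
    failed = []

    def make_writable(path):
        # Only the owner (or root) can chmod; failures are reported later at removal time.
        try:
            mode = stat.S_IMODE(os.lstat(path).st_mode)
            os.chmod(path, mode | stat.S_IRWXU)
        except OSError:
            pass

    def attempt(remove, path):
        try:
            remove(path)
        except OSError:
            failed.append(path)

    # Pass 1 (top-down): make each real directory writable before os.walk descends
    # into it, because unlinking needs write permission on the parent directory.
    make_writable(root)
    for dirpath, dirnames, _ in os.walk(root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                make_writable(path)

    # Pass 2 (bottom-up): remove files first, then the emptied directories.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            attempt(os.unlink, os.path.join(dirpath, name))
        for name in dirnames:
            path = os.path.join(dirpath, name)
            # Symlinks to directories are unlinked, never followed.
            attempt(os.unlink if os.path.islink(path) else os.rmdir, path)
    return failed
```

A first workflow step might call `remove_leftovers("/runner/_work/org/repo")` and fail the job early if the returned list is non-empty. Pointing it at all of /home/runner/ is likely too broad, since that directory may hold files the runner itself needs; that part of the idea is left as an assumption to verify against your runner image.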

@jslusher FYI I saw the latest comment on https://github.com/actions/runner/pull/660 is saying September. Obviously we’re not GitHub employees so we have no idea if they will hit that date or if priorities will shift, but that’s the expectation they are setting atm.