cache: GitHub Action is stuck when trying to restore the cache
👋 I have encountered an issue while trying to restore the cache on GitHub Actions. (Sorry, I cannot post the repo as it is a private one.)
Here is some of the cache setup and a screenshot of the action stuck on cache restore.
with:
  path: ./vendor/bundle
  key: v1-bundle-${{ runner.OS }}-${{ hashFiles('.github/workflows/main.yml') }}-gemfile-${{ hashFiles('Gemfile.lock') }}
  restore-keys: |
    v1-bundle-${{ runner.OS }}-${{ hashFiles('.github/workflows/main.yml') }}-gemfile-${{ hashFiles('Gemfile.lock') }}
    v1-bundle-${{ runner.OS }}-${{ hashFiles('.github/workflows/main.yml') }}-
    v1-bundle-${{ runner.OS }}-
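For context, here is roughly how that `with:` block would sit inside a full cache step; the `uses:` version, step name, and `id` are assumptions rather than details from the report (the id `cache_bundler` is borrowed from a later comment in this thread):

```yaml
# Sketch only: action version, step name, and id are illustrative.
- name: Cache gems
  id: cache_bundler
  uses: actions/cache@v1
  with:
    path: ./vendor/bundle
    key: v1-bundle-${{ runner.OS }}-${{ hashFiles('.github/workflows/main.yml') }}-gemfile-${{ hashFiles('Gemfile.lock') }}
    restore-keys: |
      v1-bundle-${{ runner.OS }}-${{ hashFiles('.github/workflows/main.yml') }}-gemfile-${{ hashFiles('Gemfile.lock') }}
      v1-bundle-${{ runner.OS }}-${{ hashFiles('.github/workflows/main.yml') }}-
      v1-bundle-${{ runner.OS }}-
```

The restore-keys fall back from the most specific prefix to the least specific, so an older cache can still be restored when Gemfile.lock or the workflow file changes.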
As the screenshot shows, I had to stop the action after 15 minutes. From some past data points, the cache restore usually takes between 5 seconds and 1 minute, and the cache size is around 150 MB.
I hope the above are helpful. 🙏
About this issue
- State: closed
- Created 4 years ago
- Reactions: 6
- Comments: 16 (6 by maintainers)
Hey @joshmgross, I can also confirm that the cache freezes on parallel jobs that pull from the same cache that an upstream (depended-on) job has created.
Unfortunately the logs do not provide enough data. The last line is me cancelling the job after 11 minutes with no progress.
The interesting thing is that it does not always happen. It was quite stable for 3 weeks and then started again last week. I’ve changed workflows/main.yml to make a few extra checks, e.g. if: steps.cache_bundler.outputs.cache-hit != 'true', in case the cache tries to be updated while there is a cache hit (see the sketch below).
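A minimal sketch of what that guard looks like in a workflow step, assuming the cache step has id: cache_bundler and the gems are installed with Bundler (the step name and install command are assumptions):

```yaml
# Skip the install when the cache restore reported an exact key hit.
- name: Install gems
  if: steps.cache_bundler.outputs.cache-hit != 'true'
  run: bundle install --jobs 4 --path vendor/bundle
```

Note that cache-hit is only 'true' on an exact match of the primary key; a restore that comes in via restore-keys leaves it unset, so the install step still runs and the gems get topped up.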
I’ll post again if I have more findings. Thanks 😃

@joshmgross sure! I’ve only set up GitHub Actions for one repository, and all the Ruby tasks for that repository run multiple jobs in parallel that all read from the same cache. It’s to parallelize rspec tests in CI (which, when the caching is working, is amazing and cuts the total test time down to 25% of what it was before! 👍). I’ve run into this issue on maybe ~40% of the days I’ve run these tasks over the last week or two, so timeouts and latency degradation this severe have been a major recurring issue, not isolated to a single point in time.
I’ve shared some links and debug logs above, but let me know whatever other info would help y’all debug. Thanks! 😃
In the last ~30 minutes, I’ve seen some tasks successfully read the cache in 2 seconds, and others hit the 10-minute job timeout I added without ever being able to read the cache.
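For reference, a rough sketch of how that 10-minute limit can be expressed, assuming it is set at the job level with timeout-minutes (the job name and steps are placeholders):

```yaml
jobs:
  rspec:
    runs-on: ubuntu-latest
    # Cancel the whole job after 10 minutes instead of letting a stuck
    # cache restore hang indefinitely.
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v2
      # ... cache restore, install, and test steps ...
```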
I’ve also seen the step complete in <1 sec because the cache read failed, with the message
[warning]No scopes with read permission were found on the request.
This has happened across other tasks with different cache reads as well (e.g. node_modules caching). This is a better failure mode than just hanging until it hits the job timeout, but of course the cache being down for long periods of time is also a problem 😃 It seems like there must be a service outage or something, but there’s nothing shown at https://www.githubstatus.com/.
EDIT: I think the permissions warning is a separate issue, since I’m also seeing it on cache writes now, so I’m going to open that separately as https://github.com/actions/cache/issues/251. EDIT2: it turns out the root cause is https://github.com/actions/cache/issues/208#issuecomment-607877125.
V2 of actions/cache has multiple improvements to cache downloads, and we’ve fixed the issue with re-running pull requests.
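A minimal sketch of moving a step to v2; the path and key shown here are illustrative, not taken from any repo in this thread:

```yaml
# Point the step at the v2 major version of the action.
- uses: actions/cache@v2
  with:
    path: ./vendor/bundle
    key: v1-bundle-${{ runner.OS }}-gemfile-${{ hashFiles('Gemfile.lock') }}
```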
Cache reads are again stalling for >10 minutes and hitting the job timeout. Here’s another example of what I see with debug logs enabled (link). These are the last few lines:
Then I see this ~10 min later:
This was from a job where 8 similar cache reads run in parallel. Two timed out altogether, others took ~3min, and others took ~1min (which has been typical over the last week or two). For the jobs where the cache reads succeeded, this is what I see afterward (link):
I’ve seen other issues about the cache size continually growing (e.g. https://github.com/actions/cache/issues/204 for Maven, https://github.com/actions/cache/issues/201#issuecomment-596197755 for Ruby, and a Reddit thread: https://www.reddit.com/r/github/comments/fsdnv6/cache_wont_work_with_github_actions_and_ruby/). But even if that’s an issue, I’m not sure how it explains the extreme variability in latency we’re seeing.
I’m seeing the same kind of thing impacting both Ruby tasks, with latency spikes severe enough that performance ends up worse than if there were no caching at all.
This comes up most in a build where tasks are parallelized to run subsets of tests, with say 8 shards running in parallel (related issue with less activity: https://github.com/actions/cache/issues/189). In about half of the runs, one of the shards will take more time to set up the cache than it takes to run the tests, while the other 7 shards set up the cache in a few seconds as expected. The cached folder size is ~600 MB in this situation (I also see that https://github.com/actions/cache/pull/202, about changing the compression for cached objects, is still in progress).
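For context, a rough sketch of the kind of sharded setup described above, assuming a matrix strategy where every shard restores the same cache key; the action versions, key, and test-splitting command are assumptions:

```yaml
jobs:
  rspec:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # 8 shards, each running a subset of the test suite in parallel.
        shard: [0, 1, 2, 3, 4, 5, 6, 7]
    steps:
      - uses: actions/checkout@v2
      - id: cache_bundler
        uses: actions/cache@v1
        with:
          path: ./vendor/bundle
          key: v1-bundle-${{ runner.OS }}-gemfile-${{ hashFiles('Gemfile.lock') }}
      - if: steps.cache_bundler.outputs.cache-hit != 'true'
        run: bundle install --jobs 4 --path vendor/bundle
      # Hypothetical splitting script; the real mechanism isn't shown in the thread.
      - run: bin/run-shard ${{ matrix.shard }}
```

Every one of the 8 jobs hits the cache service for the same ~600 MB entry at roughly the same time, which is the scenario where the stalls show up.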
Here’s an example of expected response time (~20 seconds):
And a degraded state for the same cache read from another shard, over 4min (>10x the expected latency):
(EDIT1: A few days later, I’m seeing latency vary from 2s to hitting the 10m timeout I set. The cache size has also grown to 1.7GB!)
I’ve also previously seen that the cache read can time out within 30 seconds (it appears this happens when it gets an ECONNRESET upstream; see https://github.com/studentinsights/studentinsights/pull/2801/checks?check_run_id=564603328 for an example):
This is what I’d expect the worst-case failure of the cache service to look like. So perhaps there is some kind of timeout bug in the code doing cache reads, or the cache reads need additional monitoring that aborts when the download rate falls far below what’s expected. If the cache read action fails within 30 seconds, the job can fall back, which seems reasonable if, say, the caching service is down. But latency spikes this severe are worse than the service being down altogether, or than having no caching at all :\
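One way to approximate that fallback behavior today is to bound the cache step itself and keep the job going when it fails, so a stuck restore degrades into a cold install instead of eating the whole job timeout. A minimal sketch, with the timeout value, key, and install command as assumptions:

```yaml
- name: Restore gem cache
  id: cache_bundler
  uses: actions/cache@v1
  # Hard upper bound on the restore, and don't fail the job if it's exceeded.
  timeout-minutes: 2
  continue-on-error: true
  with:
    path: ./vendor/bundle
    key: v1-bundle-${{ runner.OS }}-gemfile-${{ hashFiles('Gemfile.lock') }}
- name: Install gems (cold path when the restore failed or missed)
  if: steps.cache_bundler.outputs.cache-hit != 'true'
  run: bundle install --jobs 4 --path vendor/bundle
```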
I’m doing some more runs with debugging enabled (thanks!), and reading up on https://github.com/actions/cache/issues/204 more now too, but happy to share any other info that would help debug on your end, thanks! 👍