cache: GitHub Action is stuck when trying to restore the cache
👋 I have encountered an issue while trying to restore the cache on GitHub Actions. (Sorry, I cannot post the repo as it is a private one.)
Here is some of the cache setup and a screenshot of the action stuck on cache restore.
with:
  path: ./vendor/bundle
  key: v1-bundle-${{ runner.OS }}-${{ hashFiles('.github/workflows/main.yml') }}-gemfile-${{ hashFiles('Gemfile.lock') }}
  restore-keys: |
    v1-bundle-${{ runner.OS }}-${{ hashFiles('.github/workflows/main.yml') }}-gemfile-${{ hashFiles('Gemfile.lock') }}
    v1-bundle-${{ runner.OS }}-${{ hashFiles('.github/workflows/main.yml') }}-
    v1-bundle-${{ runner.OS }}-
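For context, here is roughly how that `with:` block would sit inside a full cache step; the `uses:` version, step name, and `id` are assumptions rather than details from the report (the id `cache_bundler` is borrowed from a later comment in this thread):

```yaml
# Sketch only: action version, step name, and id are illustrative.
- name: Cache gems
  id: cache_bundler
  uses: actions/cache@v1
  with:
    path: ./vendor/bundle
    key: v1-bundle-${{ runner.OS }}-${{ hashFiles('.github/workflows/main.yml') }}-gemfile-${{ hashFiles('Gemfile.lock') }}
    restore-keys: |
      v1-bundle-${{ runner.OS }}-${{ hashFiles('.github/workflows/main.yml') }}-gemfile-${{ hashFiles('Gemfile.lock') }}
      v1-bundle-${{ runner.OS }}-${{ hashFiles('.github/workflows/main.yml') }}-
      v1-bundle-${{ runner.OS }}-
```

The restore-keys fall back from the most specific prefix to the least specific, so an older cache can still be restored when Gemfile.lock or the workflow file changes.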
As the screenshot shows, I had to stop the action after 15 minutes. From some past data points, the cache restore usually takes between 5 seconds and 1 minute, and the cache size is around 150 MB.
I hope the above are helpful. 🙏
About this issue
- State: closed
- Created 4 years ago
- Reactions: 6
- Comments: 16 (6 by maintainers)
Hey @joshmgross, I can also confirm that the cache freezes on parallel jobs that pull from the same cache that an upstream (depended-on) job has created.
Unfortunately the logs do not provide enough data. The last line is me cancelling the job after 11 minutes with no progress.
The interesting thing is that it does not always happen. It was quite stable for 3 weeks and then started again last week. I’ve changed workflows/main.yml to make a few extra checks, e.g. if: steps.cache_bundler.outputs.cache-hit != 'true', in case the cache tries to be updated while there is a cache hit (see the sketch below).
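A minimal sketch of what that guard looks like in a workflow step, assuming the cache step has id: cache_bundler and the gems are installed with Bundler (the step name and install command are assumptions):

```yaml
# Skip the install when the cache restore reported an exact key hit.
- name: Install gems
  if: steps.cache_bundler.outputs.cache-hit != 'true'
  run: bundle install --jobs 4 --path vendor/bundle
```

Note that cache-hit is only 'true' on an exact match of the primary key; a restore that comes in via restore-keys leaves it unset, so the install step still runs and the gems get topped up.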
I’ll post again if I have more findings. Thanks 😃

@joshmgross sure! I’ve only set up GitHub Actions for one repository, and all the Ruby tasks for that repository run multiple jobs in parallel that all read from the same cache. It’s to parallelize rspec tests in CI (which, when the caching is working, is amazing and cuts the total test time down to 25% of what it was before! 👍). I’ve run into this issue on maybe ~40% of the days I’ve run these tasks over the last week or two, so timeouts and latency degradation this severe have been a major recurring issue, not isolated to a single point in time.
I’ve shared some links and debug logs above, but let me know whatever other info would help y’all debug. Thanks! 😃
In the last ~30 minutes, I’ve seen some tasks successfully read the cache in 2 seconds, and others hit the 10-minute job timeout I added without ever being able to read the cache.
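For reference, a rough sketch of how that 10-minute limit can be expressed, assuming it is set at the job level with timeout-minutes (the job name and steps are placeholders):

```yaml
jobs:
  rspec:
    runs-on: ubuntu-latest
    # Cancel the whole job after 10 minutes instead of letting a stuck
    # cache restore hang indefinitely.
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v2
      # ... cache restore, install, and test steps ...
```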
I’ve also seen the step complete in <1 sec because the cache read failed, with the message
[warning]No scopes with read permission were found on the request.
This has happened across other tasks with different cache reads as well (e.g. node_modules caching). This is a better failure mode than just hanging until it hits the job timeout, but of course the cache being down for long periods of time is also a problem 😃 It seems like there must be a service outage or something, but there’s nothing shown at https://www.githubstatus.com/.
EDIT: I think the permissions warning is a separate issue, since I’m also seeing it on cache writes now, so I’m going to open that separately as https://github.com/actions/cache/issues/251. EDIT2: it turns out the root cause is https://github.com/actions/cache/issues/208#issuecomment-607877125.
V2 of actions/cache has multiple improvements to cache downloads, and we’ve fixed the issue with re-running pull requests.
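A minimal sketch of moving a step to v2; the path and key shown here are illustrative, not taken from any repo in this thread:

```yaml
# Point the step at the v2 major version of the action.
- uses: actions/cache@v2
  with:
    path: ./vendor/bundle
    key: v1-bundle-${{ runner.OS }}-gemfile-${{ hashFiles('Gemfile.lock') }}
```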
Cache reads are again stalling for >10 minutes and hitting the job timeout. Here’s another example of what I see with debug logs enabled (link). These are the last few lines:
Then I see this ~10 min later:
This was from a job where 8 similar cache reads run in parallel. Two timed out altogether, others took ~3min, and others took ~1min (which has been typical over the last week or two). For the jobs where the cache reads succeeded, this is what I see afterward (link):
I’ve seen other issues about the cache size continually growing (e.g. https://github.com/actions/cache/issues/204 for Maven, https://github.com/actions/cache/issues/201#issuecomment-596197755 for Ruby, and a Reddit thread: https://www.reddit.com/r/github/comments/fsdnv6/cache_wont_work_with_github_actions_and_ruby/). But even if that’s an issue, I’m not sure how it explains the extreme variability in latency we’re seeing.
I’m seeing the same kind of thing impacting both Ruby tasks, with latency spikes severe enough that performance ends up worse than if there were no caching at all.
This comes up most in a build where tasks are parallelized to run subsets of tests, with say 8 shards running in parallel (related issue with less activity: https://github.com/actions/cache/issues/189). In about half of the runs, one of the shards will take more time to set up the cache than it takes to run the tests, while the other 7 shards set up the cache in a few seconds as expected. The cached folder size is ~600 MB in this situation (I also see that https://github.com/actions/cache/pull/202, about changing the compression for cached objects, is still in progress).
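For context, a rough sketch of the kind of sharded setup described above, assuming a matrix strategy where every shard restores the same cache key; the action versions, key, and test-splitting command are assumptions:

```yaml
jobs:
  rspec:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # 8 shards, each running a subset of the test suite in parallel.
        shard: [0, 1, 2, 3, 4, 5, 6, 7]
    steps:
      - uses: actions/checkout@v2
      - id: cache_bundler
        uses: actions/cache@v1
        with:
          path: ./vendor/bundle
          key: v1-bundle-${{ runner.OS }}-gemfile-${{ hashFiles('Gemfile.lock') }}
      - if: steps.cache_bundler.outputs.cache-hit != 'true'
        run: bundle install --jobs 4 --path vendor/bundle
      # Hypothetical splitting script; the real mechanism isn't shown in the thread.
      - run: bin/run-shard ${{ matrix.shard }}
```

Every one of the 8 jobs hits the cache service for the same ~600 MB entry at roughly the same time, which is the scenario where the stalls show up.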
Here’s an example of expected response time (~20 seconds):
And a degraded state for the same cache read from another shard, over 4min (>10x the expected latency):
(EDIT1: A few days later, I’m seeing latency vary from 2s to hitting the 10m timeout I set. The cache size has also grown to 1.7GB!)
I’ve also previously seen that the cache read can time out within 30 seconds (it appears this happens when it gets an ECONNRESET upstream; see https://github.com/studentinsights/studentinsights/pull/2801/checks?check_run_id=564603328 for an example):
This is what I’d expect the worst-case failure of the cache service to look like. So perhaps there is some kind of timeout bug in the code doing cache reads, or the cache reads need additional monitoring that aborts when the download rate falls far below what’s expected. If the cache read action fails within 30 seconds, the job can fall back, which seems reasonable if, say, the caching service is down. But latency spikes this severe are worse than the service being down altogether, or than having no caching at all :\
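One way to approximate that fallback behavior today is to bound the cache step itself and keep the job going when it fails, so a stuck restore degrades into a cold install instead of eating the whole job timeout. A minimal sketch, with the timeout value, key, and install command as assumptions:

```yaml
- name: Restore gem cache
  id: cache_bundler
  uses: actions/cache@v1
  # Hard upper bound on the restore, and don't fail the job if it's exceeded.
  timeout-minutes: 2
  continue-on-error: true
  with:
    path: ./vendor/bundle
    key: v1-bundle-${{ runner.OS }}-gemfile-${{ hashFiles('Gemfile.lock') }}
- name: Install gems (cold path when the restore failed or missed)
  if: steps.cache_bundler.outputs.cache-hit != 'true'
  run: bundle install --jobs 4 --path vendor/bundle
```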
I’m doing some more runs with debugging enabled (thanks!), and reading up on https://github.com/actions/cache/issues/204 more now too, but happy to share any other info that would help debug on your end, thanks! 👍