sccache: GitHub Actions cache is surprisingly slow

I was excited to see the introduction of the GitHub Actions cache backend. However, when I tried it on a test branch, the results were surprisingly slow.

See the initial run at https://github.com/indygreg/apple-platform-rs/actions/runs/3723171428/jobs/6314513850. Here, a build job took ~1 hour (it normally takes ~10m). The Stop sccache step displays the sccache server stats; this particular job shows 44 cache errors. I'm unsure whether those errors could be the cause of the problem.

I did another CI run with logging cranked up to debug and the log file printed after the build. See e.g. https://github.com/indygreg/apple-platform-rs/actions/runs/3723354206/jobs/6314814370 for a slow build. Interestingly, some of the jobs in this run were fast! See https://github.com/indygreg/apple-platform-rs/actions/runs/3723354206/jobs/6314814955 for a fast one. Spot-checking the logs, I think there is a positive correlation between the cache error count and job time.

Poking around the logs, I see occurrences of DEBUG sccache::compiler::compiler] [aws_http]: Cache write error: Cache service responded with 429 Too Many Requests: {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}. So I guess my project is too big to use GitHub Actions cache???

More worrisome, though, is that some cache writes take 10+ seconds, e.g.:

2022-12-18T05:13:25.3938134Z [2022-12-18T05:08:46Z DEBUG hyper::client::pool] reuse idle connection for ("https", artifactcache.actions.githubusercontent.com)
2022-12-18T05:13:25.3938390Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::io] flushed 2318 bytes
2022-12-18T05:13:25.3938682Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::io] parsed 18 headers
2022-12-18T05:13:25.3939035Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::conn] incoming body is chunked encoding
2022-12-18T05:13:25.3939330Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::decode] incoming chunked header: 0x3 (3 bytes)
2022-12-18T05:13:25.3939686Z [2022-12-18T05:08:46Z WARN  reqwest_retry::middleware] Retry attempt #1. Sleeping 53.097849ms before the next attempt
2022-12-18T05:13:25.3940103Z [2022-12-18T05:08:46Z DEBUG hyper::client::pool] reuse idle connection for ("https", artifactcache.actions.githubusercontent.com)
2022-12-18T05:13:25.3940353Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::io] flushed 2318 bytes
2022-12-18T05:13:25.3940598Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::io] parsed 18 headers
2022-12-18T05:13:25.3940878Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::conn] incoming body is chunked encoding
2022-12-18T05:13:25.3941175Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::decode] incoming chunked header: 0x3 (3 bytes)
2022-12-18T05:13:25.3941508Z [2022-12-18T05:08:46Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; otel.status_code="ERROR"
2022-12-18T05:13:25.3941845Z [2022-12-18T05:08:46Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.status_code=429
2022-12-18T05:13:25.3942174Z [2022-12-18T05:08:46Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.user_agent=""
2022-12-18T05:13:25.3942478Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::decode] incoming chunked header: 0x138 (312 bytes)
2022-12-18T05:13:25.3942793Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::conn] incoming body completed
2022-12-18T05:13:25.3943218Z [2022-12-18T05:08:46Z DEBUG hyper::client::pool] pooling idle connection for ("https", artifactcache.actions.githubusercontent.com)
2022-12-18T05:13:25.3944403Z [2022-12-18T05:08:46Z DEBUG sccache::compiler::compiler] [aws_sdk_s3]: Cache write error: Cache service responded with 429 Too Many Requests: {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}
2022-12-18T05:13:25.3944708Z [2022-12-18T05:08:46Z DEBUG sccache::server] [aws_sdk_s3]: Cache write finished in 21.388 s
2022-12-18T05:13:25.3945012Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::decode] incoming chunked header: 0x138 (312 bytes)
2022-12-18T05:13:25.3945338Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::decode] incoming chunked header: 0x138 (312 bytes)

And

2022-12-18T05:13:25.3908847Z [2022-12-18T05:08:38Z DEBUG hyper::client::pool] pooling idle connection for ("https", artifactcache.actions.githubusercontent.com)
2022-12-18T05:13:25.3909181Z [2022-12-18T05:08:38Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.status_code=204
2022-12-18T05:13:25.3909491Z [2022-12-18T05:08:38Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.user_agent=""
2022-12-18T05:13:25.3909910Z [2022-12-18T05:08:38Z DEBUG sccache::compiler::compiler] [aws_config]: Stored in cache successfully!
2022-12-18T05:13:25.3910204Z [2022-12-18T05:08:38Z DEBUG sccache::server] [aws_config]: Cache write finished in 13.018 s
2022-12-18T05:13:25.3910535Z [2022-12-18T05:08:45Z DEBUG sccache::compiler::compiler] [aws_sdk_s3]: Compiled in 21.388 s, storing in cache
2022-12-18T05:13:25.3910931Z [2022-12-18T05:08:46Z INFO  gha_toolkit::cache] put; version="5c35548282b30b97a2314eadc6fc42b9b7585a12c4f9efa527ac2e48739babe0"
2022-12-18T05:13:25.3911660Z [2022-12-18T05:08:46Z INFO  gha_toolkit::cache] reserve; key="sccache-ubuntu-22.04-x86_64-unknown-linux-gnu-1.62.1-0300a5abb7bd36eb7e4aa0b971d5e94866c6e140" version="7c4ed3dc414c5c2d030ec00d8394e733c5dfeff8cdbae21becc2144de0bb9570" cache_size=21844888
2022-12-18T05:13:25.3912483Z [2022-12-18T05:08:46Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.method=POST http.scheme=https http.host=artifactcache.actions.githubusercontent.com net.host.port=0 otel.kind="client" otel.name=POST /LNWNXsg1z0Qt0Mn41fuRdbA1YzcaL5Z2Wf2AYmEQiLGLuQCklb/_apis/artifactcache/caches

I'm not sure how to interpret these logs. But multi-second values for "Cache write finished in" seem wrong: I would expect a cache write to effectively be a single HTTP request, and for inserts to take dozens to hundreds of milliseconds. But maybe the GitHub Actions cache latency is just really high?
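
One way to sanity-check that expectation would be to time a single PUT in isolation, separating raw network latency from retry-induced delay. A minimal sketch (not sccache's actual code; the client, URL, and function name are hypothetical):

use std::time::Instant;

// Hypothetical sketch: time one cache PUT in isolation. A single healthy
// request should land in tens to hundreds of ms; multi-second totals point
// at retries or server-side throttling rather than raw latency.
async fn timed_put(client: &reqwest::Client, url: &str, body: Vec<u8>) {
    let start = Instant::now();
    let result = client.put(url).body(body).send().await;
    println!(
        "cache PUT took {:?}: {:?}",
        start.elapsed(),
        result.map(|r| r.status())
    );
}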

I'm unsure what action is needed for this issue, or whether there are any code-level bugs. Maybe the GitHub Actions cache is just prohibitively slow? If so, maybe this slowness/behavior should be documented?

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 26 (4 by maintainers)

Most upvoted comments

@rajivshah3 in the future, please strip the color escape codes; they make the log harder to read 😉

For cache writes, we don't need to be super fast. Maybe a queue would do. @Xuanwo, do you think you could do that once you've added opendal support for GHA (or as part of that change)?
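
Roughly what I have in mind, as a hedged sketch (not sccache's actual code; CacheWrite, store_in_backend, and the queue bound are made up for illustration):

use tokio::sync::mpsc;

// Hypothetical sketch: decouple cache writes from compilation with a
// bounded queue; when the queue is full, drop the write instead of blocking.
struct CacheWrite {
    key: String,
    payload: Vec<u8>,
}

async fn store_in_backend(_key: &str, _payload: &[u8]) -> std::io::Result<()> {
    Ok(()) // placeholder for the real backend PUT
}

fn spawn_write_queue() -> mpsc::Sender<CacheWrite> {
    // Small bound: if the backend is rate limiting us, we would rather drop
    // writes than stall compilations behind 60s retries.
    let (tx, mut rx) = mpsc::channel::<CacheWrite>(32);
    tokio::spawn(async move {
        while let Some(w) = rx.recv().await {
            // Best effort: log failures and move on, never retry for minutes.
            if let Err(e) = store_in_backend(&w.key, &w.payload).await {
                eprintln!("cache write for {} failed: {e}", w.key);
            }
        }
    });
    tx
}

// Enqueue with try_send so compilation never waits on a rate-limited backend:
// let _ = tx.try_send(CacheWrite { key, payload });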

I thought so too, but sccache --show-stats does not show any read or write errors. Maybe they are not reported for GHAC?

Hi, @indygreg. Here are our updates on addressing this issue.

I ran a test with my two patches, https://github.com/mozilla/sccache/pull/1557 and https://github.com/mozilla/sccache/pull/1550, on apple-platform-rs.

The biggest lesson I have learned from this is that sccache is a cache service, and we don't need to ensure every write/read succeeds. So instead of waiting to retry rate-limited requests, we skip the retry and keep going.
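
In code, the idea is roughly this (a sketch with made-up helper names, not the actual patch):

// Hypothetical sketch: any cache error is treated exactly like a miss, so a
// rate-limited backend can never fail the build or stall it behind retries.
async fn get_or_compile(key: &str) -> Vec<u8> {
    match cache_read(key).await {
        Ok(Some(obj)) => obj,                         // cache hit
        Ok(None) => compile_and_try_store(key).await, // genuine miss
        Err(e) => {
            // Read error (e.g. a 429): behave exactly like a miss.
            eprintln!("cache read failed, treating as miss: {e}");
            compile_and_try_store(key).await
        }
    }
}

async fn cache_read(_key: &str) -> std::io::Result<Option<Vec<u8>>> {
    Ok(None) // placeholder for the real backend GET
}

async fn compile_and_try_store(_key: &str) -> Vec<u8> {
    Vec::new() // placeholder: compile, then attempt a best-effort store
}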

Here are the differences:

Before

[screenshot: build results before the patches]

After

No failed builds anymore

[screenshot: build results after the patches]

The test PR is here: https://github.com/Xuanwo/apple-platform-rs/pull/1

Hi @indygreg, I reproduced your problem with the v0.4.0-pre.5 release at https://github.com/Xuanwo/apple-platform-rs/actions/runs/3871176193/jobs/6598713731

The real problem here is that GHAC returns a very long Retry-After, which can be up to 60s. If we follow this limit strictly, we observe the surprisingly slow builds.

The full error message is:

sccache: error: Server startup failed: cache storage failed to write: Unexpected (permanent) at write => {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}

Context:
    response: Parts { status: 429, version: HTTP/1.1, headers: {"transfer-encoding": "chunked", "content-type": "application/json", "retry-after": "60", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "daad0d22-be5d-4106-9f45-205eb7eedac3", "activityid": "ec2715a2-3f3a-47e2-ac7a-efad9555e782", "x-tfs-session": "ec2715a2-3f3a-47e2-ac7a-efad9555e782", "x-vss-e2eid": "ec2715a2-3f3a-47e2-ac7a-efad9555e782", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Short", "x-ratelimit-limit": "758", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1673244953", "x-tfs-serviceerror": "Request+was+blocked+due+to+exceeding+usage+of+resource+%27Count%27+in+namespace+%27%27.", "x-cache": "CONFIG_NOCACHE", "x-msedge-ref": "Ref A: 489C21A91BE741E29D754583C16E0C7C Ref B: PAOEDGE0516 Ref C: 2023-01-09T06:11:06Z", "date": "Mon, 09 Jan 2023 06:11:06 GMT"} }
    called: Backend::ghac_reserve
    service: ghac
    path: .sccache_check

Returning a cache miss or giving up on the cache write may be a better choice.
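
For example, Retry-After could be honored only up to a small cap, giving up on the write beyond that (a sketch; the 5s cap is an arbitrary assumption, not a value from the patches):

use std::time::Duration;

// Hypothetical sketch: parse Retry-After and give up on the write rather
// than sleeping for up to 60s as GHAC requests.
const MAX_RETRY_AFTER: Duration = Duration::from_secs(5);

fn backoff_from_retry_after(header: Option<&str>) -> Option<Duration> {
    let secs: u64 = header?.trim().parse().ok()?;
    let wait = Duration::from_secs(secs);
    if wait > MAX_RETRY_AFTER {
        None // too long to be worth blocking a build; skip this write
    } else {
        Some(wait)
    }
}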

Yes, please do; this should not be done on an important project.

ok 😦 this is a bit surprising …

@Xuanwo, are you auto-approving your patches in opendal?