checkout: Checkout suddenly much slower on windows environment

Moved from https://github.com/actions/runner-images/issues/7180

Description

Since a few days, the duration of the actions/checkout@v3 steps on windows-2019 have dramatically increased. This behavior is seen on all my repos (all private). Below a table showing an example of before/after.

Checkout step Feb 17th Feb 21st
1 13s 1m35s
2 8s 47s

The result is a huge increase in build (and billable) time.

image

image

The github status page does show some issues around this time frame, but these were all solved:

image

Platforms affected

  • Azure DevOps
  • GitHub Actions - Standard Runners
  • GitHub Actions - Larger Runners

Runner images affected

  • Ubuntu 18.04
  • Ubuntu 20.04
  • Ubuntu 22.04
  • macOS 11
  • macOS 12
  • Windows Server 2019
  • Windows Server 2022

Image version and build link

Current runner version: '2.301.1'
Operating System
  Microsoft Windows Server 2019
  10.0.17763

Private repo

Is it regression?

Yes, sorry private repos

Expected behavior

The build times should be fairly constant.

Actual behavior

Build times explode. Burning down our build minutes too fast.

Repro steps

Compare build times on any windows environment from before Feb 18th with today.

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Reactions: 10
  • Comments: 30 (9 by maintainers)

Commits related to this issue

Most upvoted comments

You can collect ETW traces to help diagnose what’s happening and taking time during checkout action.

As mentioned in a comment above, can you try this? https://github.com/BrettDong/Cataclysm-DDA/blob/etw/.github/workflows/etw.yml

I don’t have access to larger runners currently to test myself.

Yeah, but presumably so can the GitHub engineers who support says are working to fix this. Like, IDK, it kind of seems to me that the people who wrote this code, control these VMs, and whom we are paying for this service, could maybe take a look at the issues with it.

I’ve encountered this and debugged it a bit, by heavily instrumenting the transpiled code in dist/index.js. Turns out that there are multiple calls to PowerShell that make a mess by essentially waiting for dozens of seconds. Here is the example run I analyzed:

  • here, 1 minute and 14 seconds are spent,
  • here, 11 seconds are spent,
  • here, 12 seconds are spent,
  • here, 11 seconds are spent,

all on this PowerShell call:

(Get-CimInstance -ClassName Win32_OperatingSystem).caption

totaling 1 minute and 48 whopping seconds just to determine a Windows release that the actions/checkout Action has no need to know.

To make things even worse, the same PowerShell invocation also happens multiple times during the post-action:

I do not really understand what changed between Feb 17th and 21st that would explain this slowdown. There has not been a new actions/checkout release since Jan 5th…

We are discontinuing our use of GH managed windows runners. The costs were already beyond premium/sustainable, and the performance is so poor that the issue compounds out of control. I don’t consider this a viable way to run CI for any business.

I can tolerate a lot but not at massively inflated prices.

Just a reminder, given the evidence in https://github.com/actions/runner-images/issues/7320, there is almost certainly an underlying issue that is not specific to git at all. Rather, all operations that place high demands on disk (and CPU).

The fix in #1246 reduces stalled time down to 3 seconds.

During the 3 seconds the workflow is stalled on loading the node.exe from disk to memory for execution. I don’t think there is anything I can do to get rid of it.

I have observed this too. Example job: https://github.com/Tyrrrz/CliWrap/actions/runs/4271628026

  • Windows – checkout took 4m9s:

image

  • Ubuntu – checkout took <1s:

image

  • macOS – checkout took 3s:

image

Note that this behavior is pretty inconsistent, and usually checkout completes pretty fast on Windows. But in the event that it is slow, it’s only ever slow on Windows.

Thanks Ben. Further runs suggest that my switch to use git commands directly instead of actions/checkout was just lucky the first few times (or the computer learned what I was trying to do and put a stop to it 😛). Subsequent runs have had similar latency to before the switch, I think (I started hacking together a script to collect statistics for jobs over time, but got side-tracked, so pure anecdata right now). So I’m back to thinking it’s the VM+git itself and not the action. I am sort of considering getting tarballs for all of the submodules instead of using git… I’ll update if that seems to be faster somehow (which would suggest to me something git-specific and not just IO or network issues)

👋 hey sorry for not seeing this one, we are tracking/providing commentary on our investigation on the windows larger runner checkout here: oops thats the internal issue and no good to anyone

Ok that first update was all kinds of wrong let me try again!

Sorry we haven’t commented on this ticket still, we are tracking this internally. We are have made some changes to the Windows VM image but only recently and the don’t appear to have helped. With everything else going on we have had to put this one aside as well for the last couple of weeks but we are committed to fixing this. I will re-open this ticket as this is linked in the issue we are tracking internally 😃

@GMNGeoffrey I would like to encourage you to consider the current macro-economical climate, and also just how large he roadmap is. And please also note how @BrettDong’s effort was rewarded: I am sure that you also will get what you want much quicker if you dig a little deeper with those ETW traces. I would even consider to help, but I do not have access to those large runners; You do, though.

You can collect ETW traces to help diagnose what’s happening and taking time during checkout action.

Seems like this is also something that @Gakk is hitting.

@GMNGeoffrey, I have done extensive testing and confirmed that my issues were resolved by actions/checkout version 3.5.1.

@BrettDong excellent! I tested this and the times are back to decent levels: the entire checkout step takes 11 seconds with your PR. Thank you so much!

I have briefly seen success trying to hard-code the fix of https://github.com/sindresorhus/windows-release/pull/18 into dist/index.js but it does not seem to work when I undelete all the core logic. sigh

At this point, I have to tend to other things, but I thought I’d leave my current findings here in case somebody else can take over (and try things like overriding os.release() to always return 2022 or something like that, before letting @octokit/endpoint initialize its User-Agent string).