runner: Very slow queuing with plenty of idle runners available

Originally posted here:

https://github.community/t/very-slow-queuing-behavior-when-idle-runners-are-available/127674

Describe the bug

When no other builds are running (all my runners are idle), GitHub Actions takes a long time before builds even start. This is especially noticeable with, for example, a 4^4 matrix (256 checks), even if you have 256 idle self-hosted runners.

The UI shows “X queued checks” at a rate of around 4 per second (ie, “4 queued checks”, “8 queued checks”, etc), before it finally gets to 256 checks queued. It takes a full 1min 40sec before the first of my runners even receives a message and starts building. It takes 12-13min for the entire run to be marked as finished, even if each build does no work and completes in 1sec or less.

To Reproduce

  1. Register (and run) 256 self-hosted runners
  2. Run a workflow that uses a 4^4 matrix:
    strategy:
      matrix:
        ix1: [ 0, 1, 2, 3 ]
        ix2: [ 0, 1, 2, 3 ]
        ix3: [ 0, 1, 2, 3 ]
        ix4: [ 0, 1, 2, 3 ]
  3. Observe that it takes a long time before the first build message is sent to a runner
  4. Observe how, even after all checks are completed, it still takes many minutes (10?) for the entire workflow to be marked as finished

You can also observe similar behavior with fewer checks, e.g. with just 16 runners and a 4^2 matrix. Even then, checks are queued before the first build starts, and there’s a noticeable delay after the 16th check has finished before the whole workflow is marked as complete. I see an overall run time of 1min10sec, even though each worker completes its build in less than a second.
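As a rough illustration of where the 256 and 16 come from, the sketch below expands a matrix into its individual jobs by taking the Cartesian product of the axes. The helper name `expand_matrix` is hypothetical, and this only mirrors the documented matrix behavior; it is not how the service itself is implemented.

```python
from itertools import product


def expand_matrix(matrix):
    """Return one dict per job: the Cartesian product of all matrix axes."""
    keys = list(matrix)
    return [dict(zip(keys, combo)) for combo in product(*(matrix[k] for k in keys))]


# The 4^4 matrix from the reproduction steps (axes ix1..ix4).
full = {f"ix{i}": [0, 1, 2, 3] for i in range(1, 5)}
# The smaller 4^2 variant (axes ix1 and ix2 only).
small = {f"ix{i}": [0, 1, 2, 3] for i in range(1, 3)}

print(len(expand_matrix(full)))   # 256
print(len(expand_matrix(small)))  # 16
```

Each resulting dict corresponds to one check, so one idle runner per dict should in principle be enough to start every job at once.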

Expected behavior

  1. There should be no queueing when idle runners are available
  2. Runners should receive build messages with no delay – within a few seconds at most.
  3. Workflows should be marked as finished soon after the final check is complete.

Runner Version and Platform

Checked 2.272.0 and 2.273.0

Checked on OSX and Linux

What’s not working?

This is not a problem with the runner (ie, the software in this repo) in any way as far as I can tell – it’s purely behavior on the service-side that’s causing this.

I suspect there’s some sort of serial process whose cost compounds the more parallel jobs you have (which defeats the purpose of having parallel jobs in the first place).
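To make that suspicion concrete, here is a back-of-the-envelope sketch. It assumes the only cost is a strictly serial queuing step running at the roughly 4 checks per second seen in the UI. The function name and the linear model are assumptions for illustration, not anything known about the service.

```python
def time_to_queue_all(num_checks, checks_per_second=4.0):
    """Seconds until every check is queued, if checks are queued one after another."""
    return num_checks / checks_per_second


print(time_to_queue_all(256))  # 64.0
print(time_to_queue_all(16))   # 4.0
```

Under this model the 4^4 matrix spends about 64 seconds just being queued. That is a large share of the 1min 40sec before the first runner receives a message, but not all of it, so queuing alone would not explain the full delay or the slow completion at the end.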

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 17 (1 by maintainers)

Most upvoted comments

@hross This is still a problem over a year later - is work still being done here?

Our self-hosted runner infrastructure scales automatically, and downscaling removes runners that have been idle for a certain amount of time. With sane idle time limits (~10 minutes), runners get terminated before jobs from our (ever growing) queue are assigned to them.

This issue costs us quite a bit of time and money.

It happened to me with Cirun: the runner was registered and idle, but was only assigned a job 30 seconds later. Does this have anything to do with the --ephemeral argument?

Really interested in this. Is there any progress, @hross?

We’re in the process of setting up ephemeral runners in AWS with the Philips-Labs solution. Runners are created fine but the instances sit there ready for the job for an age.

Best so far is 2mins, worst was over 25mins. The frustrating thing is that the job does eventually run; I’d almost prefer it didn’t and instead errored with something to hang our hat on.

Any assistance or tweaks would be much appreciated.

We have an internal ticket tracking this issue (related to the community feedback item). It is a back end infrastructure issue that we are working on but don’t yet have a good solution for. Since it’s not related to the runner, I’m going to close this issue (but we will keep an eye on the community feedback ticket and track this internally).

Thanks for bringing this up @mhart