actions-runner-controller: enterprise autoscaling issues, indefinitely queued jobs within workflows
hey - first off, thanks for all of your work on this - it's a great project!
apologies in advance for the lengthy note - just hoping to provide as much helpful info as possible upfront. nevertheless, a (poor) attempt at the cliffsnotes version:
when a workflow with more jobs than currently-running runner pods is kicked off, sometimes the only jobs that ever run - even if a scale up is later triggered (though that scale-up is never triggered within the `scaleUp` period) - are those that are initially picked up. however, sometimes the remaining queued jobs do run, but only after the first jobs complete. there are no errors in the controller logs in either instance.
scenario
- 1 workflow with 4 jobs
- 10s `syncPeriod`
- `minReplicas`: 2
- `maxReplicas`: 20
- `scaleUpThreshold`: 0.5
- `scaleUpFactor`: 3.0
expected behavior
- 2 jobs immediately picked up by the 2 listening pods
- trigger a scale up after the 10s `syncPeriod` completes, as >0.5 of the running pods are now "busy" - this will result in 6 runner pods (a quick sketch of this arithmetic follows the list)
- the remaining 2 jobs in the workflow would get picked up by the newly created runner pods and run alongside the 2 already running
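to make the math explicit, here's a rough sketch (purely illustrative - not the controller's actual implementation, and the function name and comparison are my own assumptions) of how i understand the `PercentageRunnersBusy` scale-up arithmetic should play out for this scenario:

```go
// minimal sketch of the expected PercentageRunnersBusy scale-up arithmetic
// for the scenario above -- not the controller's actual code
package main

import (
	"fmt"
	"math"
)

// desiredReplicas returns the replica count expected after one sync period.
func desiredReplicas(running, busy int, scaleUpThreshold, scaleUpFactor float64) int {
	if running == 0 {
		return running
	}
	if float64(busy)/float64(running) > scaleUpThreshold {
		// more than scaleUpThreshold of the runners are busy: multiply the
		// current count by scaleUpFactor
		return int(math.Ceil(float64(running) * scaleUpFactor))
	}
	return running
}

func main() {
	// 2 runners, both busy, threshold 0.5, factor 3.0 -> expect 6 replicas
	fmt.Println(desiredReplicas(2, 2, 0.5, 3.0))
}
```

with 2 runners both busy, the busy ratio is 1.0 > 0.5, so the desired count should jump to ceil(2 * 3.0) = 6.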
actual behavior
- 2 jobs immediately picked up by the 2 listening pods
- no scale up after the 10s `syncPeriod`
- controller logs show:
2021-04-21T18:12:14.019Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Calculated desired replicas of 2 {"horizontalrunnerautoscaler": "github-actions/ghe-runner-deployment-autoscaler", "suggested": 1, "reserved": 0, "min": 2, "cached": 1, "max": 20}
- once a job completes its pod terminates, a new pod is created, and a queued job runs. at this point, no scale up has been triggered yet
- the scale up to 6 pods only seems to happen after the 3rd job starts on a newly created pod (while the total number of pods is still 2), but none of the remaining queued jobs get picked up by the pods created as part of that scale up - instead they wait for new pods (from the same deployment) that replace the pods of the jobs that did run initially
corresponding scale up logs
2021-04-21T18:16:34.025Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Calculated desired replicas of 2 {"horizontalrunnerautoscaler": "github-actions/ghe-runner-deployment-autoscaler", "suggested": 1, "reserved": 0, "min": 2, "cached": 1, "max": 20}
2021-04-21T18:16:34.025Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "horizontalrunnerautoscaler-controller", "request": "github-actions/ghe-runner-deployment-autoscaler"}
2021-04-21T18:16:34.025Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "runnerreplicaset-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn"}
2021-04-21T18:16:34.025Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "runnerdeployment-controller", "request": "github-actions/ghe-global-runner-deployment"}
2021-04-21T18:16:34.087Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "runner-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn-phgmf"}
2021-04-21T18:16:34.147Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "runner-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn-kxjgg"}
2021-04-21T18:16:34.985Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "runner-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn-kxjgg"}
2021-04-21T18:16:35.047Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "runner-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn-phgmf"}
2021-04-21T18:16:37.204Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "runnerreplicaset-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn"}
2021-04-21T18:16:37.204Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "runnerdeployment-controller", "request": "github-actions/ghe-global-runner-deployment"}
2021-04-21T18:16:44.026Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "runnerdeployment-controller", "request": "github-actions/ghe-global-runner-deployment"}
2021-04-21T18:16:44.026Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "runnerreplicaset-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn"}
2021-04-21T18:16:44.026Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "runnerreplicaset-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn"}
2021-04-21T18:16:44.096Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "runner-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn-phgmf"}
2021-04-21T18:16:44.159Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Suggested desired replicas of 6 by PercentageRunnersBusy {"replicas_desired_before": 2, "replicas_desired": 6, "num_runners": 2, "num_runners_registered": 2, "num_runners_busy": 2, "namespace": "github-actions", "runner_deployment": "ghe-global-runner-deployment", "horizontal_runner_autoscaler": "ghe-runner-deployment-autoscaler", "enterprise": "[redacted]", "organization": "", "repository": ""}
2021-04-21T18:16:44.159Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Calculated desired replicas of 6 {"horizontalrunnerautoscaler": "github-actions/ghe-runner-deployment-autoscaler", "suggested": 6, "reserved": 0, "min": 2, "max": 20}
2021-04-21T18:16:44.169Z DEBUG controller-runtime.webhook.webhooks received request {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "32d77506-8697-404f-a366-5804eecb6885", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2021-04-21T18:16:44.170Z DEBUG controller-runtime.webhook.webhooks wrote response {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "32d77506-8697-404f-a366-5804eecb6885", "allowed": true, "result": {}, "resultError": "got runtime.Object without object metadata: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:,Message:,Reason:,Details:nil,Code:200,}"}
2021-04-21T18:16:44.178Z DEBUG controller-runtime.webhook.webhooks received request {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "b5d15de0-63d7-40a3-8225-7b0a8fcb7fe6", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2021-04-21T18:16:44.179Z INFO runnerdeployment-resource validate resource to be updated {"name": "ghe-global-runner-deployment"}
2021-04-21T18:16:44.179Z DEBUG controller-runtime.webhook.webhooks wrote response {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "b5d15de0-63d7-40a3-8225-7b0a8fcb7fe6", "allowed": true, "result": {}, "resultError": "got runtime.Object without object metadata: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:,Message:,Reason:,Details:nil,Code:200,}"}
2021-04-21T18:16:44.180Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "runner-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn-kxjgg"}
2021-04-21T18:16:44.194Z DEBUG controller-runtime.webhook.webhooks received request {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "UID": "861c2f84-6aba-4e16-922e-84ae490b0190", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerreplicasets"}}
2021-04-21T18:16:44.195Z DEBUG controller-runtime.webhook.webhooks wrote response {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "UID": "861c2f84-6aba-4e16-922e-84ae490b0190", "allowed": true, "result": {}, "resultError": "got runtime.Object without object metadata: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:,Message:,Reason:,Details:nil,Code:200,}"}
2021-04-21T18:16:44.196Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "horizontalrunnerautoscaler-controller", "request": "github-actions/ghe-runner-deployment-autoscaler"}
2021-04-21T18:16:44.197Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Calculated desired replicas of 6 {"horizontalrunnerautoscaler": "github-actions/ghe-runner-deployment-autoscaler", "suggested": 6, "reserved": 0, "min": 2, "cached": 6, "last_scale_up_time": "2021-04-21 18:16:44 +0000 UTC", "scale_down_delay_until": "2021-04-21 18:16:54 +0000 UTC", "max": 20}
job that ran
Current runner version: '2.277.1'
Runner name: 'ghe-global-runner-deployment-rwtwn-kxjgg'
Runner group name: 'k8s'
Machine name: 'ghe-global-runner-deployment-rwtwn-kxjgg'
job that was queued until one of the initial jobs completed and its pod terminated
Current runner version: '2.277.1'
Runner name: 'ghe-global-runner-deployment-rwtwn-kxjgg'
Runner group name: 'k8s'
Machine name: 'ghe-global-runner-deployment-rwtwn-kxjgg'
overview
i've confirmed this for clean installs (via helm for the controller, having deleted and re-created the CRDs manually per existing issue instructions) using both `v0.18.2` as well as the `canary` actions-runner-controller image.
versioning
Name: actions-runner-controller
Namespace: actions-runner-system
CreationTimestamp: Wed, 14 Apr 2021 17:28:40 -0500
Labels: app.kubernetes.io/instance=actions-runner-controller
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=actions-runner-controller
app.kubernetes.io/version=0.18.2
helm.sh/chart=actions-runner-controller-0.11.0
Annotations: deployment.kubernetes.io/revision: 17
meta.helm.sh/release-name: actions-runner-controller
meta.helm.sh/release-namespace: actions-runner-system
Selector: app.kubernetes.io/instance=actions-runner-controller,app.kubernetes.io/name=actions-runner-controller
## controller
image:
repository: summerwind/actions-runner-controller
tag: "v0.18.2" (also w/ canary)
i've configured the controller on the enterprise level, as we have many organizations to support, though right now it's only enabled for 2 organizations and a select number of repositories within each. given we can disable rate-limiting, i haven't run into any issues keeping the `syncPeriod` low on that front.
so far, there haven't been any issues when it comes to the runners themselves being recognized on the github side (the workers all appear in the custom `k8s` group created), and when the pod(s) for a job spin up, they select the appropriate runner deployment per the label in that spec that corresponds to what's referenced in the actions workflow itself.
the problem is that the `PercentageRunnersBusy` metric i've used to autoscale doesn't seem to be correctly registering the number of queued jobs. in debugging the controller side, i've confirmed there are no errors in the logs themselves. the output indicates that the desired number of replicas has been successfully reconciled; the problem is those values aren't in line with what you would expect per the `scaleUpThreshold` and `scaleUpFactor` values that ought to be creating new pods.
controller logs
in this case i kicked off a workflow with 2 jobs (i've tried this using a range of `syncPeriod` values from `1s` to `1m+`). logs consistently showed the following output, despite one of the jobs remaining queued, even once the first job completed.
2021-04-21T16:54:33.911Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Calculated desired replicas of 1 {"horizontalrunnerautoscaler": "github-actions/ghe-runner-deployment-autoscaler", "suggested": 0, "reserved": 0, "min": 1, "cached": 0, "max": 20}
in instances where a job that's part of a larger workflow is queued and no scale up occurs, those queued jobs seem to get lost in some sort of purgatory and remain indefinitely queued until the workflow is canceled by the end user.
here's the outcome of a few other configuration values in the context of the 1 workflow, 4 jobs example:
scale up configs test 1
actions
- 1 workflow with 4 jobs
configs
minReplicas: 1
maxReplicas: 20
metrics:
- type: PercentageRunnersBusy
scaleUpThreshold: '0.45' # The percentage of busy runners at which the number of desired runners are re-evaluated to scale up
scaleDownThreshold: '0.3' # The percentage of busy runners at which the number of desired runners are re-evaluated to scale down
scaleUpFactor: '2.0' # The scale up multiplier factor applied to desired count
scaleDownFactor: '0.5' # The scale down multiplier factor applied to desired count
outcome
- 1 job ran successfully
- 2 pods were created, though no job ever ran on the 2nd pod
- 3 jobs remained queued
scale up configs test 2
actions
- 1 workflow with 4 jobs
configs
minReplicas: 2
maxReplicas: 20
metrics:
- type: PercentageRunnersBusy
scaleUpThreshold: '0.5' # The percentage of busy runners at which the number of desired runners are re-evaluated to scale up
scaleDownThreshold: '0.3' # The percentage of busy runners at which the number of desired runners are re-evaluated to scale down
scaleUpFactor: '3.0' # The scale up multiplier factor applied to desired count
scaleDownFactor: '0.5' # The scale down multiplier factor applied to desired count
outcome
- 2 jobs ran successfully
- 6 pods were created
- 2 remaining jobs were indefinitely queued; none of the newly created pods registered that there was a queue, and all shared identical logs in the `runner` container (below)
- once the 2 jobs that did run completed, and the 2 queued jobs (still not picked up) were subsequently cancelled on the github side, the pods scaled down to 3
- on the github side, those 3 workers (and only those 3 workers) showed as "idle" self-hosted workers (expected behavior)
runner container logs
Starting Runner listener with startup type: service
Started listener process
Started running service
√ Connected to GitHub
2021-04-21 17:36:32Z: Listening for Jobs
further investigation notes
prior to the above tests having run, one discrepancy i did find while investigating this was that there were a number of "offline" workers displaying in the github UI that i hadn't caught. i suspect these were left over from a prior update where i ran into an issue that's been documented in a few other posts, where a seemingly infinite scale-up occurred. none of the "ghost" workers showing in the github UI were to be found when checking kubernetes.
despite those "offline" workers on the github side, the controller logs never reflected that number of workers, so you'd think these purgatory workers wouldn't be a factor in autoscaling in that respect - perhaps they're an unrelated side-effect of an upgrade. in the instance above where the logs calculated a desired replica count of 1, there were 63 "offline" workers at the enterprise level.
(i confirmed that once i deleted the 63 offline workers, the behavior re: autoscaling and queued jobs was the same. on subsequent scale downs, the terminated pods were being removed as available self-hosted runners on the github side as well, so manually deleting those on the github side seemed to resolve that particular problem)
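for anyone else who hits the same "ghost" workers, here's a rough sketch of one way to list the offline enterprise runners via the REST API so they can be cleaned up. this is hypothetical and not part of the controller - the host, enterprise slug, and token handling are placeholders, and it assumes your GHES version exposes the enterprise self-hosted runner endpoints:

```go
// list runners that GitHub considers "offline" at the enterprise level,
// assuming the enterprise self-hosted runner REST endpoints are available
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

type runner struct {
	ID     int    `json:"id"`
	Name   string `json:"name"`
	Status string `json:"status"` // "online" or "offline"
	Busy   bool   `json:"busy"`
}

type runnersResponse struct {
	TotalCount int      `json:"total_count"`
	Runners    []runner `json:"runners"`
}

func main() {
	// placeholder host and enterprise slug -- substitute your own
	url := "https://GHE_HOST/api/v3/enterprises/ENTERPRISE/actions/runners?per_page=100"

	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "token "+os.Getenv("GITHUB_TOKEN"))
	req.Header.Set("Accept", "application/vnd.github.v3+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var body runnersResponse
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		panic(err)
	}

	// print only the offline "ghost" workers described above
	for _, r := range body.Runners {
		if r.Status == "offline" {
			fmt.Printf("offline runner: id=%d name=%s busy=%v\n", r.ID, r.Name, r.Busy)
		}
	}
}
```

offline runners can then be removed with `DELETE /enterprises/{enterprise}/actions/runners/{runner_id}`, or manually via the UI as i did above.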
summary of what i've tried so far
- clean installs of `v0.18.2` with the latest `helm` charts (and manually deleting/re-creating the new CRDs); i've also tried with the `canary` image on the controller as noted above
- the manual delete/update `finalizer` approach per https://github.com/summerwind/actions-runner-controller/issues/418#issuecomment-815037707
- adjusting the scale up values by orders of magnitude
- confirmed this is an issue whether a single runner deployment with a label has been applied or two deployments with different labels. the expected behavior is there in that it only affects the deployment with the relevant label, so it doesn't seem to be related to the total number of live pods, as i initially thought it could be some sort of `selector` problem.
my best guess right now is that there's some sort of discrepancy between how workflows - and the jobs that are part of that larger workflow - are registering with the controller. i don't know if it could relate to how these webhooks are treated when the runner is set at the enterprise level versus at the organization and/or repository level?
(an additional question i had on that note is whether the github "webhook" feature can even be enabled on an enterprise level, but perhaps that warrants a separate issue)
thank you for reading this far - please let me know if you need any further info from me!! below are some more k8s specs for reference
`RunnerDeployment` spec
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
name: ghe-global-runner-deployment
namespace: github-actions
spec:
template:
spec:
enterprise: xxxxxx
group: k8s
labels:
- basic
image: summerwind/actions-runner:latest
imagePullPolicy: Always
dockerdWithinRunnerContainer: false
resources:
limits:
cpu: "7.0"
memory: "7Gi"
requests:
cpu: "7.0"
memory: "7Gi"
dockerdContainerResources:
limits:
cpu: "7.0"
memory: "7Gi"
requests:
cpu: "7.0"
memory: "7Gi"
tolerations:
- key: "node.kubernetes.io/unreachable"
operator: "Exists"
effect: "NoExecute"
tolerationSeconds: 10
`HorizontalRunnerAutoscaler` specs
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
name: ghe-runner-deployment-autoscaler
namespace: github-actions
spec:
scaleDownDelaySecondsAfterScaleOut: 10
scaleTargetRef:
name: ghe-global-runner-deployment
minReplicas: 2
maxReplicas: 20
metrics:
- type: PercentageRunnersBusy
scaleUpThreshold: '0.5' # The percentage of busy runners at which the number of desired runners are re-evaluated to scale up
scaleDownThreshold: '0.3' # The percentage of busy runners at which the number of desired runners are re-evaluated to scale down
scaleUpFactor: '3.0' # The scale up multiplier factor applied to desired count
scaleDownFactor: '0.5' # The scale down multiplier factor applied to desired count
`ActionsRunnerController`
Name: actions-runner-controller
Namespace: actions-runner-system
CreationTimestamp: Wed, 14 Apr 2021 17:28:40 -0500
Labels: app.kubernetes.io/instance=actions-runner-controller
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=actions-runner-controller
app.kubernetes.io/version=0.18.2
helm.sh/chart=actions-runner-controller-0.11.0
Annotations: deployment.kubernetes.io/revision: 17
meta.helm.sh/release-name: actions-runner-controller
meta.helm.sh/release-namespace: actions-runner-system
Selector: app.kubernetes.io/instance=actions-runner-controller,app.kubernetes.io/name=actions-runner-controller
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app.kubernetes.io/instance=actions-runner-controller
app.kubernetes.io/name=actions-runner-controller
Annotations: kubectl.kubernetes.io/restartedAt: 2021-04-20T17:57:51-05:00
Service Account: actions-runner-controller
Containers:
manager:
Image: summerwind/actions-runner-controller:canary
Port: 9443/TCP
Host Port: 0/TCP
Command:
/manager
Args:
--metrics-addr=127.0.0.1:8080
--enable-leader-election
--sync-period=10s
--docker-image=docker:dind
Environment:
GITHUB_TOKEN: <set to the key 'github_token' in secret 'controller-manager'> Optional: false
GITHUB_ENTERPRISE_URL: [redacted]
Mounts:
/etc/actions-runner-controller from secret (ro)
/tmp from tmp (rw)
/tmp/k8s-webhook-server/serving-certs from cert (ro)
kube-rbac-proxy:
Image: quay.io/brancz/kube-rbac-proxy:v0.8.0
Port: 8443/TCP
Host Port: 0/TCP
Args:
--secure-listen-address=0.0.0.0:8443
--upstream=http://127.0.0.1:8080/
--logtostderr=true
--v=10
Environment: <none>
Mounts: <none>
Volumes:
secret:
Type: Secret (a volume populated by a Secret)
SecretName: controller-manager
Optional: false
cert:
Type: Secret (a volume populated by a Secret)
SecretName: webhook-server-cert
Optional: false
tmp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: actions-runner-controller-7d9467c88 (1/1 replicas created)
Events: <none>
About this issue
- State: closed
- Created 3 years ago
- Comments: 36 (13 by maintainers)
Rather than let you all speculate in this issue, I'll just go ahead and put some rough plans in here. We plan to ship most of this by the end of the summer and we plan to also ship it in GHES after we test it on dotcom.
Stuff that should land in June-ish:
Will probably take longer due to figuring out API contracts and performance testing:
I'm happy to work with you all to test these out before they are "officially" available so you can update the solution. We think this is a lot of good stuff to enable auto-scaling runners. If you have feedback, want to get in touch about implementation, or think there are more things we should do in this area, you can email me using my handle at github.
just wanted to post here to let you all know we just officially launched ephemeral: https://github.blog/changelog/2021-09-20-github-actions-ephemeral-self-hosted-runners-new-webhooks-for-auto-scaling/
We were able to ship the assignment changes last week, so you should now see jobs get picked up at all levels even if you register a runner after a job was scheduled (including organization level).
Make sure you check your runner group access permissions when you debug this.
We are working on the webhooks as we speak and still expect to ship ephemeral soon (September time frame).
@hross really excited for these changes to land!
I just wanted to highlight https://github.com/actions-runner-controller/actions-runner-controller/issues/642 to you, as we've had a number of people request the ability to scale their runner counts up based on label from webhooks. At the moment the payloads don't include the information we would need to consider doing this in the project.
EDIT @hross additionally, apologies for bothering, but are you able to tell us if we are on track for those June-ish features? I'm especially interested in this one:
Here's what I've posted in the github community forum -> https://github.community/t/bug-self-hosted-runners-at-the-enterprise-level-fail-to-detect-queued-jobs/176348
Given the time horizon provided by GitHub is only a few weeks for the fix to be shipped, and then I think realistically it could take up to a month for it to be fully tested on their end before general rollout, I personally wouldn't want to start introducing code that is only there as a temporary fix unless it added broader long term value to the project.
Additionally, correct me if I'm wrong @kathleenfrench, but her main driver for wanting enterprise scaling is that doing it at the organisation level is not realistic in her environment due to the number of organisations that she needs to support. I don't think introducing more flex around how things scale would do much for her as an end user.
I think this is worth doing with the current value as the default. Considering GitHub Enterprise Server lets the administrator control the rate limiting behaviour, and the solution supports that environment, it would be handy to be able to bring that value right down. Once GitHub patch their end, being able to lower this value would still be very beneficial in a GitHub Enterprise Server environment.
@kathleenfrench thanks for the great response as usual.
Yes I took that to mean they are patching their job assignment logic rather than expanding out the events at the enterprise level. I assume the latter is much more involved and has implications around their infrastructure and so would not be part of the fix they are applying. Hopefully they eventually expand out the events in future though!
Have you seen the new scheduling feature @mumoshu has managed to put together? Would this cover your need to run a larger number of min replicas during peak hours and then scale back during non-peak/non-business hours? See https://github.com/actions-runner-controller/actions-runner-controller/issues/484 to track the details.
@mumoshu also it's probably worth putting the `pinned` label on this issue until GitHub confirm they have deployed their fix here https://github.community/t/bug-self-hosted-runners-at-the-enterprise-level-fail-to-detect-queued-jobs/176348

For posterity's sake, I raised this with our Enterprise support:
Unfortunately it did not make the cutoff for 3.2 release cycle and we will ship it in 3.3.
@kathleenfrench @callum-tait-pbx Hey! Sorry for taking so long to respond to this great thread.
I just read https://github.community/t/bug-self-hosted-runners-at-the-enterprise-level-fail-to-detect-queued-jobs/176348 and GitHub is going to fix the issue on their end. Good.
Until the fix lands, the best we can do in terms of autoscaling right now would be to update HRA to add CapacityReservation to add runners for a specific period of time. If your workflow runs/jobs are run on a schedule or the timings are predictable, this workaround would work.
Another possible solution would be the "scheduled scaling" feature we're discussing in #484. It would allow you to have more runners in business hours and fewer runners in other hours. The scaling isn't as dynamic as it would be with `PercentageRunnersBusy` though.
Regarding `CacheDuration`, I'm open to making it more customizable. What if we had another flag `--github-api-cache-duration` to customize it, so that you can freely choose whatever `--sync-period` and `--github-api-cache-duration` values you want?

It's a bit of a weird one, isn't it!
Potentially we could just remove the subtraction entirely, check that the value provided is >= 1s, and default to the fail safe 10m figure if the input is bad or someone tries to do a sub-second sync period.

The solution has improved greatly since its first inception. Originally the pull-based `TotalNumberOfQueuedAndInProgressWorkflowRuns` metric was the only means for scaling; that metric requires a lot of API calls to maintain, so it wasn't really possible to run a very low sync period. Since then the `PercentageRunnersBusy` metric has been added, which uses far fewer API calls to maintain, as well as a webhook option which doesn't rely on lots of API calls to begin with. We've also added the ability to run multiple controllers since then too, to allow very large scale setups without getting rate limited.

The default 10 minutes for a sync period was chosen just to stop someone rate limiting themselves out of the box; there isn't anything particular about why it's 10 minutes beyond that reason.
Alternatively, if we feel 1s is too low and you are extremely likely to rate limit yourself, then we could keep the check at the 10s threshold (or maybe only lower it to 5s), and if the sync period provided is below that threshold just default to the lowest accepted value instead of the fail safe and spit out some log messages so it is debuggable. The logic being that the end user has actively chosen a short sync period, so we are assuming they understand they may get rate limited.
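For illustration only, a rough sketch of the two options being discussed here (a hypothetical standalone helper, not the controller's actual code or flag handling):

```go
// sketch of the proposed sync-period validation: option A falls back to the
// 10m fail safe when the requested value is below the accepted minimum;
// option B clamps to the minimum and logs so the behaviour is debuggable
package main

import (
	"fmt"
	"time"
)

const failSafe = 10 * time.Minute // current out-of-the-box default

func normalizeSyncPeriod(requested, minAccepted time.Duration, clampInsteadOfFailSafe bool) time.Duration {
	if requested >= minAccepted {
		return requested
	}
	if clampInsteadOfFailSafe {
		// option B: the user actively chose a short sync period, so clamp to
		// the lowest accepted value and warn about possible rate limiting
		fmt.Printf("sync period %s below minimum %s; clamping to %s\n", requested, minAccepted, minAccepted)
		return minAccepted
	}
	// option A: treat the input as bad and fall back to the fail safe
	fmt.Printf("sync period %s below minimum %s; falling back to %s\n", requested, minAccepted, failSafe)
	return failSafe
}

func main() {
	fmt.Println(normalizeSyncPeriod(500*time.Millisecond, time.Second, false)) // -> 10m fail safe
	fmt.Println(normalizeSyncPeriod(3*time.Second, 5*time.Second, true))       // -> clamped to 5s
}
```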
Obviously, with you being on the server edition and able to administer your rate limit configuration, none of this is relevant for your environment!
cc @mumoshu what are your thoughts?
no problem!
Sounds like it works! The `Runner` kind is for deploying a single runner rather than sets; it doesn't support being scaled by a `horizontalrunnerautoscaler`. I'm not sure how much use it is at an enterprise level really tbh, but I was pretty sure it worked - I just needed someone to test it for me. With that confirmed the docs can read and flow better.

https://github.com/actions/runner/issues/1059 - interesting, it looks like the problem lies with the runner routing / queueing service on github's end.
For the moment then it looks like enterprise level autoscaling isn't possible until GitHub improve their queueing / routing service, lame. If they do end up improving their service, we may already be able to take advantage of the feature with the code as it stands.
@kathleenfrench if you could try deploying a `Runner` kind and see if it can be consumed by repositories in organisations, that would be great! We can then update the docs with all this great information you've managed to discover!

In the meantime, if you raise something on https://github.community/c/code-to-cloud/github-actions/41 as advised by github, we can link it to an issue here so we can keep track of it.