actions-runner-controller: enterprise autoscaling issues, indefinitely queued jobs within workflows

hey - first off, thanks for all of your work on this 😄 - it's a great project!

apologies in advance for the lengthy note - just hoping to provide as much helpful info as possible upfront. nevertheless, a (poor) attempt at the cliffsnotes version:

when a workflow with more jobs than currently-running runner pods is kicked off, sometimes the only jobs that ever run are those that are initially picked up - even if a scale up is later triggered (though that scale up never happens within the scale-up period). other times the remaining queued jobs do run, but only after the first jobs complete. there are no errors in the controller logs in either case.

scenario

  • 1 workflow with 4 jobs
  • 10s syncPeriod

  minReplicas: 2
  maxReplicas: 20
  scaleUpThreshold: 0.5
  scaleUpFactor: 3.0

expected behavior

  • 2 jobs immediately picked up by the 2 listening pods
  • trigger a scale up after the 10s syncPeriod completes, as > 0.5 of the running pods are now 'busy'
  • this should result in 6 runner pods (rough arithmetic sketched just below this list)
  • the remaining 2 jobs in the workflow would get picked up by the newly created runner pods and run alongside the 2 already running
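for reference, the arithmetic behind that expectation - a minimal sketch in go, based on my reading of the PercentageRunnersBusy docs rather than the controller's actual code (variable names are my own):

package main

import (
	"fmt"
	"math"
)

func main() {
	// both minReplicas pods have picked up a job
	busy, total := 2.0, 2.0
	scaleUpThreshold := 0.5
	scaleUpFactor := 3.0

	desired := total
	// 2/2 = 1.0 >= 0.5, so a scale up should be triggered on the next sync
	if busy/total >= scaleUpThreshold {
		// ceil(2 * 3.0) = 6 runner pods
		desired = math.Ceil(total * scaleUpFactor)
	}
	fmt.Printf("desired replicas: %.0f\n", desired)
}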

actual behavior

  • 2 jobs immediately picked up by the 2 listening pods
  • no scale up after 10s syncPeriod - controller logs show:
2021-04-21T18:12:14.019Z        DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Calculated desired replicas of 2      {"horizontalrunnerautoscaler": "github-actions/ghe-runner-deployment-autoscaler", "suggested": 1, "reserved": 0, "min": 2, "cached": 1, "max": 20}
  • once a job completes its pod terminates, a new pod is created, and a queued job runs. at this point, no scale up has been triggered yet
  • the scale up to 6 pods only seems to happen after the 3rd job starts on a newly created pod (when the total number of pods is still 2), but none of the remaining queued jobs get picked up by the pods created as part of that scale up; instead they wait for new pods from the same deployment to replace the initial pods whose jobs did run
corresponding scale up logs:
2021-04-21T18:16:34.025Z        DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Calculated desired replicas of 2      {"horizontalrunnerautoscaler": "github-actions/ghe-runner-deployment-autoscaler", "suggested": 1, "reserved": 0, "min": 2, "cached": 1, "max": 20}
2021-04-21T18:16:34.025Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "horizontalrunnerautoscaler-controller", "request": "github-actions/ghe-runner-deployment-autoscaler"}
2021-04-21T18:16:34.025Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "runnerreplicaset-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn"}
2021-04-21T18:16:34.025Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "runnerdeployment-controller", "request": "github-actions/ghe-global-runner-deployment"}
2021-04-21T18:16:34.087Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "runner-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn-phgmf"}
2021-04-21T18:16:34.147Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "runner-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn-kxjgg"}
2021-04-21T18:16:34.985Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "runner-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn-kxjgg"}
2021-04-21T18:16:35.047Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "runner-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn-phgmf"}
2021-04-21T18:16:37.204Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "runnerreplicaset-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn"}
2021-04-21T18:16:37.204Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "runnerdeployment-controller", "request": "github-actions/ghe-global-runner-deployment"}
2021-04-21T18:16:44.026Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "runnerdeployment-controller", "request": "github-actions/ghe-global-runner-deployment"}
2021-04-21T18:16:44.026Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "runnerreplicaset-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn"}
2021-04-21T18:16:44.026Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "runnerreplicaset-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn"}
2021-04-21T18:16:44.096Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "runner-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn-phgmf"}
2021-04-21T18:16:44.159Z        DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Suggested desired replicas of 6 by PercentageRunnersBusy       {"replicas_desired_before": 2, "replicas_desired": 6, "num_runners": 2, "num_runners_registered": 2, "num_runners_busy": 2, "namespace": "github-actions", "runner_deployment": "ghe-global-runner-deployment", "horizontal_runner_autoscaler": "ghe-runner-deployment-autoscaler", "enterprise": "[redacted]", "organization": "", "repository": ""}
2021-04-21T18:16:44.159Z        DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Calculated desired replicas of 6      {"horizontalrunnerautoscaler": "github-actions/ghe-runner-deployment-autoscaler", "suggested": 6, "reserved": 0, "min": 2, "max": 20}
2021-04-21T18:16:44.169Z        DEBUG   controller-runtime.webhook.webhooks     received request        {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "32d77506-8697-404f-a366-5804eecb6885", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2021-04-21T18:16:44.170Z        DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "32d77506-8697-404f-a366-5804eecb6885", "allowed": true, "result": {}, "resultError": "got runtime.Object without object metadata: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:,Message:,Reason:,Details:nil,Code:200,}"}
2021-04-21T18:16:44.178Z        DEBUG   controller-runtime.webhook.webhooks     received request        {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "b5d15de0-63d7-40a3-8225-7b0a8fcb7fe6", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2021-04-21T18:16:44.179Z        INFO    runnerdeployment-resource       validate resource to be updated {"name": "ghe-global-runner-deployment"}
2021-04-21T18:16:44.179Z        DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "b5d15de0-63d7-40a3-8225-7b0a8fcb7fe6", "allowed": true, "result": {}, "resultError": "got runtime.Object without object metadata: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:,Message:,Reason:,Details:nil,Code:200,}"}
2021-04-21T18:16:44.180Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "runner-controller", "request": "github-actions/ghe-global-runner-deployment-rwtwn-kxjgg"}
2021-04-21T18:16:44.194Z        DEBUG   controller-runtime.webhook.webhooks     received request        {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "UID": "861c2f84-6aba-4e16-922e-84ae490b0190", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerreplicasets"}}
2021-04-21T18:16:44.195Z        DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "UID": "861c2f84-6aba-4e16-922e-84ae490b0190", "allowed": true, "result": {}, "resultError": "got runtime.Object without object metadata: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:,Message:,Reason:,Details:nil,Code:200,}"}
2021-04-21T18:16:44.196Z        DEBUG   controller-runtime.controller   Successfully Reconciled {"controller": "horizontalrunnerautoscaler-controller", "request": "github-actions/ghe-runner-deployment-autoscaler"}
2021-04-21T18:16:44.197Z        DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Calculated desired replicas of 6      {"horizontalrunnerautoscaler": "github-actions/ghe-runner-deployment-autoscaler", "suggested": 6, "reserved": 0, "min": 2, "cached": 6, "last_scale_up_time": "2021-04-21 18:16:44 +0000 UTC", "scale_down_delay_until": "2021-04-21 18:16:54 +0000 UTC", "max": 20}

job that ran

Current runner version: '2.277.1'
Runner name: 'ghe-global-runner-deployment-rwtwn-kxjgg'
Runner group name: 'k8s'
Machine name: 'ghe-global-runner-deployment-rwtwn-kxjgg'

job that was queued until one of the initial jobs completed and its pod terminated

Current runner version: '2.277.1'
Runner name: 'ghe-global-runner-deployment-rwtwn-kxjgg'
Runner group name: 'k8s'
Machine name: 'ghe-global-runner-deployment-rwtwn-kxjgg'

overview

i've confirmed this for clean installs (via helm for the controller, having deleted and re-created the CRDs manually per existing issue instructions) using both v0.18.2 as well as the canary actions-runner-controller image.

versioning

Name:                   actions-runner-controller
Namespace:              actions-runner-system
CreationTimestamp:      Wed, 14 Apr 2021 17:28:40 -0500
Labels:                 app.kubernetes.io/instance=actions-runner-controller
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=actions-runner-controller
                        app.kubernetes.io/version=0.18.2
                        helm.sh/chart=actions-runner-controller-0.11.0
Annotations:            deployment.kubernetes.io/revision: 17
                        meta.helm.sh/release-name: actions-runner-controller
                        meta.helm.sh/release-namespace: actions-runner-system
Selector:               app.kubernetes.io/instance=actions-runner-controller,app.kubernetes.io/name=actions-runner-controller
## controller
image:
  repository: summerwind/actions-runner-controller
  tag: "v0.18.2"  # also tested w/ the canary tag

i've configured the controller on the enterprise level, as we have many organizations to support, though right now it's only enabled for 2 organizations and a select number of repositories within each. given we can disable rate-limiting, i haven't run into any issues keeping the syncPeriod low on that front.

so far, there haven't been any issues with the runners themselves being recognized on the github side (the workers all appear in the custom k8s group created), and when the pod(s) for a job spin up, they select the appropriate runner deployment per the label in the spec that corresponds to what's referenced in the actions workflow itself.

the problem is that the PercentageRunnersBusy metric i've used to autoscale doesn't seem to be correctly registering the number of queued jobs. in debugging the controller side, i've confirmed there are no errors in the logs themselves. the output indicates that the desired number of replicas has been successfully reconciled; it's just that those values aren't in line with what you would expect from the scaleUpThreshold and scaleUpFactor values that ought to be creating new pods.

controller logs

in this case i kicked off a workflow with 2 jobs. (i've tried this using a range of syncPeriod values from 1s to 1m+). logs consistently showed the following output, despite one of the jobs remaining queued, even once the first job completed.

2021-04-21T16:54:33.911Z        DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Calculated desired replicas of 1      {"horizontalrunnerautoscaler": "github-actions/ghe-runner-deployment-autoscaler", "suggested": 0, "reserved": 0, "min": 1, "cached": 0, "max": 20}

in instances when a job that's part of a larger workflow is queued and no scale up occurs, those queued jobs seem to get lost in some sort of purgatory and remain indefinitely queued until the workflow is canceled by the end user.

here's the outcome of a few other configuration values in the context of the 1 workflow, 4 jobs example:

scale up configs test 1

actions

  • 1 workflow with 4 jobs

configs

  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.45'    # The percentage of busy runners at which the number of desired runners are re-evaluated to scale up
    scaleDownThreshold: '0.3'   # The percentage of busy runners at which the number of desired runners are re-evaluated to scale down
    scaleUpFactor: '2.0'        # The scale up multiplier factor applied to desired count
    scaleDownFactor: '0.5'      # The scale down multiplier factor applied to desired count

outcome

  • 1 job ran successfully
  • 2 pods were created, though no job ever ran on the 2nd pod
  • 3 jobs remained queued

scale up configs test 2

actions

  • 1 workflow with 4 jobs

configs

  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.5'    # The percentage of busy runners at which the number of desired runners are re-evaluated to scale up
    scaleDownThreshold: '0.3'   # The percentage of busy runners at which the number of desired runners are re-evaluated to scale down
    scaleUpFactor: '3.0'        # The scale up multiplier factor applied to desired count
    scaleDownFactor: '0.5'      # The scale down multiplier factor applied to desired count

outcome

  • 2 jobs ran successfully
  • 6 pods were created
  • 2 remaining jobs were indefinitely queued; none of the newly created pods registered that there was a queue, and all shared identical logs in the runner container (below)
  • once the 2 jobs that did run completed, the 2 queued jobs were still not picked up and were subsequently cancelled on the github side; the pods then scaled down to 3
  • on the github side, those 3 workers (and only those 3 workers) showed as 'idle' self-hosted workers (expected behavior)

runner container logs:

Starting Runner listener with startup type: service
Started listener process
Started running service

√ Connected to GitHub

2021-04-21 17:36:32Z: Listening for Jobs

further investigation notes

prior to running the above tests, one discrepancy i did find while investigating was a number of 'offline' workers displaying in the github UI that i hadn't caught. i suspect these were left over from a prior update, where i ran into an issue documented in a few other posts in which a seemingly infinite scale-up occurred. none of these 'ghost' workers showing in the github UI were to be found when checking kubernetes.

despite those 'offline' workers on the github side, the controller logs never reflected that number of workers, so you'd think these purgatory workers wouldn't be a factor in autoscaling in that respect - perhaps they're just an unrelated side-effect of an upgrade. in the instance above where the logs calculated a desired replica count of 1, i found 63 'offline' workers at the enterprise level.

(i confirmed that once i deleted the 63 offline workers, the behavior re: autoscaling and queued jobs was the same. on subsequent scale downs, the terminated pods were being removed as available self-hosted runners on the github side as well, so manually deleting those on the github side seemed to resolve that particular problem)

summary of what i've tried so far

  • clean installs of v0.18.2 with the latest helm charts (and manually deleting/re-creating the new CRDs); i've also tried the canary image on the controller as noted above
  • the manual delete/update finalizer approach per https://github.com/summerwind/actions-runner-controller/issues/418#issuecomment-815037707
  • adjusting the scale up values by orders of magnitude
  • confirmed this is an issue whether a single runner deployment with one label has been applied or two deployments with different labels. the expected behavior holds in that only the deployment with the relevant label is affected, so it doesn't seem to be related to the total number of live pods, as i initially thought it could be some sort of selector problem.

my best guess right now is that there's some sort of discrepancy between how workflows and the jobs that are part of that larger workflow are registering with the controller. i don't know if it could relate to how these webhooks are treated when the runner is set at the enterprise level versus at the organization and/or repository level?

(an additional question i had on that note is whether the github 'webhook' feature can even be enabled on an enterprise level, but perhaps that warrants a separate issue)

thank you for reading this far - please let me know if you need any further info from me!! below are some more k8s specs for reference

`RunnerDeployment` spec
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: ghe-global-runner-deployment
  namespace: github-actions
spec:
  template:
    spec:
      enterprise: xxxxxx
      group: k8s
      labels:
        - basic
      image: summerwind/actions-runner:latest
      imagePullPolicy: Always
      dockerdWithinRunnerContainer: false
      resources:
        limits:
          cpu: "7.0"
          memory: "7Gi"
        requests:
          cpu: "7.0"
          memory: "7Gi"
      dockerdContainerResources:
        limits:
          cpu: "7.0"
          memory: "7Gi"
        requests:
          cpu: "7.0"
          memory: "7Gi"
      tolerations:
        - key: "node.kubernetes.io/unreachable"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 10
`HorizontalRunnerAutoscaler` specs
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: ghe-runner-deployment-autoscaler
  namespace: github-actions
spec:
  scaleDownDelaySecondsAfterScaleOut: 10
  scaleTargetRef:
    name: ghe-global-runner-deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.5'    # The percentage of busy runners at which the number of desired runners are re-evaluated to scale up
    scaleDownThreshold: '0.3'   # The percentage of busy runners at which the number of desired runners are re-evaluated to scale down
    scaleUpFactor: '3.0'        # The scale up multiplier factor applied to desired count
    scaleDownFactor: '0.5'      # The scale down multiplier factor applied to desired count
`ActionsRunnerController`
Name:                   actions-runner-controller
Namespace:              actions-runner-system
CreationTimestamp:      Wed, 14 Apr 2021 17:28:40 -0500
Labels:                 app.kubernetes.io/instance=actions-runner-controller
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=actions-runner-controller
                        app.kubernetes.io/version=0.18.2
                        helm.sh/chart=actions-runner-controller-0.11.0
Annotations:            deployment.kubernetes.io/revision: 17
                        meta.helm.sh/release-name: actions-runner-controller
                        meta.helm.sh/release-namespace: actions-runner-system
Selector:               app.kubernetes.io/instance=actions-runner-controller,app.kubernetes.io/name=actions-runner-controller
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app.kubernetes.io/instance=actions-runner-controller
                    app.kubernetes.io/name=actions-runner-controller
  Annotations:      kubectl.kubernetes.io/restartedAt: 2021-04-20T17:57:51-05:00
  Service Account:  actions-runner-controller
  Containers:
   manager:
    Image:      summerwind/actions-runner-controller:canary
    Port:       9443/TCP
    Host Port:  0/TCP
    Command:
      /manager
    Args:
      --metrics-addr=127.0.0.1:8080
      --enable-leader-election
      --sync-period=10s
      --docker-image=docker:dind
    Environment:
      GITHUB_TOKEN:           <set to the key 'github_token' in secret 'controller-manager'>  Optional: false
      GITHUB_ENTERPRISE_URL:  [redacted]
    Mounts:
      /etc/actions-runner-controller from secret (ro)
      /tmp from tmp (rw)
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
   kube-rbac-proxy:
    Image:      quay.io/brancz/kube-rbac-proxy:v0.8.0
    Port:       8443/TCP
    Host Port:  0/TCP
    Args:
      --secure-listen-address=0.0.0.0:8443
      --upstream=http://127.0.0.1:8080/
      --logtostderr=true
      --v=10
    Environment:  <none>
    Mounts:       <none>
  Volumes:
   secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  controller-manager
    Optional:    false
   cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  webhook-server-cert
    Optional:    false
   tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   actions-runner-controller-7d9467c88 (1/1 replicas created)
Events:          <none>

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 36 (13 by maintainers)

Most upvoted comments

Rather than let you all speculate in this issue, I'll just go ahead and put some rough plans in here. We plan to ship most of this by the end of the summer and we plan to also ship it in GHES after we test it on dotcom.

Stuff that should land in June-ish:

  • Fix job assignment to allow it at multiple levels
  • Change job failure timeouts so that runs won't fail if they don't find an existing runner (online or offline) with a label that matches. This allows you to do on-demand spin up rather than register a "fake" runner.
  • Enable the ephemeral flag to truly guarantee run once jobs. We can land this after the above changes.

Will probably take longer due to figuring out API contracts and performance testing:

  • Add webhooks similar to check runs - queue, running, completed - (except workflow specific) that flow through the org, repo and enterprise levels. This would also include attaching labels to the events and adding a durable job ID so you can reconcile missed events or issues with your scaling solution. Possibly runner group information (but maybe add that later). I think this is exactly what you're talking about above 😄

I"m happy to work with you all to test these out before they are ā€œofficiallyā€ available so you can update the solution. We think this is a lot of good stuff to enable auto-scaling runners. If you have feedback, want to get in touch about implementation or think there are more things we should do in this area you can email me using my handle at github.

just wanted to post here to let you all know we just officially launched ephemeral: https://github.blog/changelog/2021-09-20-github-actions-ephemeral-self-hosted-runners-new-webhooks-for-auto-scaling/

👋 We were able to ship the assignment changes last week, so you should now see jobs get picked up at all levels even if you register a runner after a job was scheduled (including organization level).

Make sure you check your runner group access permissions when you debug this.

We are working on the webhooks as we speak and still expect to ship ephemeral soon (September time frame).

@hross really excited for these changes to land!

I just wanted to highlight https://github.com/actions-runner-controller/actions-runner-controller/issues/642 to you as we've had a number of people request the ability to scale their runner counts up based on label from webhooks. At the moment the payloads don't include the information we would need to consider doing this in the project.

EDIT @hross additionally, apologies for bothering but are you able to tell us if we are on track for those June-ish features? I'm especially interested in this one:

Change job failure timeouts so that runs won't fail if they don't find an existing runner (online or offline) with a label that matches. This allows you to do on-demand spin up rather than register a "fake" runner.

Here's what I've posted in the github community forum 👍🏻 -> https://github.community/t/bug-self-hosted-runners-at-the-enterprise-level-fail-to-detect-queued-jobs/176348

We are planning to ship a fix for this behavior in the next few weeks. Given that it is job assignment logic, though, it might take us some time to test and verify the behavior as we roll it out (breaking job assignment would be pretty bad 😄).

Until the fix lands, the best we can do in terms of autoscaling right now would be to update the HRA to add a CapacityReservation to add runners for a specific period of time. If your workflow runs/jobs are being run on a schedule / timings are predictable, this workaround would work.

Given that the time horizon provided by GitHub for shipping the fix is only a few weeks, and that realistically I think it could take up to a month for it to be fully tested on their end before general rollout, I personally wouldn't want to start introducing code that is only there as a temporary fix unless it added broader long-term value to the project.

i've configured the controller on the enterprise level, as we have many organizations to support

Additionally, correct me if I'm wrong @kathleenfrench, but her main driver for wanting enterprise scaling is that doing it at the organisation level is not realistic in her environment due to the number of organisations that she needs to support. I don't think introducing more flex around how things scale would do much for her as an end user.

Regarding CacheDuration, I'm open to making it more customizable.

I think this is worth doing with the current value as the default. Considering Github Enterprise Server lets the administrator control the rate limiting behaviour, and the solution supports that environment, it would be handy to be able to bring that value right down. Once Github patch their end, being able to lower this value would still be very beneficial in a Github Enterprise Server environment.

@kathleenfrench thanks for the great response as usual.

i wonder if part of github's work on improving their job queue/autoscaling at the enterprise level will include expanding the scope of enterprise-level webhooks as they pertain to running actions. i suspect that's not part of the near-future roadmap, though, as i'd speculate it's a larger undertaking given the necessary changes to the existing API.

Yes I took that to mean they are patching their job assignment logic rather than expanding out the events at the enterprise level. I assume the latter is much more involved and has implications around their infrastructure and so would not be part of the fix they are applying. Hopefully they eventually expand out the events in future though!

as every job spawns its own runner pods, and a single workflow can include so many jobs, when you're running at the enterprise level the scale up factor may have to be higher than it would otherwise be at a repository/org-level, but you wouldn't necessarily want that to be the behavior for jobs running at non-peak traffic hours/non-business hours. that might be a bit too in the weeds as a feature for enterprise users, but it's an idea.

Have you seen the new scheduling feature @mumoshu has managed to put together? Would this cover your need to run a larger number of min replicas during peak hours and then scale back during non-peak/non-business hours? See https://github.com/actions-runner-controller/actions-runner-controller/issues/484 to track the details.

@mumoshu also it's probably worth putting the pinned label on this issue until GitHub confirm they have deployed their fix here https://github.community/t/bug-self-hosted-runners-at-the-enterprise-level-fail-to-detect-queued-jobs/176348

For posterity's sake, I raised this with our Enterprise support

[screenshot of the enterprise support response]

Unfortunately it did not make the cutoff for the 3.2 release cycle and we will ship it in 3.3.

@kathleenfrench @callum-tait-pbx Hey! Sorry for taking so long to respond to this great thread.

I just read https://github.community/t/bug-self-hosted-runners-at-the-enterprise-level-fail-to-detect-queued-jobs/176348 and GitHub is going to fix the issue on their end. Good.

Until the fix lands, the best we can do in terms of autoscaling right now would be to update the HRA to add a CapacityReservation to add runners for a specific period of time. If your workflow runs/jobs are being run on a schedule / timings are predictable, this workaround would work.

Another possible solution would be the "scheduled scaling" feature we're discussing in #484. It would allow you to have more runners in business hours and fewer runners in other hours. The scaling isn't as dynamic as it would be with PercentageRunnersBusy though.

Regarding CacheDuration, I'm open to making it more customizable.

What if we had another flag --github-api-cache-duration to customize it, so that you can freely choose whatever --sync-period and --github-api-cache-duration values you want?
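Something like the following - purely a hypothetical sketch, since --github-api-cache-duration does not exist yet and the defaults shown are just placeholders:

package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	// existing flag: how often the controller re-syncs (10m is the out-of-the-box default)
	syncPeriod := flag.Duration("sync-period", 10*time.Minute, "controller sync period")
	// proposed flag: cache duration for GitHub API responses, chosen independently of --sync-period
	apiCacheDuration := flag.Duration("github-api-cache-duration", 30*time.Second, "GitHub API cache duration")
	flag.Parse()

	fmt.Printf("sync-period=%s github-api-cache-duration=%s\n", *syncPeriod, *apiCacheDuration)
}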

it seems like there must be some sort of message queue retry or caching problem on the job router side of things, which is odd because github's own API accurately reflects the state of available, online runners while these jobs remain stuck in the queue.

It's a bit of a weird one isn't it!

as for the CacheDuration change I noted above, shaving off those extra seconds probably wouldn't be utilized by a great deal of end-users, but perhaps offering a way to set that subtracted value instead of the 10s default would be worthwhile? or at the very least, ensuring that if the syncPeriod is lower than the default CacheDuration that it won't return a negative value (which, as it currently stands, results in falling back to a 10m cache interval).

Potentially we could just remove the subtraction entirely, check that the value provided is >= 1s, and default to the fail-safe 10m figure if the input is bad or someone tries to do a sub-second sync period. The solution has improved greatly since its first inception: originally the pull-based TotalNumberOfQueuedAndInProgressWorkflowRuns metric was the only means for scaling, and that metric requires a lot of API calls to maintain, so it wasn't really possible to run a very low sync period. Since then the PercentageRunnersBusy metric has been added, which uses far fewer API calls to maintain, as well as a webhook option which doesn't rely on lots of API calls to begin with. We've also added the ability to run multiple controllers since then too, to allow very large scale setups without getting rate limited.
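To make that concrete, here is a minimal sketch of that first option - my own illustration, not the controller's actual code:

package main

import (
	"fmt"
	"time"
)

// cacheDurationFor drops the subtraction entirely: any sync period of at least 1s is
// used as the cache duration, and anything lower falls back to the 10m fail-safe.
func cacheDurationFor(syncPeriod time.Duration) time.Duration {
	const failSafe = 10 * time.Minute
	if syncPeriod < time.Second {
		fmt.Printf("sync period %s is below 1s, falling back to the %s fail-safe\n", syncPeriod, failSafe)
		return failSafe
	}
	return syncPeriod
}

func main() {
	fmt.Println(cacheDurationFor(10 * time.Second))       // 10s
	fmt.Println(cacheDurationFor(500 * time.Millisecond)) // 10m0s
}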

The default 10 minutes for a sync period was chosen just to stop someone rate limiting themselves out of the box; there isn't anything particular about why it's 10 minutes beyond that reason.

Alternatively, if we feel 1s is too low and you are extremely likely to rate limit yourself, then we could keep the check at the 10s threshold (or maybe only lower it to 5s) and, if the sync period provided is below that threshold, just default to the lowest accepted value instead of the fail-safe and spit out some log messages so it is debuggable. The logic being that the end user has actively chosen a short sync period, so we are assuming they understand they may get rate limited.

Obviously with you being on the server edition, which lets you administer your rate limit configuration, none of this is relevant for your environment! 😄

cc @mumoshu what are your thoughts?

@callum-tait-pbx I didn't catch that the Runner could be set to enterprise, sorry I didn't realize that's what you meant!

no problem!

just tried that and if I add a Runner (enterprise-level) it appears to suffer from the same problem in that only those workflows/jobs immediately allocated to a runner get picked up, but they do work in that case.

Sounds like it works! The Runner kind is for deploying a single runner rather than sets. It doesn't support being scaled by a HorizontalRunnerAutoscaler. I'm not sure how much use it is at an enterprise level really tbh, but I was pretty sure it worked - I just needed someone to test it for me 😄. With that confirmed the docs can read and flow better.

https://github.com/actions/runner/issues/1059 interesting, it looks like the problem lies with the runner routing / queueing service on github's end.

For the moment then it looks like enterprise-level autoscaling isn't possible until GitHub improve their queueing / routing service, lame. If they do end up improving their service, the code as it stands may already be able to take advantage of the feature as is.

@kathleenfrench if you could try deploying a Runner kind and see if it can be consumed by repositories in organisations that would be great! We can then update the docs with all this great information you've managed to discover!

In the meantime if you raise something on https://github.community/c/code-to-cloud/github-actions/41 as advised by github we can link it to an issue here so we can keep track of it.