actions-runner-controller: Metric TotalNumberOfQueuedAndInProgressWorkflowRuns doesn't seem to line up with Current workflow/job counts
Describe the bug
We are still battling the ‘cancellation’ bug in our organization, and frequently see pod reconciles at times when they don’t make sense. Scale-up typically works (though since 0.20.1 it has been scaling up very slowly, if it even reaches the number of runners needed; more in the details below), but scale-down happens prematurely and cancels active workflow jobs. I have tested 0.20.4, but cancellations occur even more often with that release, and it also gets capped at 24 runners for some odd reason. I started monitoring the suggested replicas and am seeing a lot of oddities that I have not yet been able to map to any specific part of the code.
Example Log from Controller:
2022-01-18T19:15:05.548Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Suggested desired replicas of 8 by TotalNumberOfQueuedAndInProgressWorkflowRuns {"workflow_runs_completed": 0, "workflow_runs_in_progress": 8, "workflow_runs_queued": 0, "workflow_runs_unknown": 0, "namespace": "default", "kind": "runnerdeployment", "name": "runners", "horizontal_runner_autoscaler": "runners-autoscaler"}
Example from a CLI tool written with the same code to grab workflow runs/jobs:
./airbin/checkctl github list_jobs
2022/01/18 13:15:08 [DEBUG] Total Count: 1 for Perform release (queued)
-----workflow stats for 1713612691-----
Jobs Completed: 65
Jobs Queued: 18
Jobs In Progress: 46
Jobs Unknown: 0
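For comparison, a job-level count for the same run can be pulled directly from the GitHub REST API with the gh CLI. This is a minimal sketch, assuming the run id 1713612691 from the output above and the myOrganization/my_repository names from the manifests below; substitute your own values.

# Count jobs per status for one workflow run
RUN_ID=1713612691
gh api --paginate "/repos/myOrganization/my_repository/actions/runs/${RUN_ID}/jobs?per_page=100" \
  --jq '.jobs[].status' | sort | uniq -c
# prints one count per job status (completed / in_progress / queued)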
To Reproduce
Steps to reproduce the behavior:
- Use TotalNumberOfQueuedAndInProgressWorkflowRuns
- Use deployment from screenshots below
- Launch a workflow with 100+ jobs (in a matrix if possible); see the sketch after this list
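For the last step, a workflow along the following lines produces 100+ jobs inside a single run. This is only a sketch, not the workflow from the report: the workflow name and sleep duration are made up, and the runs-on labels are taken from the RunnerDeployment manifest below.

name: reproduce-large-matrix
on: workflow_dispatch
jobs:
  load:
    # 10 x 12 = 120 jobs inside one workflow run
    strategy:
      matrix:
        group: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
        item: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
    runs-on: [self-hosted, linux, eks, java]
    steps:
      - run: sleep 300 # keep each runner busy long enough to observe scale-up and scale-down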
Expected behavior
The suggested replica count should line up with the current workflow/job counts observed via the GitHub API.
Screenshots
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  namespace: default
  name: runners
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      dockerdWithinRunnerContainer: true
      organization: myOrganization
      nodeSelector:
        com.example/capacity-type: SPOT
        com.example/purpose: github-runners
        com.example/environment: ops
      labels:
        - java
        - linux
        - eks
        - self-hosted
      resources:
        requests:
          cpu: "1.0"
          memory: "10Gi"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  namespace: default
  name: runners-autoscaler
spec:
  scaleTargetRef:
    name: runners
  scaleDownDelaySecondsAfterScaleOut: 7200
  minReplicas: 1
  maxReplicas: 150
  metrics:
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - my_repository
Environment (please complete the following information):
- Controller Version: 0.20.1 → 0.20.4
- Deployment Method: Helm
Additional context
This has been an ongoing issue, but since the upgrade to the 0.20.x branch I believe it has gotten worse. I am not sure whether the problem is in the GitHub API or in the calculation that determines the suggested replicas. Either way, is there anything I am missing here? I don’t understand why my own GitHub API calls from the command line return such different results from the controller autoscaler’s suggested replicas.
scaleDownDelaySecondsAfterScaleOut: 7200
was put in to counter some of the unnecessary kills. It isn’t working well either, though: after a scale-up the runners stay at scale, and eventually runners still get interrupted (killed by the controller) in the middle of tests.
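One way to narrow down where the gap comes from is to compare run-level counts (the granularity the metric name and the workflow_runs_* log fields suggest) against the job-level counts shown earlier: the controller logged 8 in-progress workflow runs while checkctl counted 46 in-progress jobs inside a single run. A sketch of the run-level check, again using the placeholder organization/repository names:

# Run-level statuses across the repository; compare with the per-run job counts above
gh api --paginate "/repos/myOrganization/my_repository/actions/runs?per_page=100" \
  --jq '.workflow_runs[].status' | sort | uniq -c

If those numbers differ by an order of magnitude, the two tools are most likely counting at different granularities rather than disagreeing about the same quantity.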
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 27 (1 by maintainers)
Commits related to this issue
- doc: Enhance troubleshooting guide with the scale-to-zero issue Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1057#issuecomment-1133439061 — committed to actions/actions-runner-controller by mumoshu 2 years ago
- doc: enhance troubleshooting guide with the scale-to-zero issue (#1469) Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1057#issuecomment-1133439061 — committed to actions/actions-runner-controller by mumoshu 2 years ago
Thank you @mumoshu! I have indeed checked thoroughly, paginating through all of the pages, and there was one dangling workflow run. I deleted it and:
Thank you again! 🙏
Hello @mumoshu, please hold this one, it’s really unclear.
Bumped into this as well on v0.23.0 (org-wide). I have specifically validated with
gh api /repos/{owner}/{repo}/actions/runs
that all runs are in status == completed; however, TotalNumberOfQueuedAndInProgressWorkflowRuns is showing a false number of desired pods!

This was resolved because the fix was in master but not in the latest ‘release’, so when I built an image off of master it worked as expected.
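For anyone hitting the same symptom, a paginated query like the following (a sketch; {owner} and {repo} are the gh placeholder tokens from the command above, filled in automatically when run inside a repository checkout) lists any runs that are not yet completed, which is how a single dangling run like the one mentioned earlier in this thread can be spotted:

# List runs that are still queued or in_progress, across all pages
gh api --paginate "/repos/{owner}/{repo}/actions/runs?per_page=100" \
  --jq '.workflow_runs[] | select(.status != "completed") | "\(.id) \(.status) \(.name)"'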
The weird thing about this (and I can certainly jump over to that issue and comment, given that the original title of this issue is no longer the actual problem) is that there was no further log about
registrationDidTimeout
- so theoretically a pod whose busy status can’t be checked should never have been thrown into the deletion-candidates pile, right?