actions-runner-controller: Metric TotalNumberOfQueuedAndInProgressWorkflowRuns doesn't seem to line up with Current workflow/job counts
Describe the bug
We are still battling the ‘cancellation’ bug in our organization, and frequently see pod reconciles at times when they don’t make sense. Scale-up typically works (though since 0.20.1 it has been scaling up very slowly, if it even reaches the number of runners needed; more in the details below), but scale-down happens prematurely and cancels active workflow jobs. I have tested 0.20.4, but cancellations occur even more often with that release, and it also gets capped at 24 runners for some odd reason. I started monitoring the suggested replicas and am seeing a lot of oddities that I have not yet been able to map to any specific part of the code.
Example Log from Controller:
2022-01-18T19:15:05.548Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Suggested desired replicas of 8 by TotalNumberOfQueuedAndInProgressWorkflowRuns {"workflow_runs_completed": 0, "workflow_runs_in_progress": 8, "workflow_runs_queued": 0, "workflow_runs_unknown": 0, "namespace": "default", "kind": "runnerdeployment", "name": "runners", "horizontal_runner_autoscaler": "runners-autoscaler"}
Example from a CLI tool written with the same code to grab workflow runs/jobs:
./airbin/checkctl github list_jobs
2022/01/18 13:15:08 [DEBUG] Total Count: 1 for Perform release (queued)
-----workflow stats for 1713612691-----
Jobs Completed: 65
Jobs Queued: 18
Jobs In Progress: 46
Jobs Unknown: 0
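For comparison, a job-level count for the same run can be pulled directly from the GitHub REST API with the gh CLI. This is a minimal sketch, assuming the run id 1713612691 from the output above and the myOrganization/my_repository names from the manifests below; substitute your own values.

# Count jobs per status for one workflow run
RUN_ID=1713612691
gh api --paginate "/repos/myOrganization/my_repository/actions/runs/${RUN_ID}/jobs?per_page=100" \
  --jq '.jobs[].status' | sort | uniq -c
# prints one count per job status (completed / in_progress / queued)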
To Reproduce
Steps to reproduce the behavior:
- Use TotalNumberOfQueuedAndInProgressWorkflowRuns
- Use deployment from screenshots below
- Launch a workflow with 100+ jobs (in a matrix if possible); see the sketch after this list
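For the last step, a workflow along the following lines produces 100+ jobs inside a single run. This is only a sketch, not the workflow from the report: the workflow name and sleep duration are made up, and the runs-on labels are taken from the RunnerDeployment manifest below.

name: reproduce-large-matrix
on: workflow_dispatch
jobs:
  load:
    # 10 x 12 = 120 jobs inside one workflow run
    strategy:
      matrix:
        group: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
        item: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
    runs-on: [self-hosted, linux, eks, java]
    steps:
      - run: sleep 300 # keep each runner busy long enough to observe scale-up and scale-down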
Expected behavior
The suggested replica count should line up with the current workflow/job counts observed via the GitHub API.
Screenshots
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  namespace: default
  name: runners
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      dockerdWithinRunnerContainer: true
      organization: myOrganization
      nodeSelector:
        com.example/capacity-type: SPOT
        com.example/purpose: github-runners
        com.example/environment: ops
      labels:
        - java
        - linux
        - eks
        - self-hosted
      resources:
        requests:
          cpu: "1.0"
          memory: "10Gi"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  namespace: default
  name: runners-autoscaler
spec:
  scaleTargetRef:
    name: runners
  scaleDownDelaySecondsAfterScaleOut: 7200
  minReplicas: 1
  maxReplicas: 150
  metrics:
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns
      repositoryNames:
        - my_repository
Environment (please complete the following information):
- Controller Version: 0.20.1 → 0.20.4
- Deployment Method: Helm
Additional context
This has been an ongoing issue, but since the upgrade to the 0.20.x branch I believe it has gotten worse. I am not sure whether the problem is in the GitHub API or in the calculation that determines the suggested replicas. Either way, is there anything I am missing here? I don’t understand why my own GitHub API calls from the command line return such different results from the controller autoscaler’s suggested replicas.
scaleDownDelaySecondsAfterScaleOut: 7200
was put in to counter some of the unnecessary kills. It isn’t working well either, though: after a scale-up the runners stay at scale, and eventually runners still get interrupted (killed by the controller) in the middle of tests.
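One way to narrow down where the gap comes from is to compare run-level counts (the granularity the metric name and the workflow_runs_* log fields suggest) against the job-level counts shown earlier: the controller logged 8 in-progress workflow runs while checkctl counted 46 in-progress jobs inside a single run. A sketch of the run-level check, again using the placeholder organization/repository names:

# Run-level statuses across the repository; compare with the per-run job counts above
gh api --paginate "/repos/myOrganization/my_repository/actions/runs?per_page=100" \
  --jq '.workflow_runs[].status' | sort | uniq -c

If those numbers differ by an order of magnitude, the two tools are most likely counting at different granularities rather than disagreeing about the same quantity.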
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 27 (1 by maintainers)
Commits related to this issue
- doc: Enhance troubleshooting guide with the scale-to-zero issue Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1057#issuecomment-1133439061 — committed to actions/actions-runner-controller by mumoshu 2 years ago
- doc: enhance troubleshooting guide with the scale-to-zero issue (#1469) Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1057#issuecomment-1133439061 — committed to actions/actions-runner-controller by mumoshu 2 years ago
Thank you @mumoshu! I have indeed checked thoroughly, paginating through all of the pages, and there was one dangling workflow run. I deleted it and:
Thank you again! 🙏
Hello @mumoshu, please hold this one, it’s really unclear.
Bumped into this as well on v0.23.0 (org-wide). I have specifically validated with
gh api /repos/{owner}/{repo}/actions/runs
that all runs are in status == completed; however, TotalNumberOfQueuedAndInProgressWorkflowRuns is showing a false number of desired pods!

This was resolved because the fix was in master but not in the latest ‘release’, so when I built an image off of master it worked as expected.
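For anyone hitting the same symptom, a paginated query like the following (a sketch; {owner} and {repo} are the gh placeholder tokens from the command above, filled in automatically when run inside a repository checkout) lists any runs that are not yet completed, which is how a single dangling run like the one mentioned earlier in this thread can be spotted:

# List runs that are still queued or in_progress, across all pages
gh api --paginate "/repos/{owner}/{repo}/actions/runs?per_page=100" \
  --jq '.workflow_runs[] | select(.status != "completed") | "\(.id) \(.status) \(.name)"'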
The weird thing about this (and I can certainly jump over to that issue and comment, given that the original title of this issue is no longer the actual problem) is that there was no further log about
registrationDidTimeout
- so theoretically a pod whose busy status can’t be checked should never have been thrown into the deletion-candidates pile, right?