actions-runner-controller: Pod stuck in NotReady state - RestartPolicy OnFailure?

Describe the bug The runner completes a job and exits with code 0, then the pod enters the NotReady state. In the meantime GitHub Actions has allocated a job for this runner to pick up. The pod gets cleaned up after some time. The job never gets picked up, doesn't get scheduled elsewhere, and just ends up in a failed state.

Checks

  • [ x ] My actions-runner-controller version (v0.x.y) does support the feature
  • [ x ] I’m using an unreleased version of the controller I built from HEAD of the default branch: actions-runner-controller is yesterday’s master (7156ce04), with image: summerwind/actions-runner:latest as the runner

To Reproduce This behaviour might be specific to GKE. We’re currently running Kubernetes 1.21.6, and the scaling setup is webhook-driven only.

  1. Create a RunnerDeployment with ephemeral: true (see the sketch after this list)
  2. Run jobs
  3. At about the time the jobs from (2.) complete, schedule more jobs.
  4. Sometimes a pod will linger behind in the NotReady state, with a job being queued from GitHub Actions.
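For reference, a minimal sketch of the setup described above. The names, replica counts, and trigger values are illustrative placeholders, not my exact manifests:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeployment
spec:
  template:
    spec:
      repository: owner/repo    # placeholder repository
      ephemeral: true           # runner deregisters and exits after a single job
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runnerdeployment-autoscaler
spec:
  scaleTargetRef:
    name: example-runnerdeployment
  minReplicas: 0
  maxReplicas: 10
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}         # scale up on workflow_job webhook events
      duration: "30m"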

Expected behavior The pod should restart and pick up the next job if it has one allocated. It does so if I intervene manually (delete the pod): the job gets picked up. So I think the only needed change is RestartPolicy -> Always.

Screenshots I’m sorry, the evidence disappeared from my screens because I’m currently running a fork. If it’s really needed I can reproduce this quite easily.

So currently the RestartPolicy is OnFailure, which triggers this behaviour. Most of the time pods get terminated on completion, I think.

I think the right value should be Always (I’m currently running a fork from yesterday’s master with this being the only change).

Is there a reason for it not to be Always for everyone at all times? Here Kubernetes doesn’t restart the container because the exit code is 0, so the pod stays NotReady. RestartPolicy: Always simply restarts it and fixes the problem.
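To illustrate the proposed change, this is roughly what the relevant part of the runner pod spec would look like with the default flipped. It is a hand-written sketch of the fields involved, not the controller’s actual pod template:

apiVersion: v1
kind: Pod
metadata:
  name: example-runner-pod
spec:
  restartPolicy: Always    # proposed default; the current OnFailure policy
                           # skips restarts when the container exits with code 0,
                           # leaving the pod NotReady
  containers:
    - name: runner
      image: summerwind/actions-runner:latest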

I started a PR to make it configurable (there appear to be incomplete bits for this already), but I made a mistake somewhere. So before I complete that work, would it not be better to change the default?

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 15 (5 by maintainers)

Most upvoted comments

FYI, I also faced this issue while using v0.22.0 on AWS EKS. When I switched to v0.21.0, I did not face this issue.

In my case, the runner completes a job and exits with code 0, then the pod enters the NotReady state and GitHub removes it from the self-hosted runner list. There was only one job; no other job was allotted to the runner after the first one, which completed successfully.

I followed the exact steps from the README.md:

  • Installed cert-manager using kubectl
  • Deployed the CRDs and actions-runner-controller with kubectl
  • Used a GitHub PAT, stored as a Kubernetes secret via kubectl, for authentication (see the sketch below)
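For completeness, the PAT step amounts to a secret like the following in the controller’s namespace. This is a sketch of the declarative equivalent of the README’s kubectl create secret command; the name, namespace, and key are the ones the docs use, and the token value is a placeholder:

apiVersion: v1
kind: Secret
metadata:
  name: controller-manager
  namespace: actions-runner-system
type: Opaque
stringData:
  github_token: <personal-access-token>   # PAT with repo scope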

The following is the runner.yaml that I used:

apiVersion: actions.summerwind.dev/v1alpha1
kind: Runner
metadata:
  name: aws-eks-runner
spec:
  repository: some-github-repo
  env: []

I’m not running #1127 or #1167 yet, and with holidays coming up I don’t want to try them before I leave. I’ll be back on the 21st of March and will then give the latest and greatest master (or perhaps a release) a go.

In the meantime I’m happy to close this issue; I wasn’t even sure it was the right place to begin with. It’s related specifically to a master commit/state and might not really be relevant for any(?) tagged release.