actions-runner-controller: Pod stuck in NotReady state - RestartPolicy OnFailure?
Describe the bug
The runner completes a job and exits with code 0, then the pod enters the NotReady state.
In the meantime GH Actions has allocated a job for this worker to pick up.
The pod gets cleaned up after some time.
The job never gets picked up, doesn't get scheduled elsewhere, and just enters a failed state.
Checks
- [x] My actions-runner-controller version (v0.x.y) does support the feature
- [x] I'm using an unreleased version of the controller I built from HEAD of the default branch
actions-runner-controller is yesterday's master (7156ce04), with `summerwind/actions-runner:latest` as the runner image.
To Reproduce
This behaviour might be specific to GKE, perhaps? We're currently running Kubernetes 1.21.6.
The scaling setup is webhook-driven only.
1. Create a RunnerDeployment with `ephemeral: true` (a minimal manifest sketch follows after this list).
2. Run jobs.
3. At about the time the jobs from step 2 complete, schedule more jobs.
4. Sometimes a pod will linger behind in the `NotReady` state, with a job being queued from GH Actions.
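For context, here is a minimal sketch of the kind of setup described above (the resource names, repository, replica counts, and trigger duration are placeholders and assumptions, not taken from the original report):

```yaml
# Sketch only: placeholder names and values, not the reporter's actual manifests.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  template:
    spec:
      repository: example-org/example-repo   # placeholder
      ephemeral: true
---
# Webhook-driven scaling ("webhooks only"): scale up when a workflow_job event arrives.
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runnerdeploy-autoscaler
spec:
  scaleTargetRef:
    name: example-runnerdeploy
  minReplicas: 0
  maxReplicas: 10
  scaleUpTriggers:
  - githubEvent:
      workflowJob: {}
    duration: "30m"
```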
Expected behavior
The pod should restart and pick up the next job if one has been allocated to it.
And it does: if I intervene manually and delete the pod, the job gets picked up.
So I think the only change needed is `RestartPolicy` -> `Always`.
Screenshots
I'm sorry, the evidence disappeared from my screens; I'm currently running a fork. If it's really needed I can produce this quite easily.
So currently the `RestartPolicy` is `OnFailure`, which triggers this behaviour.
Most of the time pods get terminated on completion, I think.
I think the right value should be `Always` (I'm currently running a fork of yesterday's master with this being the only change).
Is there a reason for it not to be `Always` for everyone at all times?
Here Kubernetes doesn't appear to restart the pod because the exit code is 0. `RestartPolicy: Always` simply restarts it, which fixes the problem.
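For reference, a minimal plain-pod illustration of the Kubernetes semantics involved (this is not the pod spec the controller actually generates, and the pod name is made up): with `OnFailure` a container that exits 0 stays down and the pod sits in NotReady, while with `Always` the kubelet restarts it regardless of exit code.

```yaml
# Illustration of restartPolicy semantics only; not actions-runner-controller's
# generated pod spec.
apiVersion: v1
kind: Pod
metadata:
  name: runner-restart-demo        # hypothetical name
spec:
  # OnFailure: a container that exits 0 is left terminated (pod shows NotReady).
  # Always: the kubelet restarts the container even after a clean exit 0.
  restartPolicy: Always
  containers:
  - name: runner
    image: summerwind/actions-runner:latest
```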
I started a PR to make this configurable (there appear to be incomplete bits for this already), but I made a mistake somewhere. So before I complete that work, would it not be better to change the default?
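Purely as a hypothetical sketch of what "configurable" could look like on the runner spec (the `restartPolicy` field below is an assumption for illustration; it is not an existing RunnerDeployment field, and the actual PR may wire this up differently):

```yaml
# Hypothetical only: restartPolicy is not an existing RunnerDeployment field here;
# this just illustrates exposing it as a per-deployment option.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  template:
    spec:
      repository: example-org/example-repo   # placeholder
      ephemeral: true
      restartPolicy: Always                  # hypothetical field; today's behaviour is OnFailure
```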
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 15 (5 by maintainers)
FYI, I also faced this issue while using v0.22.0 on AWS EKS. When I switched to v0.21.0, I did not face it.
In my case, the runner completes a job and exits with code 0, then the pod enters the NotReady state and GitHub removes it from the self-hosted runner list. There was only one job, and no other job was allotted to the runner after that first job, which executed successfully.
I followed the exact steps from the README.md. Following is the `runner.yaml` that I used.

I'm not running #1127 or #1167 yet, and with the holidays coming up I don't want to try them beforehand. I'll be back on the 21st of March and then I'll give the latest and greatest master (or perhaps a release) a go.
In the meantime I'm happy to close this issue; I wasn't even sure it was the right place to begin with. It relates specifically to a master commit/state and might not really be relevant to any(?) tagged release.