actions-runner-controller: Runner container does not restart even if the job ends normally
Hi! Thank you, as always, for your Kubernetes operator!!
I have a question: even when an Actions job finishes successfully, certain pods get stuck and the runner container is never restarted…
RunnerDeployment manifest:
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: runner
  namespace: runner
spec:
  replicas: 3
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: beta.kubernetes.io/instance-type
                operator: In
                values:
                - c4.2xlarge
                - c5.2xlarge
      envFrom:
      - configMapRef:
          name: ro-actions-runner
      - secretRef:
          name: ro-actions-runner
      image: <aws-id>.dkr.ecr.ap-northeast-1.amazonaws.com/custom-runner:1.15
      labels:
      - ro
      repository: <repository>
runner’s status:
😀 ❯❯❯ k get pod -n runner
NAME                    READY   STATUS    RESTARTS   AGE
ro-runner-52zmh-4tcrt   2/2     Running   0          47m
ro-runner-52zmh-vjb2f   2/2     Running   0          136m
ro-runner-52zmh-wxt6k   1/2     Running   0          124m   <<< this runner pod
😀 ❯❯❯ k describe pod -n runner ro-runner-52zmh-wxt6k
Name:           ro-runner-52zmh-wxt6k
Namespace:      runner
Priority:       0
Node:           ip-10-0-56-65.ap-northeast-1.compute.internal/10.0.56.65
Start Time:     Thu, 30 Jul 2020 16:44:09 +0900
Labels:         runner-template-hash=847bc54779
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Running
IP:             10.0.59.171
Controlled By:  Runner/ro-runner-52zmh-wxt6k
Containers:
  runner:
    Container ID:  docker://19080ec1adc8822bfaf916c43797e7171a1c5ff5b1666b0df2485cd9c19db976
    Image:         <aws-id>.dkr.ecr.ap-northeast-1.amazonaws.com/custom-runner:1.15
    Image ID:      docker-pullable://<aws-id>.dkr.ecr.ap-northeast-1.amazonaws.com/custom-runner@sha256:bb88570bc0bedc6c8b8321887e9b6e693c0fb6542aba83e3a839949338f99b73
    Port:          <none>
    Host Port:     <none>
    State:         Terminated
      Reason:      Completed   <<< finished normally
      Exit Code:   0           <<< exit status 0
      Started:     Thu, 30 Jul 2020 16:44:10 +0900
      Finished:    Thu, 30 Jul 2020 17:05:52 +0900
    Ready:         False
My understanding is that the runner container should be restarted when an Actions job finishes running normally, via the following processing. Is that correct?
runner’s status
😱 ❯❯❯ k get pod -n runner ro-runner-52zmh-wxt6k -o json | jq -r ".status.containerStatuses[].state"
{
  "running": {
    "startedAt": "2020-07-30T07:44:10Z"
  }
}
{
  "terminated": {
    "containerID": "docker://19080ec1adc8822bfaf916c43797e7171a1c5ff5b1666b0df2485cd9c19db976",
    "exitCode": 0,
    "finishedAt": "2020-07-30T08:05:52Z",
    "reason": "Completed",
    "startedAt": "2020-07-30T07:44:10Z"
  }
}
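For context, the kind of processing I have in mind is roughly the check below: the controller notices that the runner container has terminated with exit code 0 and recreates the pod. This is only an illustrative sketch, not the controller's actual code; runnerFinished and the hard-coded "runner" container name are my own assumptions.

// Illustrative sketch only: detect a runner container that exited with
// code 0, meaning the job completed normally and the pod should be
// recreated. Names here are hypothetical, not actions-runner-controller code.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// Assumes the runner container is literally named "runner", as in the
// `kubectl describe` output above.
const runnerContainerName = "runner"

// runnerFinished reports whether the runner container in the pod has
// terminated with exit code 0.
func runnerFinished(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name != runnerContainerName {
			continue
		}
		if t := cs.State.Terminated; t != nil && t.ExitCode == 0 {
			return true
		}
	}
	return false
}

func main() {
	// In the real controller this check would run inside Reconcile and be
	// followed by deleting the pod so a fresh runner gets created.
	fmt.Println(runnerFinished(&corev1.Pod{}))
}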
Thank you!
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 16 (12 by maintainers)
Commits related to this issue
- EphemeralRunner: finalize with container hook cleanup based on labels (#77) Co-authored-by: Tingluo Huang <tingluohuang@github.com> — committed to actions/actions-runner-controller by nikola-jokic a year ago
The error is here: https://github.com/summerwind/actions-runner-controller/blob/ee8fb5a3886ef5a75df5c126bcd3c846e13c801e/github/github.go#L87
The token is not refreshed within a 10-minute interval: the old token is returned, and the reconcile function returns with Requeue: true after the token update function is called. Reducing this 10-minute interval should fix the problem. @mumoshu
Since token expiration is not a huge deal (the runner can still make calls even if the registration token has expired), I suggest we change this function to:
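A minimal sketch of the direction I mean, assuming the client caches a registration token together with its expiry timestamp; the Client and cachedToken types and the createRegistrationToken call below are placeholders, not the real github.go API:

// Hypothetical sketch: reuse the cached registration token only while it is
// comfortably far from expiry, and otherwise fetch a fresh one. The types and
// the createRegistrationToken call stand in for the real GitHub API call.
package registration

import (
	"context"
	"sync"
	"time"
)

// refreshMargin replaces the fixed 10-minute window discussed above; a smaller
// margin means a cached token is handed out for less time before a refresh.
const refreshMargin = 3 * time.Minute

type cachedToken struct {
	Token     string
	ExpiresAt time.Time
}

type Client struct {
	mu                      sync.Mutex
	tokens                  map[string]cachedToken // keyed by "owner/repo"
	createRegistrationToken func(ctx context.Context, repo string) (cachedToken, error)
}

// GetRegistrationToken returns the cached token while it is still fresh
// enough, and refreshes it via the API otherwise.
func (c *Client) GetRegistrationToken(ctx context.Context, repo string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if t, ok := c.tokens[repo]; ok && time.Now().Before(t.ExpiresAt.Add(-refreshMargin)) {
		return t.Token, nil
	}

	t, err := c.createRegistrationToken(ctx, repo)
	if err != nil {
		return "", err
	}
	if c.tokens == nil {
		c.tokens = map[string]cachedToken{}
	}
	c.tokens[repo] = t
	return t.Token, nil
}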
The top-right graph is a sample of stopped Docker containers; this also seems to align with workqueue_depth dropping to zero. The sync period is set to 5-minute intervals, as seen from the spacing between emitted metrics. Also note the spike in workqueue_total_retries during the period of stopped containers.
Since the stopped runner is eventually removed and recreated, could this be because none of our controllers detects the stopped container until the next resync period?
Generally speaking, any controller that creates pods should “watch” events from the pod, and enqueue a reconciliation on the parent “runner” resource on every pod event. Maybe our runner-controller isn’t working like that?
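For what it's worth, with controller-runtime that pattern usually looks like the sketch below: registering the pod type with Owns() so every pod event enqueues the owning Runner. The import path, the v1alpha1.Runner type, and the Reconcile stub are assumptions about the project layout, not a patch against the actual controller.

// Sketch: wire pod events back to the parent Runner resource. Owns() makes
// every event on a pod whose owner reference points at a Runner enqueue a
// reconcile for that Runner, so a terminated runner container is noticed
// immediately instead of waiting for the next resync period.
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"

	"github.com/summerwind/actions-runner-controller/api/v1alpha1"
)

type RunnerReconciler struct {
	// client, scheme, logger, etc. omitted for brevity
}

func (r *RunnerReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Inspect the Runner and its pod here; delete the pod once the runner
	// container has finished so a fresh one gets created.
	return ctrl.Result{}, nil
}

func (r *RunnerReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.Runner{}). // reconcile Runner objects...
		Owns(&corev1.Pod{}).     // ...and on every event from pods they own
		Complete(r)
}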