actions-runner-controller: Runner container does not restart even if the job ends normally

Hi! Thank you, as always, for your Kubernetes operator!

I have a question. Even when an Actions job finishes successfully, certain pods get stuck and the runner container is never restarted…

RunnerDeployment manifest:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: runner
  namespace: runner
spec:
  replicas: 3
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: beta.kubernetes.io/instance-type
                operator: In
                values:
                - c4.2xlarge
                - c5.2xlarge
      envFrom:
      - configMapRef:
          name: ro-actions-runner
      - secretRef:
          name: ro-actions-runner
      image: <aws-id>.dkr.ecr.ap-northeast-1.amazonaws.com/custom-runner:1.15
      labels:
      - ro
      repository: <repository>

runner’s status:

😀 ❯❯❯ k get pod -n runner
NAME                              READY   STATUS    RESTARTS   AGE
ro-runner-52zmh-4tcrt   2/2     Running   0          47m
ro-runner-52zmh-vjb2f   2/2     Running   0          136m
ro-runner-52zmh-wxt6k   1/2     Running   0          124m <<< this runner pod

😀 ❯❯❯ k describe pod -n runner ro-runner-52zmh-wxt6k
Name:           ro-runner-52zmh-wxt6k
Namespace:      runner
Priority:       0
Node:           ip-10-0-56-65.ap-northeast-1.compute.internal/10.0.56.65
Start Time:     Thu, 30 Jul 2020 16:44:09 +0900
Labels:         runner-template-hash=847bc54779
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Running
IP:             10.0.59.171
Controlled By:  Runner/ro-runner-52zmh-wxt6k
Containers:
  runner:
    Container ID:   docker://19080ec1adc8822bfaf916c43797e7171a1c5ff5b1666b0df2485cd9c19db976
    Image:          <aws-id>.dkr.ecr.ap-northeast-1.amazonaws.com/custom-runner:1.15
    Image ID:       docker-pullable://<aws-id>.dkr.ecr.ap-northeast-1.amazonaws.com/custom-runner@sha256:bb88570bc0bedc6c8b8321887e9b6e693c0fb6542aba83e3a839949338f99b73
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed <<< normally finish
      Exit Code:    0 <<< exit status 0
      Started:      Thu, 30 Jul 2020 16:44:10 +0900
      Finished:     Thu, 30 Jul 2020 17:05:52 +0900
    Ready:          False

I believe the runner container is supposed to be restarted when an Actions job finishes normally, via the following code. Is that correct?

https://github.com/summerwind/actions-runner-controller/blob/ba8f61141b30268a00387795a66abdd72b60c78c/controllers/runner_controller.go#L194-L196
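To spell out my mental model (a rough sketch only; this helper and its names are my illustration, not the controller's actual code): if the "runner" container reports a terminated state with exit code 0, the controller should notice it and recreate the pod.

// Sketch: detect that the "runner" container finished a job normally.
// If this returns true, I would expect the controller to delete and
// recreate the pod so the runner restarts with a fresh registration
// token. (Uses corev1 "k8s.io/api/core/v1".)
func runnerCompleted(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name == "runner" &&
			cs.State.Terminated != nil &&
			cs.State.Terminated.ExitCode == 0 {
			return true
		}
	}
	return false
}

On my stuck pod, this condition clearly holds (see the container states below), yet nothing happens.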

The stuck pod's container states:

😱 ❯❯❯ k get pod -n runner ro-runner-52zmh-wxt6k -o json | jq -r ".status.containerStatuses[].state"
{
  "running": {
    "startedAt": "2020-07-30T07:44:10Z"
  }
}
{
  "terminated": {
    "containerID": "docker://19080ec1adc8822bfaf916c43797e7171a1c5ff5b1666b0df2485cd9c19db976",
    "exitCode": 0,
    "finishedAt": "2020-07-30T08:05:52Z",
    "reason": "Completed",
    "startedAt": "2020-07-30T07:44:10Z"
  }
}

Thank you!

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 16 (12 by maintainers)

Most upvoted comments

The error is here: https://github.com/summerwind/actions-runner-controller/blob/ee8fb5a3886ef5a75df5c126bcd3c846e13c801e/github/github.go#L87

The token is not refreshed during a 10-minute window: the old token keeps being returned, and the reconcile function returns with Requeue: true after the token-update function is called. Reducing this 10-minute interval should fix the problem. @mumoshu

Since token expiration is not a huge deal (the runner can still make calls even if the registration token has expired), I suggest we change this function to:

	// Return the cached token only while it is still valid: refresh as
	// soon as it expires, rather than serving it stale for a grace window.
	if ok && rt.GetExpiresAt().After(time.Now()) {
		return rt, nil
	}
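For context, here is a minimal sketch of how that check could sit inside a cached-token helper (the Client shape, field names, and locking are my assumptions for illustration, not the actual code in github/github.go):

package github

import (
	"context"
	"sync"
	"time"

	"github.com/google/go-github/v32/github"
)

// Client is a stand-in for the controller's GitHub client, with a
// simple in-memory cache of registration tokens keyed by repository.
type Client struct {
	*github.Client
	mu        sync.Mutex
	regTokens map[string]*github.RegistrationToken
}

// getRegistrationToken returns the cached registration token while it
// is still valid and fetches a fresh one from the GitHub API otherwise.
func (c *Client) getRegistrationToken(ctx context.Context, owner, repo string) (*github.RegistrationToken, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	key := owner + "/" + repo
	rt, ok := c.regTokens[key]
	// The suggested change: refresh as soon as the token expires,
	// with no 10-minute grace window.
	if ok && rt.GetExpiresAt().After(time.Now()) {
		return rt, nil
	}

	rt, _, err := c.Actions.CreateRegistrationToken(ctx, owner, repo)
	if err != nil {
		return nil, err
	}
	c.regTokens[key] = rt
	return rt, nil
}

With this shape, an expired token triggers an immediate API call on the next reconcile instead of being served stale for up to 10 minutes.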

The top-right graph shows a sample of stopped Docker containers; this also seems to align with workqueue_depth dropping to zero. The sync period is set to 5-minute intervals, as seen from the spacing between emitted metrics. Also note the spike in workqueue_total_retries during the period of stopped containers.

[Screenshot: metrics graphs described above, 2020-11-18]

Since the stopped runner is eventually removed and recreated, this might be because none of our controllers detects the stopped container until the next resync period?

Generally speaking, any controller that creates pods should “watch” events from the pod, and enqueue a reconciliation on the parent “runner” resource on every pod event. Maybe our runner-controller isn’t working like that?
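With controller-runtime, that pattern is typically wired up via Owns(), which watches pods whose ownerReference points at a Runner and enqueues the owning Runner on every pod event. A minimal sketch using the controller-runtime v0.6-era API (the reconciler stub and import path are placeholders, not the project's actual setup):

package controllers

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	actionsv1alpha1 "github.com/summerwind/actions-runner-controller/api/v1alpha1"
)

// RunnerReconciler is a stub reconciler for illustration.
type RunnerReconciler struct {
	client.Client
}

func (r *RunnerReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
	// ...inspect the Runner's pod here and recreate it if the runner
	// container has terminated...
	return ctrl.Result{}, nil
}

func (r *RunnerReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&actionsv1alpha1.Runner{}).
		// Owns() makes every event on an owned Pod enqueue a
		// reconciliation of its parent Runner immediately, instead of
		// waiting for the next resync period.
		Owns(&corev1.Pod{}).
		Complete(r)
}

If the runner-controller registers a watch like this, a container transitioning to Terminated should trigger a reconcile within seconds rather than at the next sync period.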