cml: `cml runner` early shutdown from idle-timeout with active job

Cloud provider: gcp SCM / CICD: github / actions

I have been playing with short --idle-timeout values and I was a bit baffled when I encountered this. One workflow shutdown earlier before the GitHub action completes the final job, and the other completes as expected, but they are using nearly identical workflows.

Successful workflow:

name: Build and run Inference - DEV
on:
  push:
    branches: [dev]
jobs:
  build:
    environment: dev
    runs-on: ubuntu-latest
    steps:
      - uses: dacbd/gcr-build-push@main
        with:
          tags: latest
          project-id: ${{ secrets.GCP_PROJECT_ID }}
          GCR-key: ${{ secrets.GCP_GCR_KEY }}
          container-name: ***
  deploy-runner:
    runs-on: ubuntu-latest
    environment: dev
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: Deploy runner on GCP
        env:
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GCP_CML_RUNNER_KEY }}
        run: |
          cml-runner \
            --single \
            --idle-timeout=360 \
            --token=${{ secrets.PAT_DCB }} \
            --cloud=gcp \
            --cloud-region=us-west \
            --cloud-type=e2-standard-16
  run-inference:
    needs: [build, deploy-runner]
    runs-on: [self-hosted, cml]
    environment: dev
    steps:
      - uses: actions/checkout@v2
      - name: Setup Container image name
        run: echo "ML_CONTAINER=gcr.io/${{ secrets.GCP_PROJECT_ID }}/***" >> "${GITHUB_ENV}"
      - name: Docker GCR Login
        uses: docker/login-action@v1
        with: 
          registry: gcr.io
          username: _json_key
          password: ${{ secrets.GCP_GCR_KEY }}
      - name: Load container from GCR
        run: docker pull ${{ env.ML_CONTAINER }}:latest
      - name: Run Inference
        run: |
          docker run ****

Failed workflow:

name: Build and run Inference - PROD
on:
  schedule: # Saturday at 1400 UTC
    - cron: '0 14 * * 6'
  push:
    branches: [prod]
jobs:
  build:
    environment: prod
    runs-on: ubuntu-latest
    steps:
      - uses: dacbd/gcr-build-push@v1
        with:
          tags: latest
          project-id: ${{ secrets.GCP_PROJECT_ID }}
          GCR-key: ${{ secrets.GCP_GCR_KEY }}
          container-name: ***
  deploy-runner:
    runs-on: ubuntu-latest
    environment: prod
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: Deploy runner on GCP
        env:
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GCP_CML_RUNNER_KEY }}
        run: |
          cml-runner \
            --single \
            --idle-timeout=360 \
            --token=${{ secrets.PAT_DCB }} \
            --cloud=gcp \
            --cloud-region=us-west \
            --cloud-type=e2-standard-16
  run-inference:
    needs: [build, deploy-runner]
    runs-on: [self-hosted, cml]
    environment: prod
    steps:
      - uses: actions/checkout@v2
      - name: Setup Container image name
        run: echo "ML_CONTAINER=gcr.io/${{ secrets.GCP_PROJECT_ID }}/***" >> "${GITHUB_ENV}"
      - name: Docker GCR Login
        uses: docker/login-action@v1
        with: 
          registry: gcr.io
          username: _json_key
          password: ${{ secrets.GCP_GCR_KEY }}
      - name: Load container from GCR
        run: docker pull ${{ env.ML_CONTAINER }}
      - name: Run Inference
        run: |
          docker run ***

Successful cml.service log

daniel_barnes@cml-z53bz3oz1t:~$ sudo journalctl -u cml.service -f
-- Logs begin at Tue 2021-11-09 20:03:04 UTC. --
Nov 09 20:05:47 cml-z53bz3oz1t systemd[1]: Started cml.service.
Nov 09 20:05:52 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Preparing workdir /tmp/tmp.7PolL2jgVJ/.cml/cml-z53bz3oz1t..."}
Nov 09 20:05:52 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Launching github runner"}
Nov 09 20:06:02 cml-z53bz3oz1t cml.sh[17039]: {"level":"warn","message":"SpotNotifier can not be started."}
Nov 09 20:06:03 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:03.618Z","level":"info","message":"runner status","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:06:03 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:03.619Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:06:04 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:04.057Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/yyyy/xxxx","status":"ready"}
Nov 09 20:06:16 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:15.944Z","job":4157205540,"level":"info","message":"runner status Running job: run-inference","repo":"https://github.com/yyyy/xxxx","status":"job_started"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.355Z","job":"","level":"info","message":"runner status Job run-inference completed with result: Succeeded","repo":"https://github.com/yyyy/xxxx","status":"job_ended","success":true}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.595Z","level":"info","message":"runner status √ Removed .credentials","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.596Z","level":"info","message":"runner status √ Removed .runner","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"runner status","reason":"proc_exit","status":"terminated"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"waiting 20 seconds before exiting..."}
Nov 09 20:18:41 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Unregistering runner cml-z53bz3oz1t..."}
Nov 09 20:18:41 cml-z53bz3oz1t cml.sh[17039]: {"level":"error","message":"\tFailed: Cannot destructure property 'id' of '(intermediate value)' as it is undefined."}
Nov 09 20:18:44 cml-z53bz3oz1t systemd[1]: cml.service: Succeeded.

Failed cml.service log

daniel_barnes@cml-uawryesbyo:~$ sudo journalctl -u cml.service -f
-- Logs begin at Tue 2021-11-09 19:23:49 UTC. --
Nov 09 19:26:34 cml-uawryesbyo systemd[1]: Started cml.service.
Nov 09 19:26:39 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Preparing workdir /tmp/tmp.TAz3gTOugR/.cml/cml-uawryesbyo..."}
Nov 09 19:26:39 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Launching github runner"}
Nov 09 19:26:49 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"SpotNotifier can not be started."}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.163Z","level":"info","message":"runner status","repo":"https://github.com/yyyy/xxxx"}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.163Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/yyyy/xxxx"}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.583Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/yyyy/xxxx","status":"ready"}
Nov 09 19:27:02 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"Failed parsing log: Reduce of empty array with no initial value"}
Nov 09 19:27:02 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"Original log bytes, as Base64: 2021-11-09 19:27:01Z: Running job: run-inference\n"}
Nov 09 19:32:51 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"runner status","reason":"timeout:360","status":"terminated"}
Nov 09 19:32:51 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"waiting 20 seconds before exiting..."}
Nov 09 19:33:11 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Unregistering runner cml-uawryesbyo..."}
Nov 09 19:33:12 cml-uawryesbyo cml.sh[17135]: {"level":"error","message":"\tFailed: Bad request - Runner \"cml-uawryesbyo\" is still running a job\""}
Nov 09 19:33:22 cml-uawryesbyo systemd[1]: cml.service: Succeeded.

About this issue

Original URL
State: closed
Created 3 years ago
Reactions: 1
Comments: 44 (34 by maintainers)

Most upvoted comments

This particular instance has is a pretty healthy size and has ran several times without incident.

I have seen several Self-Hosted GitHub Runners crash due to being starved of resources and I haven’t seen them behave like this. (Show the job as Cancelled). Even though GitHub says in it’s message that it could be from resources. (At least in our use they seem die in much more dramatic fashion 😂)

dacbd on Mar 2, 2022

I haven’t been able to get this error again, but in my past logs I have not had that line, and the PAT has had the same permissions.

however, in both cases, they were prefaced with a log parsing failure.

Nov 09 19:27:02 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"Failed parsing log: Reduce of empty array with no initial value"}

for the one that I was able to capture.

dacbd on Feb 15, 2022

@dacbd @mikolajpabiszczak for the record

DavidGOrtega on Feb 22, 2022

These logs are very insightful for me. Thanks!

DavidGOrtega on Feb 8, 2022

@dacbd I’m aware! Im already working on it 😉

DavidGOrtega on Feb 15, 2022

@mikolajpabiszczak in this thread? I must be blind 👓

Yes, this comment: https://github.com/iterative/cml/issues/808#issuecomment-1032390230

mikolajpabiszczak on Feb 15, 2022

@mikolajpabiszczak indeed you have helped me a lot indirectly!

DavidGOrtega on Feb 15, 2022

Ahhh, I think that it’s my bad now - I believe I found the reason for that: lack of proper permissions to self-hosted runners in organisation chapter. Should be ok now. Sorry for the mess!

mikolajpabiszczak on Feb 15, 2022

@mikolajpabiszczak can you capture your cml.service logs from the instance to see if you get the same error you can see in mine from the above?

Will come back with this bit later.

@mikolajpabiszczak what were the log lines after the ones you’d posted?

What follows is the end of the raw log for deploy-runner action.

2022-01-20T10:30:52.0025949Z '43143f89dc9af728a7c89f09066ebb13f68d14ea'
2022-01-20T10:30:52.0217021Z ##[group]Run cml runner \
2022-01-20T10:30:52.0217824Z [36;1mcml runner \[0m
2022-01-20T10:30:52.0218291Z [36;1m    --cloud=aws \[0m
2022-01-20T10:30:52.0218790Z [36;1m    --cloud-region=eu-west-1 \[0m
2022-01-20T10:30:52.0219404Z [36;1m    --cloud-type=g5.2xlarge \[0m
2022-01-20T10:30:52.0219911Z [36;1m    --cloud-spot \[0m
2022-01-20T10:30:52.0220398Z [36;1m    --idle-timeout=-1 \[0m
2022-01-20T10:30:52.0220911Z [36;1m    --labels=training-fold[0m
2022-01-20T10:30:52.0278602Z shell: /usr/bin/bash -e {0}
2022-01-20T10:30:52.0279066Z env:
2022-01-20T10:30:52.0279706Z   AWS_ACCESS_KEY_ID: ***
2022-01-20T10:30:52.0280293Z   AWS_SECRET_ACCESS_KEY: ***
2022-01-20T10:30:52.0280923Z   REPO_TOKEN: ***
2022-01-20T10:30:52.0281364Z ##[endgroup]
2022-01-20T10:30:52.6846750Z {"level":"warn","message":"ignoring RUNNER_NAME environment variable, use CML_RUNNER_NAME or --name instead"}
2022-01-20T10:30:55.9902861Z {"level":"info","message":"Preparing workdir /home/runner/.cml/cml-wrczclxokb..."}
2022-01-20T10:30:55.9911256Z {"level":"info","message":"Deploying cloud runner plan..."}
2022-01-20T10:30:55.9912957Z {"level":"info","message":"Terraform apply..."}
2022-01-20T10:32:13.9705684Z {"level":"info","message":"{\"awsSecurityGroup\":null,\"awsSubnetId\":null,\"cloud\":\"aws\",\"driver\":\"github\",\"id\":\"iterative-gz890wjbruo9\",\"idleTimeout\":300,\"image\":\"iterative-cml\",\"instanceGpu\":null,\"instanceHddSize\":35,\"instanceIp\":\"18.203.69.174\",\"instanceLaunchTime\":\"2022-01-20T10:31:04Z\",\"instanceType\":\"g5.2xlarge\",\"instancePermissionSet\":null,\"labels\":\"training-fold\",\"metadata\":null,\"name\":\"cml-wrczclxokb\",\"region\":\"eu-west-1\",\"repo\":\"https://github.com/yyyu/xxxx\",\"single\":false,\"spot\":true,\"spotPrice\":-1,\"timeouts\":null}"}
2022-01-20T10:32:13.9848301Z Post job cleanup.
2022-01-20T10:32:14.1102631Z [command]/usr/bin/git version
2022-01-20T10:32:14.1149812Z git version 2.34.1
2022-01-20T10:32:14.1187063Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2022-01-20T10:32:14.1228443Z [command]/usr/bin/git submodule foreach --recursive git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :
2022-01-20T10:32:14.1462802Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2022-01-20T10:32:14.1490828Z http.https://github.com/.extraheader
2022-01-20T10:32:14.1500351Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader
2022-01-20T10:32:14.1538014Z [command]/usr/bin/git submodule foreach --recursive git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :
2022-01-20T10:32:14.1932439Z Cleaning up orphan processes

mikolajpabiszczak on Feb 21, 2022

It doesn’t cause the termination itself but it never enters the “has job” state from the cml perspective and so it’s still counting down it’s timeout timer.

On Thu, Jan 20, 2022, 07:03 Casper da Costa-Luis @.***> wrote:

you can see a parsing error where it would normally detect that the actions runner has received a job

That’s just a warning (not an error, i.e. it shouldn’t cause a termination) AFAIK /CC @DavidGOrtega https://github.com/DavidGOrtega @0x2b3bfa0 https://github.com/0x2b3bfa0 (incidentally, #876 https://github.com/iterative/cml/pull/876).

— Reply to this email directly, view it on GitHub https://github.com/iterative/cml/issues/808#issuecomment-1017598801, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIN7MZAJXBPRXW4ES5RMWLUXAP3FANCNFSM5HWHD74Q . You are receiving this because you modified the open/close state.Message ID: @.***>

dacbd on Jan 20, 2022

@mikolajpabiszczak timeout parsing mismatch seems like a different issue: #875.

However despite the mismatch the runner should not shut down while busy. Given:

transient issue with GH + GCP

transient issue with GH + AWS

error messages all clearly stating “The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled”

The GH issues were solved some time ago (at least judging by the reports). Let me just say that the problem I described is not incidental - it is constant, it persists from quite some time (~2 weeks, since when I started playing with CML at bigger scale) and I did not manage to have a single run without passing some high --idle-timeout value.

I can only surmise that GH itself has cancelled the job. This could be due to:

someone cancelling the job via the GH interface

not the case

some GH API call to cancel the job

not triggered by me for sure

transient networking issues stopping GH from talking to the instance quick enough, thus making GH cancel the job

I believe that it is not the case, as both the logs are appearing and the runner is shown as connected/working on the GitHub self hosted runners page.

mikolajpabiszczak on Jan 20, 2022

@mikolajpabiszczak timeout parsing mismatch seems like a different issue: #875.

However despite the mismatch the runner should not shut down while busy. Given:

transient issue with GH + GCP
transient issue with GH + AWS
error messages all clearly stating “The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled”

I can only surmise that GH itself has cancelled the job. This could be due to:

someone cancelling the job via the GH interface
some GH API call to cancel the job
transient networking issues stopping GH from talking to the instance quick enough, thus making GH cancel the job

casperdcl on Jan 20, 2022

I had this reoccur.

dacbd on Jan 19, 2022

I double-checked that I could still reproduce this error and it seems to be resolved.

I last tested (which failed) on Nov 22, 2021

@DavidGOrtega / @casperdcl feel free to re-open but it looks like I might have been the only one to have during the time this has been open.

dacbd on Jan 13, 2022

Well seeing this in both workflows now, I’m going to increase the idle-timeout value to bypass this issue since the runtime isn’t that long, but I’ve saved the state in a project branch to revisit this. I’m convinced this is from the failed log parsing error that I have outlined above.

dacbd on Nov 10, 2021