cml: `cml runner` early shutdown from idle-timeout with active job

Cloud provider: gcp SCM / CICD: github / actions

I have been playing with short --idle-timeout values and I was a bit baffled when I encountered this. One workflow shutdown earlier before the GitHub action completes the final job, and the other completes as expected, but they are using nearly identical workflows.

Successful workflow:

name: Build and run Inference - DEV
on:
  push:
    branches: [dev]
jobs:
  build:
    environment: dev
    runs-on: ubuntu-latest
    steps:
      - uses: dacbd/gcr-build-push@main
        with:
          tags: latest
          project-id: ${{ secrets.GCP_PROJECT_ID }}
          GCR-key: ${{ secrets.GCP_GCR_KEY }}
          container-name: ***
  deploy-runner:
    runs-on: ubuntu-latest
    environment: dev
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: Deploy runner on GCP
        env:
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GCP_CML_RUNNER_KEY }}
        run: |
          cml-runner \
            --single \
            --idle-timeout=360 \
            --token=${{ secrets.PAT_DCB }} \
            --cloud=gcp \
            --cloud-region=us-west \
            --cloud-type=e2-standard-16
  run-inference:
    needs: [build, deploy-runner]
    runs-on: [self-hosted, cml]
    environment: dev
    steps:
      - uses: actions/checkout@v2
      - name: Setup Container image name
        run: echo "ML_CONTAINER=gcr.io/${{ secrets.GCP_PROJECT_ID }}/***" >> "${GITHUB_ENV}"
      - name: Docker GCR Login
        uses: docker/login-action@v1
        with: 
          registry: gcr.io
          username: _json_key
          password: ${{ secrets.GCP_GCR_KEY }}
      - name: Load container from GCR
        run: docker pull ${{ env.ML_CONTAINER }}:latest
      - name: Run Inference
        run: |
          docker run ****

Failed workflow:

name: Build and run Inference - PROD
on:
  schedule: # Saturday at 1400 UTC
    - cron: '0 14 * * 6'
  push:
    branches: [prod]
jobs:
  build:
    environment: prod
    runs-on: ubuntu-latest
    steps:
      - uses: dacbd/gcr-build-push@v1
        with:
          tags: latest
          project-id: ${{ secrets.GCP_PROJECT_ID }}
          GCR-key: ${{ secrets.GCP_GCR_KEY }}
          container-name: ***
  deploy-runner:
    runs-on: ubuntu-latest
    environment: prod
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: Deploy runner on GCP
        env:
          GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GCP_CML_RUNNER_KEY }}
        run: |
          cml-runner \
            --single \
            --idle-timeout=360 \
            --token=${{ secrets.PAT_DCB }} \
            --cloud=gcp \
            --cloud-region=us-west \
            --cloud-type=e2-standard-16
  run-inference:
    needs: [build, deploy-runner]
    runs-on: [self-hosted, cml]
    environment: prod
    steps:
      - uses: actions/checkout@v2
      - name: Setup Container image name
        run: echo "ML_CONTAINER=gcr.io/${{ secrets.GCP_PROJECT_ID }}/***" >> "${GITHUB_ENV}"
      - name: Docker GCR Login
        uses: docker/login-action@v1
        with: 
          registry: gcr.io
          username: _json_key
          password: ${{ secrets.GCP_GCR_KEY }}
      - name: Load container from GCR
        run: docker pull ${{ env.ML_CONTAINER }}
      - name: Run Inference
        run: |
          docker run ***

Successful cml.service log

daniel_barnes@cml-z53bz3oz1t:~$ sudo journalctl -u cml.service -f
-- Logs begin at Tue 2021-11-09 20:03:04 UTC. --
Nov 09 20:05:47 cml-z53bz3oz1t systemd[1]: Started cml.service.
Nov 09 20:05:52 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Preparing workdir /tmp/tmp.7PolL2jgVJ/.cml/cml-z53bz3oz1t..."}
Nov 09 20:05:52 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Launching github runner"}
Nov 09 20:06:02 cml-z53bz3oz1t cml.sh[17039]: {"level":"warn","message":"SpotNotifier can not be started."}
Nov 09 20:06:03 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:03.618Z","level":"info","message":"runner status","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:06:03 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:03.619Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:06:04 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:04.057Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/yyyy/xxxx","status":"ready"}
Nov 09 20:06:16 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:15.944Z","job":4157205540,"level":"info","message":"runner status Running job: run-inference","repo":"https://github.com/yyyy/xxxx","status":"job_started"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.355Z","job":"","level":"info","message":"runner status Job run-inference completed with result: Succeeded","repo":"https://github.com/yyyy/xxxx","status":"job_ended","success":true}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.595Z","level":"info","message":"runner status √ Removed .credentials","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.596Z","level":"info","message":"runner status √ Removed .runner","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"runner status","reason":"proc_exit","status":"terminated"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"waiting 20 seconds before exiting..."}
Nov 09 20:18:41 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Unregistering runner cml-z53bz3oz1t..."}
Nov 09 20:18:41 cml-z53bz3oz1t cml.sh[17039]: {"level":"error","message":"\tFailed: Cannot destructure property 'id' of '(intermediate value)' as it is undefined."}
Nov 09 20:18:44 cml-z53bz3oz1t systemd[1]: cml.service: Succeeded.

Failed cml.service log

daniel_barnes@cml-uawryesbyo:~$ sudo journalctl -u cml.service -f
-- Logs begin at Tue 2021-11-09 19:23:49 UTC. --
Nov 09 19:26:34 cml-uawryesbyo systemd[1]: Started cml.service.
Nov 09 19:26:39 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Preparing workdir /tmp/tmp.TAz3gTOugR/.cml/cml-uawryesbyo..."}
Nov 09 19:26:39 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Launching github runner"}
Nov 09 19:26:49 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"SpotNotifier can not be started."}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.163Z","level":"info","message":"runner status","repo":"https://github.com/yyyy/xxxx"}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.163Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/yyyy/xxxx"}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.583Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/yyyy/xxxx","status":"ready"}
Nov 09 19:27:02 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"Failed parsing log: Reduce of empty array with no initial value"}
Nov 09 19:27:02 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"Original log bytes, as Base64: 2021-11-09 19:27:01Z: Running job: run-inference\n"}
Nov 09 19:32:51 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"runner status","reason":"timeout:360","status":"terminated"}
Nov 09 19:32:51 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"waiting 20 seconds before exiting..."}
Nov 09 19:33:11 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Unregistering runner cml-uawryesbyo..."}
Nov 09 19:33:12 cml-uawryesbyo cml.sh[17135]: {"level":"error","message":"\tFailed: Bad request - Runner \"cml-uawryesbyo\" is still running a job\""}
Nov 09 19:33:22 cml-uawryesbyo systemd[1]: cml.service: Succeeded.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 44 (34 by maintainers)

Most upvoted comments

This particular instance has is a pretty healthy size and has ran several times without incident.

I have seen several Self-Hosted GitHub Runners crash due to being starved of resources and I haven’t seen them behave like this. (Show the job as Cancelled). Even though GitHub says in it’s message that it could be from resources. (At least in our use they seem die in much more dramatic fashion 😂)

I haven’t been able to get this error again, but in my past logs I have not had that line, and the PAT has had the same permissions.

however, in both cases, they were prefaced with a log parsing failure.

Nov 09 19:27:02 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"Failed parsing log: Reduce of empty array with no initial value"}

for the one that I was able to capture.

These logs are very insightful for me. Thanks!

@dacbd I’m aware! Im already working on it 😉

@mikolajpabiszczak indeed you have helped me a lot indirectly!

Ahhh, I think that it’s my bad now - I believe I found the reason for that: lack of proper permissions to self-hosted runners in organisation chapter. Should be ok now. Sorry for the mess!

@mikolajpabiszczak can you capture your cml.service logs from the instance to see if you get the same error you can see in mine from the above?

Will come back with this bit later.

@mikolajpabiszczak what were the log lines after the ones you’d posted?

What follows is the end of the raw log for deploy-runner action.

2022-01-20T10:30:52.0025949Z '43143f89dc9af728a7c89f09066ebb13f68d14ea'
2022-01-20T10:30:52.0217021Z ##[group]Run cml runner \
2022-01-20T10:30:52.0217824Z cml runner \
2022-01-20T10:30:52.0218291Z     --cloud=aws \
2022-01-20T10:30:52.0218790Z     --cloud-region=eu-west-1 \
2022-01-20T10:30:52.0219404Z     --cloud-type=g5.2xlarge \
2022-01-20T10:30:52.0219911Z     --cloud-spot \
2022-01-20T10:30:52.0220398Z     --idle-timeout=-1 \
2022-01-20T10:30:52.0220911Z     --labels=training-fold
2022-01-20T10:30:52.0278602Z shell: /usr/bin/bash -e {0}
2022-01-20T10:30:52.0279066Z env:
2022-01-20T10:30:52.0279706Z   AWS_ACCESS_KEY_ID: ***
2022-01-20T10:30:52.0280293Z   AWS_SECRET_ACCESS_KEY: ***
2022-01-20T10:30:52.0280923Z   REPO_TOKEN: ***
2022-01-20T10:30:52.0281364Z ##[endgroup]
2022-01-20T10:30:52.6846750Z {"level":"warn","message":"ignoring RUNNER_NAME environment variable, use CML_RUNNER_NAME or --name instead"}
2022-01-20T10:30:55.9902861Z {"level":"info","message":"Preparing workdir /home/runner/.cml/cml-wrczclxokb..."}
2022-01-20T10:30:55.9911256Z {"level":"info","message":"Deploying cloud runner plan..."}
2022-01-20T10:30:55.9912957Z {"level":"info","message":"Terraform apply..."}
2022-01-20T10:32:13.9705684Z {"level":"info","message":"{\"awsSecurityGroup\":null,\"awsSubnetId\":null,\"cloud\":\"aws\",\"driver\":\"github\",\"id\":\"iterative-gz890wjbruo9\",\"idleTimeout\":300,\"image\":\"iterative-cml\",\"instanceGpu\":null,\"instanceHddSize\":35,\"instanceIp\":\"18.203.69.174\",\"instanceLaunchTime\":\"2022-01-20T10:31:04Z\",\"instanceType\":\"g5.2xlarge\",\"instancePermissionSet\":null,\"labels\":\"training-fold\",\"metadata\":null,\"name\":\"cml-wrczclxokb\",\"region\":\"eu-west-1\",\"repo\":\"https://github.com/yyyu/xxxx\",\"single\":false,\"spot\":true,\"spotPrice\":-1,\"timeouts\":null}"}
2022-01-20T10:32:13.9848301Z Post job cleanup.
2022-01-20T10:32:14.1102631Z [command]/usr/bin/git version
2022-01-20T10:32:14.1149812Z git version 2.34.1
2022-01-20T10:32:14.1187063Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2022-01-20T10:32:14.1228443Z [command]/usr/bin/git submodule foreach --recursive git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :
2022-01-20T10:32:14.1462802Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2022-01-20T10:32:14.1490828Z http.https://github.com/.extraheader
2022-01-20T10:32:14.1500351Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader
2022-01-20T10:32:14.1538014Z [command]/usr/bin/git submodule foreach --recursive git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :
2022-01-20T10:32:14.1932439Z Cleaning up orphan processes

It doesn’t cause the termination itself but it never enters the “has job” state from the cml perspective and so it’s still counting down it’s timeout timer.

On Thu, Jan 20, 2022, 07:03 Casper da Costa-Luis @.***> wrote:

you can see a parsing error where it would normally detect that the actions runner has received a job

That’s just a warning (not an error, i.e. it shouldn’t cause a termination) AFAIK /CC @DavidGOrtega https://github.com/DavidGOrtega @0x2b3bfa0 https://github.com/0x2b3bfa0 (incidentally, #876 https://github.com/iterative/cml/pull/876).

— Reply to this email directly, view it on GitHub https://github.com/iterative/cml/issues/808#issuecomment-1017598801, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIN7MZAJXBPRXW4ES5RMWLUXAP3FANCNFSM5HWHD74Q . You are receiving this because you modified the open/close state.Message ID: @.***>

@mikolajpabiszczak timeout parsing mismatch seems like a different issue: #875.

However despite the mismatch the runner should not shut down while busy. Given:

  • transient issue with GH + GCP
  • transient issue with GH + AWS
  • error messages all clearly stating “The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled”

The GH issues were solved some time ago (at least judging by the reports). Let me just say that the problem I described is not incidental - it is constant, it persists from quite some time (~2 weeks, since when I started playing with CML at bigger scale) and I did not manage to have a single run without passing some high --idle-timeout value.

I can only surmise that GH itself has cancelled the job. This could be due to:

  • someone cancelling the job via the GH interface

not the case

  • some GH API call to cancel the job

not triggered by me for sure

  • transient networking issues stopping GH from talking to the instance quick enough, thus making GH cancel the job

I believe that it is not the case, as both the logs are appearing and the runner is shown as connected/working on the GitHub self hosted runners page.

@mikolajpabiszczak timeout parsing mismatch seems like a different issue: #875.

However despite the mismatch the runner should not shut down while busy. Given:

  • transient issue with GH + GCP
  • transient issue with GH + AWS
  • error messages all clearly stating “The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled”

I can only surmise that GH itself has cancelled the job. This could be due to:

  • someone cancelling the job via the GH interface
  • some GH API call to cancel the job
  • transient networking issues stopping GH from talking to the instance quick enough, thus making GH cancel the job

I had this reoccur.

I double-checked that I could still reproduce this error and it seems to be resolved.

I last tested (which failed) on Nov 22, 2021

@DavidGOrtega / @casperdcl feel free to re-open but it looks like I might have been the only one to have during the time this has been open.

Well seeing this in both workflows now, I’m going to increase the idle-timeout value to bypass this issue since the runtime isn’t that long, but I’ve saved the state in a project branch to revisit this. I’m convinced this is from the failed log parsing error that I have outlined above.