cml: `cml runner` early shutdown from idle-timeout with active job
Cloud provider: gcp
SCM / CICD: github
/ actions
I have been playing with short --idle-timeout
values and I was a bit baffled when I encountered this. One workflow shutdown earlier before the GitHub action completes the final job, and the other completes as expected, but they are using nearly identical workflows.
Successful workflow:
name: Build and run Inference - DEV
on:
push:
branches: [dev]
jobs:
build:
environment: dev
runs-on: ubuntu-latest
steps:
- uses: dacbd/gcr-build-push@main
with:
tags: latest
project-id: ${{ secrets.GCP_PROJECT_ID }}
GCR-key: ${{ secrets.GCP_GCR_KEY }}
container-name: ***
deploy-runner:
runs-on: ubuntu-latest
environment: dev
steps:
- uses: iterative/setup-cml@v1
- uses: actions/checkout@v2
- name: Deploy runner on GCP
env:
GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GCP_CML_RUNNER_KEY }}
run: |
cml-runner \
--single \
--idle-timeout=360 \
--token=${{ secrets.PAT_DCB }} \
--cloud=gcp \
--cloud-region=us-west \
--cloud-type=e2-standard-16
run-inference:
needs: [build, deploy-runner]
runs-on: [self-hosted, cml]
environment: dev
steps:
- uses: actions/checkout@v2
- name: Setup Container image name
run: echo "ML_CONTAINER=gcr.io/${{ secrets.GCP_PROJECT_ID }}/***" >> "${GITHUB_ENV}"
- name: Docker GCR Login
uses: docker/login-action@v1
with:
registry: gcr.io
username: _json_key
password: ${{ secrets.GCP_GCR_KEY }}
- name: Load container from GCR
run: docker pull ${{ env.ML_CONTAINER }}:latest
- name: Run Inference
run: |
docker run ****
Failed workflow:
name: Build and run Inference - PROD
on:
schedule: # Saturday at 1400 UTC
- cron: '0 14 * * 6'
push:
branches: [prod]
jobs:
build:
environment: prod
runs-on: ubuntu-latest
steps:
- uses: dacbd/gcr-build-push@v1
with:
tags: latest
project-id: ${{ secrets.GCP_PROJECT_ID }}
GCR-key: ${{ secrets.GCP_GCR_KEY }}
container-name: ***
deploy-runner:
runs-on: ubuntu-latest
environment: prod
steps:
- uses: iterative/setup-cml@v1
- uses: actions/checkout@v2
- name: Deploy runner on GCP
env:
GOOGLE_APPLICATION_CREDENTIALS_DATA: ${{ secrets.GCP_CML_RUNNER_KEY }}
run: |
cml-runner \
--single \
--idle-timeout=360 \
--token=${{ secrets.PAT_DCB }} \
--cloud=gcp \
--cloud-region=us-west \
--cloud-type=e2-standard-16
run-inference:
needs: [build, deploy-runner]
runs-on: [self-hosted, cml]
environment: prod
steps:
- uses: actions/checkout@v2
- name: Setup Container image name
run: echo "ML_CONTAINER=gcr.io/${{ secrets.GCP_PROJECT_ID }}/***" >> "${GITHUB_ENV}"
- name: Docker GCR Login
uses: docker/login-action@v1
with:
registry: gcr.io
username: _json_key
password: ${{ secrets.GCP_GCR_KEY }}
- name: Load container from GCR
run: docker pull ${{ env.ML_CONTAINER }}
- name: Run Inference
run: |
docker run ***
Successful cml.service
log
daniel_barnes@cml-z53bz3oz1t:~$ sudo journalctl -u cml.service -f
-- Logs begin at Tue 2021-11-09 20:03:04 UTC. --
Nov 09 20:05:47 cml-z53bz3oz1t systemd[1]: Started cml.service.
Nov 09 20:05:52 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Preparing workdir /tmp/tmp.7PolL2jgVJ/.cml/cml-z53bz3oz1t..."}
Nov 09 20:05:52 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Launching github runner"}
Nov 09 20:06:02 cml-z53bz3oz1t cml.sh[17039]: {"level":"warn","message":"SpotNotifier can not be started."}
Nov 09 20:06:03 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:03.618Z","level":"info","message":"runner status","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:06:03 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:03.619Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:06:04 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:04.057Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/yyyy/xxxx","status":"ready"}
Nov 09 20:06:16 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:06:15.944Z","job":4157205540,"level":"info","message":"runner status Running job: run-inference","repo":"https://github.com/yyyy/xxxx","status":"job_started"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.355Z","job":"","level":"info","message":"runner status Job run-inference completed with result: Succeeded","repo":"https://github.com/yyyy/xxxx","status":"job_ended","success":true}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.595Z","level":"info","message":"runner status √ Removed .credentials","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"date":"2021-11-09T20:18:21.596Z","level":"info","message":"runner status √ Removed .runner","repo":"https://github.com/yyyy/xxxx"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"runner status","reason":"proc_exit","status":"terminated"}
Nov 09 20:18:21 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"waiting 20 seconds before exiting..."}
Nov 09 20:18:41 cml-z53bz3oz1t cml.sh[17039]: {"level":"info","message":"Unregistering runner cml-z53bz3oz1t..."}
Nov 09 20:18:41 cml-z53bz3oz1t cml.sh[17039]: {"level":"error","message":"\tFailed: Cannot destructure property 'id' of '(intermediate value)' as it is undefined."}
Nov 09 20:18:44 cml-z53bz3oz1t systemd[1]: cml.service: Succeeded.
Failed cml.service
log
daniel_barnes@cml-uawryesbyo:~$ sudo journalctl -u cml.service -f
-- Logs begin at Tue 2021-11-09 19:23:49 UTC. --
Nov 09 19:26:34 cml-uawryesbyo systemd[1]: Started cml.service.
Nov 09 19:26:39 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Preparing workdir /tmp/tmp.TAz3gTOugR/.cml/cml-uawryesbyo..."}
Nov 09 19:26:39 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Launching github runner"}
Nov 09 19:26:49 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"SpotNotifier can not be started."}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.163Z","level":"info","message":"runner status","repo":"https://github.com/yyyy/xxxx"}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.163Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/yyyy/xxxx"}
Nov 09 19:26:50 cml-uawryesbyo cml.sh[17135]: {"date":"2021-11-09T19:26:50.583Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/yyyy/xxxx","status":"ready"}
Nov 09 19:27:02 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"Failed parsing log: Reduce of empty array with no initial value"}
Nov 09 19:27:02 cml-uawryesbyo cml.sh[17135]: {"level":"warn","message":"Original log bytes, as Base64: 2021-11-09 19:27:01Z: Running job: run-inference\n"}
Nov 09 19:32:51 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"runner status","reason":"timeout:360","status":"terminated"}
Nov 09 19:32:51 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"waiting 20 seconds before exiting..."}
Nov 09 19:33:11 cml-uawryesbyo cml.sh[17135]: {"level":"info","message":"Unregistering runner cml-uawryesbyo..."}
Nov 09 19:33:12 cml-uawryesbyo cml.sh[17135]: {"level":"error","message":"\tFailed: Bad request - Runner \"cml-uawryesbyo\" is still running a job\""}
Nov 09 19:33:22 cml-uawryesbyo systemd[1]: cml.service: Succeeded.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 44 (34 by maintainers)
This particular instance has is a pretty healthy size and has ran several times without incident.
I have seen several Self-Hosted GitHub Runners crash due to being starved of resources and I haven’t seen them behave like this. (Show the job as Cancelled). Even though GitHub says in it’s message that it could be from resources. (At least in our use they seem die in much more dramatic fashion 😂)
I haven’t been able to get this error again, but in my past logs I have not had that line, and the PAT has had the same permissions.
however, in both cases, they were prefaced with a log parsing failure.
for the one that I was able to capture.
@dacbd @mikolajpabiszczak for the record
These logs are very insightful for me. Thanks!
@dacbd I’m aware! Im already working on it 😉
Yes, this comment: https://github.com/iterative/cml/issues/808#issuecomment-1032390230
@mikolajpabiszczak indeed you have helped me a lot indirectly!
Ahhh, I think that it’s my bad now - I believe I found the reason for that: lack of proper permissions to self-hosted runners in
organisation
chapter. Should be ok now. Sorry for the mess!Will come back with this bit later.
What follows is the end of the raw log for
deploy-runner
action.It doesn’t cause the termination itself but it never enters the “has job” state from the cml perspective and so it’s still counting down it’s timeout timer.
On Thu, Jan 20, 2022, 07:03 Casper da Costa-Luis @.***> wrote:
The GH issues were solved some time ago (at least judging by the reports). Let me just say that the problem I described is not incidental - it is constant, it persists from quite some time (~2 weeks, since when I started playing with CML at bigger scale) and I did not manage to have a single run without passing some high
--idle-timeout
value.not the case
not triggered by me for sure
I believe that it is not the case, as both the logs are appearing and the runner is shown as connected/working on the GitHub self hosted runners page.
@mikolajpabiszczak timeout parsing mismatch seems like a different issue: #875.
However despite the mismatch the runner should not shut down while busy. Given:
I can only surmise that GH itself has cancelled the job. This could be due to:
I had this reoccur.
I double-checked that I could still reproduce this error and it seems to be resolved.
I last tested (which failed) on
Nov 22, 2021
@DavidGOrtega / @casperdcl feel free to re-open but it looks like I might have been the only one to have during the time this has been open.
Well seeing this in both workflows now, I’m going to increase the idle-timeout value to bypass this issue since the runtime isn’t that long, but I’ve saved the state in a project branch to revisit this. I’m convinced this is from the failed log parsing error that I have outlined above.