cml: `cml-runner` shut down due to idle timeout while still busy.

Hey everyone, I started getting this error a couple of weeks ago: jobs on GHA are failing with exit code 143. After a little digging, it seems the failures come from the --idle-timeout parameter, which defaults to 300 seconds. The runner apparently decides it’s idle and shuts down, even though a job is still running! Not sure why this is happening; setting --idle-timeout=-1 works around it, but I don’t want to disable the timeout entirely, for obvious reasons.
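For context, the runner is launched roughly like this (a minimal sketch, not my exact setup; the cloud flags and label are illustrative, and --idle-timeout is the parameter in question):

$ cml-runner \
>     --cloud=aws \
>     --cloud-region=eu-west-2 \
>     --cloud-type=m5.2xlarge \
>     --labels=cml-runner \
>     --idle-timeout=300  # the default; -1 disables the timeout entirely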

I’m using cml version 0.12.0 (which, from what I can see, addressed a similar issue: https://github.com/iterative/cml/issues/808).

Here are the logs from the runners:

root@ip-172-31-43-163:~# sudo journalctl -u cml.service -f
-- Logs begin at Tue 2022-02-22 16:02:55 UTC. --
Mar 24 17:10:29 ip-172-31-43-163 cml.sh[1869]: sshd override added, restarting daemon
Mar 24 17:10:30 ip-172-31-43-163 sudo[2799]: pam_unix(sudo:session): session closed for user root
Mar 24 17:10:37 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"Preparing workdir /tmp/tmp.NZ4U6HZb1S/.cml/cml-wdw16iotk7..."}
Mar 24 17:10:37 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"Launching github runner"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"EC2 id i-0c767456d3e8c3eec"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.327Z","level":"info","message":"runner status","repo":"https://github.com/continuum-industries/Pareto"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.328Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/continuum-industries/Pareto"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.910Z","level":"info","message":"runner status Current runner version: '2.289.1'","repo":"https://github.com/continuum-industries/Pareto"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.911Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/continuum-industries/Pareto","status":"ready"}
Mar 24 17:11:18 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:11:18.401Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (BASELINE_GREENLINK_ME_CASE_04, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:15:50 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}
Mar 24 17:16:40 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:16:40.360Z","job":"gh","level":"info","message":"runner status Job Run Optimisations (BASELINE_GREENLINK_ME_CASE_04, 0) completed with result: Succeeded","repo":"https://github.com/continuum-industries/Pareto","status":"job_ended","success":true}
Mar 24 17:16:42 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:16:42.954Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (OFFSHORE_PIPELINE, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:20:51 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}
Mar 24 17:21:09 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:21:09.794Z","job":"gh","level":"info","message":"runner status Job Run Optimisations (OFFSHORE_PIPELINE, 0) completed with result: Succeeded","repo":"https://github.com/continuum-industries/Pareto","status":"job_ended","success":true}
Mar 24 17:21:15 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:21:15.077Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (REAL_PIPELINE_MOATA_430, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:25:43 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:25:43.309Z","job":"gh","level":"info","message":"runner status Job Run Optimisations (REAL_PIPELINE_MOATA_430, 0) completed with result: Succeeded","repo":"https://github.com/continuum-industries/Pareto","status":"job_ended","success":true}
Mar 24 17:25:46 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:25:46.246Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (SCOTWIND_2, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:25:51 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}
Mar 24 17:30:51 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"runner status","reason":"timeout:300","status":"terminated"}
Mar 24 17:30:51 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"waiting 10 seconds before exiting..."}
Mar 24 17:31:01 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"Unregistering runner cml-wdw16iotk7..."}
Mar 24 17:31:03 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"\tFailed: Bad request - Runner \"cml-wdw16iotk7\" is still running a job\""}

Any idea what might be going on here?

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 3
  • Comments: 19 (14 by maintainers)

Most upvoted comments

the Ghost is back! 😢

@0x2b3bfa0 any luck on this? It’s causing a lot of problems with our CI pipeline, so it would be great to fix ASAP!

@thatGreekGuy96 My immediate workaround, which has worked in the past, is to extend the timeout another way: use --idle-timeout=21600 (6 h), i.e. longer than your expected runtime. This issue has previously been intermittent, and the root cause has been extremely hard to pin down. You can take a look here as well for previous context on something similar.
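In a GitHub Actions workflow, that workaround would look roughly like this (a sketch only; the cloud settings and secret names are placeholders for your own):

    deploy-runner:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v2
        - uses: iterative/setup-cml@v1
        - run: |
            cml-runner \
              --cloud=aws \
              --cloud-region=eu-west-2 \
              --cloud-type=m5.2xlarge \
              --labels=cml-runner \
              --token=${{ secrets.PERSONAL_ACCESS_TOKEN }} \
              --idle-timeout=21600
          env:
            AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
            AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}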

Sounds like some sort of regression introduced with #689… 🙈 I’ll take a look a bit later today and try to find out whether that’s the case.

Opened here: https://github.com/iterative/cml/issues/973 @0x2b3bfa0

I guess you can mark this issue as closed now.

@thatGreekGuy96, are you still experiencing those unexpected delays? If so, please open a new issue with the details. 🙏🏼

GitHub Response

Apologies for the trouble here.

Engineering is reporting that the underlying problem (which is related to Elasticsearch) has been fixed but the results returned from the filter might not be complete until we have a new index. This could take a couple of days, sorry.

There is a discussion about this in the community and Engineering has responded - https://github.community/t/workflow-call-has-an-inconsistent-status-in-github-rest-api/244164. A banner will also be displayed in the UI to alert you of this issue when you try to filter the workflow runs. Accessing the endpoint without any filter should return all the workflow runs correctly.

I’ll let you know once the index fully rebuilds.

@DavidGOrtega, this will interest you for sure. 😄

🔔 @thatGreekGuy96, @dacbd and @mattlbeck, it should be fixed with v0.14.1
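If you installed CML through npm, picking up the fix should be a one-liner (assuming the @dvcorg/cml package, which is how CML is distributed on npm):

$ npm install --global @dvcorg/cml@0.14.1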

I’m very sure that this is a new issue. I had the runner working for hours as a test.

Cause

Inconsistent behavior in the GitHub REST API:

Reproducing

  1. Create a GitHub repository and create a file named .github/workflows/test.yml with this content:

    on: workflow_dispatch
    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
        - run: sleep infinity
    
  2. Trigger the workflow with the workflow_dispatch event (instructions)

  3. Set the following environment variables:

    • GITHUB_TOKEN — personal access token
    • REPOSITORY_OWNER — user/organization that owns the repository; e.g. the user in github.com/user/repository
    • REPOSITORY_NAME — name of the repository; e.g. the repository in github.com/user/repository
  4. Run the following snippets
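For reference, steps 2 and 3 can also be done from a shell (the values below are placeholders, and the dispatch call assumes the default branch is main):

$ export GITHUB_TOKEN="<personal access token>"
$ export REPOSITORY_OWNER="user"
$ export REPOSITORY_NAME="repository"
$ curl "https://api.github.com/repos/$REPOSITORY_OWNER/$REPOSITORY_NAME/actions/workflows/test.yml/dispatches" \
>   --header "Accept: application/vnd.github.v3+json" \
>   --user ":$GITHUB_TOKEN" \
>   --data '{"ref": "main"}'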

No status specified

$ curl "https://api.github.com/repos/$REPOSITORY_OWNER/$REPOSITORY_NAME/actions/runs" \
>   --header "Accept: application/vnd.github.v3+json" \
>   --user ":$GITHUB_TOKEN" |
> jq '.workflow_runs[] | .status'
"in_progress"
"completed"
...
"completed"

Setting status to completed

$ curl "https://api.github.com/repos/$REPOSITORY_OWNER/$REPOSITORY_NAME/actions/runs?status=completed" \
>   --header "Accept: application/vnd.github.v3+json" \
>   --user ":$GITHUB_TOKEN" |
> jq '.workflow_runs[] | .status'
"completed"
"completed"
...
"completed"

Setting status to in_progress (or any other value)

$ curl "https://api.github.com/repos/$REPOSITORY_OWNER/$REPOSITORY_NAME/actions/runs?status=in_progress" \
>   --header "Accept: application/vnd.github.v3+json" \
>   --user ":$GITHUB_TOKEN"
{
  "total_count": 0,
  "workflow_runs": [

  ]
}
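Presumably cml-runner relies on this filtered endpoint to decide whether it is idle, so the empty response above would explain the “Runner should be idle” errors in the logs. As a cross-check (not from the original report, and assuming only the filtered listing is affected), fetching a run directly by its id should still report the real status; RUN_ID is a placeholder for an id taken from the unfiltered listing:

$ curl "https://api.github.com/repos/$REPOSITORY_OWNER/$REPOSITORY_NAME/actions/runs/$RUN_ID" \
>   --header "Accept: application/vnd.github.v3+json" \
>   --user ":$GITHUB_TOKEN" |
> jq '.status'  # expected: "in_progress" while the job is still running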

I am having the same issue on several GitHub projects, but using the same AWS account. We can’t seem to run any workflows that make it past the allotted timeout.

https://discord.com/channels/485586884165107732/728693131557732403/963180471326490734

No worries @0x2b3bfa0, I’ll do that and get you the logs ASAP, but from what I can tell the issue is still occurring.