cml: `cml-runner` shut down due to idle timeout while still busy.

Hey everyone, I started getting this error a couple of weeks ago: jobs on GHA are failing with exit code 143. After a little digging, it seems the failures come from the --idle-timeout parameter, which defaults to 300 seconds. The runner apparently decides it’s idle and shuts down, even though a job is still running! Not sure why this is happening; setting --idle-timeout=-1 works around it, but I don’t want to disable the timeout entirely, for obvious reasons.
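For context, the runner is launched roughly like this (a minimal sketch, not my exact setup; the cloud flags and label are illustrative, and --idle-timeout is the parameter in question):

$ cml-runner \
>     --cloud=aws \
>     --cloud-region=eu-west-2 \
>     --cloud-type=m5.2xlarge \
>     --labels=cml-runner \
>     --idle-timeout=300  # the default; -1 disables the timeout entirely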

I’m using cml version 0.12.0 (which, from what I can see, addressed a similar issue: https://github.com/iterative/cml/issues/808).

Here are the logs from the runners:

root@ip-172-31-43-163:~# sudo journalctl -u cml.service -f
-- Logs begin at Tue 2022-02-22 16:02:55 UTC. --
Mar 24 17:10:29 ip-172-31-43-163 cml.sh[1869]: sshd override added, restarting daemon
Mar 24 17:10:30 ip-172-31-43-163 sudo[2799]: pam_unix(sudo:session): session closed for user root
Mar 24 17:10:37 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"Preparing workdir /tmp/tmp.NZ4U6HZb1S/.cml/cml-wdw16iotk7..."}
Mar 24 17:10:37 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"Launching github runner"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"EC2 id i-0c767456d3e8c3eec"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.327Z","level":"info","message":"runner status","repo":"https://github.com/continuum-industries/Pareto"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.328Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/continuum-industries/Pareto"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.910Z","level":"info","message":"runner status Current runner version: '2.289.1'","repo":"https://github.com/continuum-industries/Pareto"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.911Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/continuum-industries/Pareto","status":"ready"}
Mar 24 17:11:18 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:11:18.401Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (BASELINE_GREENLINK_ME_CASE_04, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:15:50 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}
Mar 24 17:16:40 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:16:40.360Z","job":"gh","level":"info","message":"runner status Job Run Optimisations (BASELINE_GREENLINK_ME_CASE_04, 0) completed with result: Succeeded","repo":"https://github.com/continuum-industries/Pareto","status":"job_ended","success":true}
Mar 24 17:16:42 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:16:42.954Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (OFFSHORE_PIPELINE, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:20:51 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}
Mar 24 17:21:09 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:21:09.794Z","job":"gh","level":"info","message":"runner status Job Run Optimisations (OFFSHORE_PIPELINE, 0) completed with result: Succeeded","repo":"https://github.com/continuum-industries/Pareto","status":"job_ended","success":true}
Mar 24 17:21:15 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:21:15.077Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (REAL_PIPELINE_MOATA_430, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:25:43 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:25:43.309Z","job":"gh","level":"info","message":"runner status Job Run Optimisations (REAL_PIPELINE_MOATA_430, 0) completed with result: Succeeded","repo":"https://github.com/continuum-industries/Pareto","status":"job_ended","success":true}
Mar 24 17:25:46 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:25:46.246Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (SCOTWIND_2, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:25:51 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}
Mar 24 17:30:51 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"runner status","reason":"timeout:300","status":"terminated"}
Mar 24 17:30:51 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"waiting 10 seconds before exiting..."}
Mar 24 17:31:01 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"Unregistering runner cml-wdw16iotk7..."}
Mar 24 17:31:03 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"\tFailed: Bad request - Runner \"cml-wdw16iotk7\" is still running a job\""}

Any idea what might be going on here?

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 3
  • Comments: 19 (14 by maintainers)

Most upvoted comments

the Ghost is back! 😢

@0x2b3bfa0 any luck on this? It’s causing a lot of problems with our CI pipeline, so it would be great to fix ASAP!

@thatGreekGuy96 My immediate workaround, which has worked in the past, is to extend the timeout another way: use --idle-timeout=21600 (6 h), i.e. longer than your expected runtime. This issue has previously been intermittent, and the root cause has been extremely hard to pin down. You can take a look here as well for previous context on something similar.
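In a GitHub Actions workflow, that workaround would look roughly like this (a sketch only; the cloud settings and secret names are placeholders for your own):

    deploy-runner:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v2
        - uses: iterative/setup-cml@v1
        - run: |
            cml-runner \
              --cloud=aws \
              --cloud-region=eu-west-2 \
              --cloud-type=m5.2xlarge \
              --labels=cml-runner \
              --token=${{ secrets.PERSONAL_ACCESS_TOKEN }} \
              --idle-timeout=21600
          env:
            AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
            AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}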

Sounds like some sort of regression introduced with #689… 🙈 I’ll take a look a bit later today and try to find out whether that’s the case.

Opened here: https://github.com/iterative/cml/issues/973 @0x2b3bfa0

I guess you can mark this issue as closed now.

@thatGreekGuy96, are you still experiencing those unexpected delays? If so, please open a new issue with the details. 🙏🏼

GitHub Response

Apologies for the trouble here.

Engineering is reporting that the underlying problem (which is related to Elasticsearch) has been fixed but the results returned from the filter might not be complete until we have a new index. This could take a couple of days, sorry.

There is a discussion about this in the community and Engineering has responded - https://github.community/t/workflow-call-has-an-inconsistent-status-in-github-rest-api/244164. A banner will also be displayed in the UI to alert you of this issue when you try to filter the workflow runs. Accessing the endpoint without any filter should return all the workflow runs correctly.

I’ll let you know once the index fully rebuilds.

@DavidGOrtega, this will interest you for sure. 😄

🔔 @thatGreekGuy96, @dacbd and @mattlbeck, it should be fixed with v0.14.1
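If you installed CML through npm, picking up the fix should be a one-liner (assuming the @dvcorg/cml package, which is how CML is distributed on npm):

$ npm install --global @dvcorg/cml@0.14.1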

I’m very sure that this is a new issue. I had the runner working for hours as a test.

Cause

Inconsistent behavior in the GitHub REST API:

Reproducing

  1. Create a GitHub repository and create a file named .github/workflows/test.yml with this content:

    on: workflow_dispatch
    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
        - run: sleep infinity
    
  2. Trigger the workflow with the workflow_dispatch event (instructions)

  3. Set the following environment variables:

    • GITHUB_TOKEN — personal access token
    • REPOSITORY_OWNER — user/organization that owns the repository; e.g. the user in github.com/user/repository
    • REPOSITORY_NAME — name of the repository; e.g. the repository in github.com/user/repository
  4. Run the following snippets
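For reference, steps 2 and 3 can also be done from a shell (the values below are placeholders, and the dispatch call assumes the default branch is main):

$ export GITHUB_TOKEN="<personal access token>"
$ export REPOSITORY_OWNER="user"
$ export REPOSITORY_NAME="repository"
$ curl "https://api.github.com/repos/$REPOSITORY_OWNER/$REPOSITORY_NAME/actions/workflows/test.yml/dispatches" \
>   --header "Accept: application/vnd.github.v3+json" \
>   --user ":$GITHUB_TOKEN" \
>   --data '{"ref": "main"}'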

No status specified

$ curl "https://api.github.com/repos/$REPOSITORY_OWNER/$REPOSITORY_NAME/actions/runs" \
>   --header "Accept: application/vnd.github.v3+json" \
>   --user ":$GITHUB_TOKEN" |
> jq '.workflow_runs[] | .status'
"in_progress"
"completed"
...
"completed"

Setting status to completed

$ curl "https://api.github.com/repos/$REPOSITORY_OWNER/$REPOSITORY_NAME/actions/runs?status=completed" \
>   --header "Accept: application/vnd.github.v3+json" \
>   --user ":$GITHUB_TOKEN" |
> jq '.workflow_runs[] | .status'
"completed"
"completed"
...
"completed"

Setting status to in_progress (or any other value)

$ curl "https://api.github.com/repos/$REPOSITORY_OWNER/$REPOSITORY_NAME/actions/runs?status=in_progress" \
>   --header "Accept: application/vnd.github.v3+json" \
>   --user ":$GITHUB_TOKEN"
{
  "total_count": 0,
  "workflow_runs": [

  ]
}
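Presumably cml-runner relies on this filtered endpoint to decide whether it is idle, so the empty response above would explain the “Runner should be idle” errors in the logs. As a cross-check (not from the original report, and assuming only the filtered listing is affected), fetching a run directly by its id should still report the real status; RUN_ID is a placeholder for an id taken from the unfiltered listing:

$ curl "https://api.github.com/repos/$REPOSITORY_OWNER/$REPOSITORY_NAME/actions/runs/$RUN_ID" \
>   --header "Accept: application/vnd.github.v3+json" \
>   --user ":$GITHUB_TOKEN" |
> jq '.status'  # expected: "in_progress" while the job is still running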

I am having the same issue on several GitHub projects, but using the same AWS account. We can’t seem to run any workflows that make it past the allotted timeout.

https://discord.com/channels/485586884165107732/728693131557732403/963180471326490734

No worries @0x2b3bfa0, I’ll do that and get you the logs ASAP, but from what I can tell the issue is still occurring.