cml: `cml-runner` shut down due to idle timeout while still busy.
Hey everyone, started getting this error a couple of weeks ago, where jobs on GHA are failing with 143
error codes. Did a little digging and it seems like the error is because of the --idle-timeout
parameter, which is by default set to 300
.
Seems like the job thinks it’s idle and gets shut down, but in fact it is actually still running!
Not sure why this is happening, setting --idle-timeout=-1
resolves the issue, but I don’t want to do that for obvious reasons.
I’m using cml version 0.12.0
(which addressed a similar issue from what I can see? -> https://github.com/iterative/cml/issues/808 ).
Here are the logs from the runners:
root@ip-172-31-43-163:~# sudo journalctl -u cml.service -f
-- Logs begin at Tue 2022-02-22 16:02:55 UTC. --
Mar 24 17:10:29 ip-172-31-43-163 cml.sh[1869]: sshd override added, restarting daemon
Mar 24 17:10:30 ip-172-31-43-163 sudo[2799]: pam_unix(sudo:session): session closed for user root
Mar 24 17:10:37 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"Preparing workdir /tmp/tmp.NZ4U6HZb1S/.cml/cml-wdw16iotk7..."}
Mar 24 17:10:37 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"Launching github runner"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"EC2 id i-0c767456d3e8c3eec"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.327Z","level":"info","message":"runner status","repo":"https://github.com/continuum-industries/Pareto"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.328Z","level":"info","message":"runner status √ Connected to GitHub","repo":"https://github.com/continuum-industries/Pareto"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.910Z","level":"info","message":"runner status Current runner version: '2.289.1'","repo":"https://github.com/continuum-industries/Pareto"}
Mar 24 17:10:50 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:10:50.911Z","level":"info","message":"runner status Listening for Jobs","repo":"https://github.com/continuum-industries/Pareto","status":"ready"}
Mar 24 17:11:18 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:11:18.401Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (BASELINE_GREENLINK_ME_CASE_04, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:15:50 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}
Mar 24 17:16:40 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:16:40.360Z","job":"gh","level":"info","message":"runner status Job Run Optimisations (BASELINE_GREENLINK_ME_CASE_04, 0) completed with result: Succeeded","repo":"https://github.com/continuum-industries/Pareto","status":"job_ended","success":true}
Mar 24 17:16:42 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:16:42.954Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (OFFSHORE_PIPELINE, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:20:51 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}
Mar 24 17:21:09 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:21:09.794Z","job":"gh","level":"info","message":"runner status Job Run Optimisations (OFFSHORE_PIPELINE, 0) completed with result: Succeeded","repo":"https://github.com/continuum-industries/Pareto","status":"job_ended","success":true}
Mar 24 17:21:15 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:21:15.077Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (REAL_PIPELINE_MOATA_430, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:25:43 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:25:43.309Z","job":"gh","level":"info","message":"runner status Job Run Optimisations (REAL_PIPELINE_MOATA_430, 0) completed with result: Succeeded","repo":"https://github.com/continuum-industries/Pareto","status":"job_ended","success":true}
Mar 24 17:25:46 ip-172-31-43-163 cml.sh[1869]: {"date":"2022-03-24T17:25:46.246Z","job":"gh","level":"info","message":"runner status Running job: Run Optimisations (SCOTWIND_2, 0)","repo":"https://github.com/continuum-industries/Pareto","status":"job_started"}
Mar 24 17:25:51 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"Runner should be idle. Resetting jobs. Retrying in 300 secs"}
Mar 24 17:30:51 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"runner status","reason":"timeout:300","status":"terminated"}
Mar 24 17:30:51 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"waiting 10 seconds before exiting..."}
Mar 24 17:31:01 ip-172-31-43-163 cml.sh[1869]: {"level":"info","message":"Unregistering runner cml-wdw16iotk7..."}
Mar 24 17:31:03 ip-172-31-43-163 cml.sh[1869]: {"level":"error","message":"\tFailed: Bad request - Runner \"cml-wdw16iotk7\" is still running a job\""}
Any idea what might be going on here?
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 19 (14 by maintainers)
the Ghost is back! 😢
@thatGreekGuy96 My immediate bypass that has worked in the past is to extend the timeout in a different way, use
--idle-timeout=21600
/ 6h ie greater than what your expected runtime will be. Previously this issue has been intermittent and extremely hard to find what the root cause is. you can take a look here as well for the previous context of something similarSounds like some sort of regression introduced with #689… 🙈 I’ll take a look a bit later today and try to find if that’s the case.
Opened here: https://github.com/iterative/cml/issues/973 @0x2b3bfa0
I guess you can mark this issue as closed now.
@thatGreekGuy96, are you still experiencing those unexpected delays? If so, please open a new issue with the details. 🙏🏼
GitHub Response
@DavidGOrtega, this will interest you for sure. 😄
🔔 @thatGreekGuy96, @dacbd and @mattlbeck, it should be fixed with
v0.14.1
Im very sure that this is a new issue. I had the runner working for hours as a test.
Cause
Inconsistent behavior in the GitHub REST API:
/repos/{owner}/{repo}/actions/runs
returns a list of runs, some of them with{status: "in_progress"}
/repos/{owner}/{repo}/actions/runs?status=completed
returns a list of runs with{status: "completed"}
/repos/{owner}/{repo}/actions/runs?status=in_progress
always returns an empty list 🤯Reproducing
Create a GitHub repository and create a file named
.github/workflows/test.yml
with this content:Trigger the workflow with the
workflow_dispatch
event (instructions)Set the following environment variables:
GITHUB_TOKEN
— personal access tokenREPOSITORY_OWNER
— user/organization who owns the repository; e.g.github.com/user/repository
REPOSITORY_NAME
— name of the repository; e.g.github.com/user/repository
Run the following snippets
No
status
specifiedSetting
status
tocompleted
Setting
status
toin_progress
(or any other value)I am having the same issue on several GitHub projects, but using the same AWS account. We can’t seem to run any workflows that make it past the alloted timeout.
https://discord.com/channels/485586884165107732/728693131557732403/963180471326490734
No worries @0x2b3bfa0, i’ll do that asap. I’ll get you the logs asap but from what I can tell the issue is still occurring.