cml: CML seemingly fails to restart job after AWS Spot instances have been shut down
Hey everyone! I noticed a couple of days ago that CML now has functionality that allows it to restart workflows if one or more AWS spot runners have been told to shut down. However, this doesn’t seem to be happening for me.
A couple of details about our case:
- Our cloud is AWS
- We’re (as far as I can tell) using the `latest` version of CML to deploy a bunch of runners, as shown below:
```yaml
deploy_runners:
  name: Deploy Cloud Instances
  needs: [setup_config]
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v2
    - uses: iterative/setup-cml@v1
      with:
        version: latest
    - name: "Deploy runner on EC2"
      shell: bash
      env:
        repo_token: ${{ secrets.ACCESS_TOKEN_CML }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_TESTING }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_TESTING }}
        CASE_NAME: ${{ matrix.case_name }}
        N_RUNNERS: ${{ fromJson(needs.setup_config.outputs.json_string).n_runners }}
      run: |
        for (( i=1; i<=N_RUNNERS; i++ ))
        do
          echo "Deploying runner ${i}"
          cml-runner \
            --cloud aws \
            --cloud-region eu-west-2 \
            --cloud-type=m \
            --cloud-hdd-size 100 \
            --cloud-spot \
            --labels=cml-runner &
        done
        wait
        echo "Deployed ${N_RUNNERS} runners."
```
- The job each runner runs does not use the CML images provided by Iterative.
- The job that each runner runs has `continue-on-error` set to `false` (wondering whether that is interfering with CML?):
```yaml
run_optimisation:
  continue-on-error: false
  strategy:
    matrix: ${{ fromJson(needs.setup_config.outputs.json_string).matrix }}
    fail-fast: true
  runs-on: [self-hosted, "cml-runner"]
  container:
    image: python:3.8.10-slim
    volumes:
      - /dev/shm:/dev/shm
```
Awesome, thank you!
I ran another test and now I’m a little confused: I followed your instructions again and got a similar log (log.txt).
On the EC2 console, I can see that the instance in question has indeed been terminated.
HOWEVER, the spot request that created it still has a “fulfilled” status AND the GitHub Actions job is still running…
I hope this makes more sense to you than it does to me!
Update: After about 5 minutes, the spot request was marked as `terminated-by-user`, but the GitHub Actions job is still running… As far as I can tell, no new spot requests have been made. This is what I got from the logs: log.txt
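For reference, the spot request state can also be checked from a terminal with the AWS CLI (assuming credentials for the same account and region are configured locally):

```bash
# List spot requests in the region with their current state and status code
aws ec2 describe-spot-instance-requests \
  --region eu-west-2 \
  --query "SpotInstanceRequests[].[SpotInstanceRequestId,State,Status.Code]" \
  --output table
```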
Interestingly though, the GitHub Actions job didn’t crash! It’s still running as far as I can tell. Is this the expected behaviour?
Graceful shutdown issues are really fun to debug; see https://github.com/iterative/terraform-provider-iterative/issues/90 for an example. 🙃 It would be awesome if you could reproduce this issue when spawning a single runner and follow the instructions below to see what’s failing.
Generate a new RSA PEM private key on your local system for debugging purposes:
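For example, a throwaway key generated with `ssh-keygen` (any tool that produces an RSA key in PEM format should work):

```bash
# Create an unencrypted 4096-bit RSA key in PEM format at ./key.pem;
# the public half is written to ./key.pem.pub
ssh-keygen -t rsa -b 4096 -m PEM -f key.pem -N ""
```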
Store the contents of `key.pem` as a repository secret named `INSTANCE_PRIVATE_SSH_KEY`, as per the workflow below. Run the following workflow and retrieve the instance IP address from the logs of the `deploy` step:
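A minimal single-runner sketch, assuming the same secrets as your workflow above and using CML’s `--cloud-ssh-private` option to inject the debugging key (adjust names to taste):

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: iterative/setup-cml@v1
        with:
          version: latest
      - name: deploy
        shell: bash
        env:
          repo_token: ${{ secrets.ACCESS_TOKEN_CML }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_TESTING }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_TESTING }}
        run: |
          # Spawn a single spot runner; the instance details (including its
          # IP address) appear in this step's output.
          cml-runner \
            --cloud aws \
            --cloud-region eu-west-2 \
            --cloud-type=m \
            --cloud-spot \
            --cloud-ssh-private="${{ secrets.INSTANCE_PRIVATE_SSH_KEY }}" \
            --labels=cml-runner
```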
Monitor the instance logs from your local system by using the generated key as an identity file:
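For example, assuming an Ubuntu-based image (default user `ubuntu`) and that the runner’s output ends up in the system journal:

```bash
# Follow the instance's system journal over SSH, using key.pem as the
# identity file; replace INSTANCE_IP with the address from the deploy step logs
ssh -i key.pem ubuntu@INSTANCE_IP "sudo journalctl --follow"
```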
Once the spot instance gets evicted, take a look at the logs and attach them to this issue.