cml: Instances intermittently fail to terminate
I’ve had a couple of instances recently that have failed to terminate. In the most recent case this was with the --reuse flag set, after running a series of 8 queued jobs.
The instance is sitting idle, even though its 60s idle timeout passed ten minutes ago. I’ll need to terminate the instance manually from the command line.
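For anyone else hitting this, manual termination from the command line might look roughly like the sketch below. It assumes the AWS CLI is installed and configured with credentials that are allowed to terminate the instance. The function name `terminate_instance` is hypothetical, and the default region is taken from the `AWS_REGION` value in the workflow.

```shell
# Hypothetical helper: terminate one EC2 instance by ID, then block until
# AWS reports it as terminated. Assumes a configured AWS CLI.
terminate_instance() {
  instance_id="$1"
  region="${2:-eu-west-1}"  # assumption: same region as AWS_REGION in the workflow

  aws ec2 terminate-instances --region "$region" --instance-ids "$instance_id"
  aws ec2 wait instance-terminated --region "$region" --instance-ids "$instance_id"
}

# Example call (the instance ID is a placeholder, not a real one):
#   terminate_instance i-0123456789abcdef0 eu-west-1
```

This only cleans up after the fact; it does not address why the idle timeout failed to fire.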
In the most serious case, I had an instance run for two weeks without terminating. It took so long for us to notice because the instance name did not get set to cml-* as usual.
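Since the missing cml-* name is what hid the instance, a periodic check for running instances without that name could catch this sooner. The following is a minimal sketch, assuming a configured AWS CLI; the function name `list_unnamed_runners` is hypothetical, and it will also list any unrelated running instances in the region that are not named cml-*.

```shell
# Hypothetical helper: print running EC2 instances whose Name tag does not
# start with "cml-" (instances with no Name tag show up as "None").
# Output columns are tab-separated: instance ID, launch time, Name tag.
list_unnamed_runners() {
  region="${1:-eu-west-1}"  # assumption: same region as AWS_REGION in the workflow

  aws ec2 describe-instances \
    --region "$region" \
    --filters "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].[InstanceId, LaunchTime, Tags[?Key==`Name`] | [0].Value]' \
    --output text |
    awk -F '\t' '$3 !~ /^cml-/'
}
```

The launch time column makes it easy to spot anything that has been up far longer than a training run should take.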
Here’s the yml we are using:
name: train and evaluate rasa model
on:
  pull_request:
    types: [opened, synchronize]
  workflow_dispatch:
jobs:
  deploy-runner:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1
      - name: deploy
        shell: bash
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          cml-runner \
            --cloud aws \
            --cloud-region eu-west \
            --cloud-type=c5a.4xlarge \
            --cloud-spot true \
            --labels=cml-runner,voice-control,oms-rasa-2 \
            --idle-timeout 60 \
            --reuse
  model-training:
    needs: deploy-runner
    runs-on: [self-hosted,cml-runner]
    container: docker://dvcorg/cml:0-dvc2-base1
    steps:
      - uses: actions/checkout@v2
        with:
          ref: ${{ github.event.pull_request.head.sha }}
      - uses: actions/setup-python@v2
        with:
          python-version: '3.8.5'
      - name: Install dependencies
        run: |
          apt-get update -y
          apt-get install make python3-pip virtualenv curl
      - name: cml
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_REGION: eu-west-1
        run: |
          python --version
          make virtualenv
          dvc repro
          echo "## Metrics" > report.md
          git fetch --prune
          dvc metrics diff main --show-md | grep "Change\|\-\-\-" >> report.md
          dvc metrics diff main --show-md | grep -E "(intent|entity|action).*weighted" | sort >> report.md
          sed "s/results\///g" -i report.md
          cml-send-comment report.md
          dvc push
      - uses: actions/upload-artifact@v2
        with:
          name: gh-artifact-${{ github.event.pull_request.head.sha }}
          path: |
            report.md
            results
          retention-days: 30
      - uses: EndBug/add-and-commit@v7
        if: ${{ github.ref != 'refs/heads/main' }} && ${{ github.ref != 'refs/heads/rasax/prod' }}
        with:
          add: 'dvc.lock --force'
          pull_strategy: 'NO-PULL'
          message: 'chg: dvc repro'
About this issue
- State: closed
- Created 3 years ago
- Comments: 24 (12 by maintainers)
Amazing, great work @DavidGOrtega 👏
I have observed this too, with instances started with --single staying up for days. As a workaround I am now using echo "/sbin/poweroff" | /usr/bin/at now + 60 min on startup to schedule a shutdown. (I have also had the no-name issue happen once.)
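The workaround above can be wrapped in a small function so the delay is configurable. This is a sketch only: it assumes `at` is installed and the `atd` daemon is running on the instance, that the script runs as a user allowed to power off the machine, and the function name `schedule_poweroff` is hypothetical. How the script gets onto the instance at startup is left open.

```shell
# Hypothetical helper: queue a one-off poweroff job with at(1).
# The default delay of 60 minutes matches the workaround quoted above.
schedule_poweroff() {
  minutes="${1:-60}"
  echo "/sbin/poweroff" | at now + "$minutes" min
}

# Example call for a startup script:
#   schedule_poweroff 60
```

Note that this is a hard deadline rather than an idle timeout, so the delay has to be longer than the longest expected job, otherwise the instance will power off mid-run.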
I am also seeing workflows that have already completed successfully get restarted, which is probably related to this and #583.
That's very weird, unless you set the name yourself. If you never did that, I would say that's not possible 🤔 What name did it have?