cml: Instances intermittently fail to terminate

I’ve had a couple of instances recently that failed to terminate. In the most recent case the runner was started with the --reuse flag and had worked through a series of 8 queued jobs.

The instance is now sitting idle: its 60-second idle timeout expired ten minutes ago, yet it is still running. I’ll need to terminate it manually from the command line.

In the most serious case, an instance ran for two weeks without terminating. It took us that long to notice because the instance name did not get set to cml-* as usual, so it was easy to miss.
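
For anyone else cleaning these up by hand, this is roughly what I do with the AWS CLI (a minimal sketch; the Name-tag filter and the instance ID are illustrative placeholders):

# List instances carrying the usual cml-* Name tag. An instance that
# missed its name will not show up here and has to be found by hand.
aws ec2 describe-instances \
  --region eu-west-1 \
  --filters "Name=tag:Name,Values=cml-*" \
  --query "Reservations[].Instances[].[InstanceId,State.Name,LaunchTime]" \
  --output table

# Terminate the stray instance by ID (placeholder ID).
aws ec2 terminate-instances --region eu-west-1 --instance-ids i-0123456789abcdef0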

Here’s the yml we are using:

name: train and evaluate rasa model

on:
  pull_request:
    types: [opened, synchronize]
  workflow_dispatch:

jobs:
  deploy-runner:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1

      - name: deploy
        shell: bash
        env:
          REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          cml-runner \
            --cloud aws \
            --cloud-region eu-west \
            --cloud-type=c5a.4xlarge \
            --cloud-spot true \
            --labels=cml-runner,voice-control,oms-rasa-2 \
            --idle-timeout 60 \
            --reuse

  model-training:
    needs: deploy-runner
    runs-on: [self-hosted, cml-runner]
    container: docker://dvcorg/cml:0-dvc2-base1

    steps:
    - uses: actions/checkout@v2
      with: 
        ref: ${{ github.event.pull_request.head.sha }}

    - uses: actions/setup-python@v2
      with:
        python-version: '3.8.5'
    - name: Install dependencies
      run: |
        apt-get update -y
        apt-get install -y make python3-pip virtualenv curl
    - name: cml
      env:
        REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        AWS_REGION: eu-west-1
      run: |
        python --version
        make virtualenv
        dvc repro
        echo "## Metrics" > report.md
        git fetch --prune
        dvc metrics diff main --show-md | grep "Change\|\-\-\-" >> report.md
        dvc metrics diff main --show-md | grep -E "(intent|entity|action).*weighted" | sort >> report.md
        sed "s/results\///g" -i report.md
        cml-send-comment report.md
        dvc push

    - uses: actions/upload-artifact@v2
      with:
        name: gh-artifact-${{ github.event.pull_request.head.sha }}
        path: |
          report.md
          results
        retention-days: 30
        
    - uses: EndBug/add-and-commit@v7
      if: ${{ github.ref != 'refs/heads/main' && github.ref != 'refs/heads/rasax/prod' }}
      with:
        add: 'dvc.lock --force'
        pull_strategy: 'NO-PULL'
        message: 'chg: dvc repro'
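
As a side note, when debugging this it helps to cross-check which runners GitHub still considers registered against what is actually running in EC2. Something along these lines works (a sketch; OWNER/REPO are placeholders, and the token needs admin access to the repository):

# List self-hosted runners registered on the repo and show their status.
curl -s \
  -H "Authorization: token $PERSONAL_ACCESS_TOKEN" \
  https://api.github.com/repos/OWNER/REPO/actions/runners \
  | jq '.runners[] | {name, status, busy}'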

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 24 (12 by maintainers)

Most upvoted comments

Amazing, great work @DavidGOrtega 👏

I have observed this too, with instances started with --single staying up for days. As a workaround I am now using echo "/sbin/poweroff" | /usr/bin/at now + 60 min on startup to schedule a shutdown.
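
For reference, the failsafe looks roughly like this on the instance (assuming the at daemon is installed; the 60-minute window is arbitrary and should outlast your longest job):

# Schedule an unconditional poweroff 60 minutes from now, in case the
# runner never terminates itself. Requires atd to be running.
echo "/sbin/poweroff" | /usr/bin/at now + 60 min

# Pending jobs can be listed with atq and cancelled with atrm.
atq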

(I have also had the no-name issue happen once).

I am also seeing that workflows that have already completed successfully get restarted, which is probably related to this and #583.

the instance name did not get set to cml-* as usual.

That’s very weird, unless you set the name up yourself. If you never did that, I would say it shouldn’t be possible 🤔 What was the name it had?