cml: CML seemingly fails to restart job after AWS Spot instances have been shut down

Hey everyone! I noticed a couple of days ago that CML now has new functionality that allows it to restart workflows if one or more AWS spot runners are told to shut down. However, this doesn’t seem to be happening for me.

A couple of details about our case:

  • Our cloud is AWS
  • We’re (as far as I can tell) using the latest version of CML to deploy a bunch of runners, as shown below (see also the version check sketched after these snippets).
  deploy_runners:
    name: Deploy Cloud Instances
    needs: [setup_config]

    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1
        with:
          version:  latest

      - name: "Deploy runner on EC2"
        shell: bash
        env:
          repo_token: ${{ secrets.ACCESS_TOKEN_CML }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_TESTING }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_TESTING }}
          CASE_NAME: ${{ matrix.case_name }}
          N_RUNNERS: ${{ fromJson(needs.setup_config.outputs.json_string).n_runners }}

        run: |
          for (( i=1; i<=N_RUNNERS; i++ ))
          do
            echo "Deploying runner ${i}"
            cml-runner \
            --cloud aws \
            --cloud-region eu-west-2 \
            --cloud-type=m \
            --cloud-hdd-size 100 \
            --cloud-spot \
            --labels=cml-runner &
          done
          wait
          echo "Deployed ${N_RUNNERS} runners."
  • The job each runner runs does not use the CML images provided by iterative
  • The job that each runner runs has continue-on-error set to false (could that be interfering with CML?), as shown below.
  run_optimisation:
    continue-on-error: false
    strategy:
      matrix: ${{fromJson(needs.setup_config.outputs.json_string).matrix}}
      fail-fast: true

    runs-on: [self-hosted, "cml-runner"]
    container:
      image: python:3.8.10-slim
      volumes:
          - /dev/shm:/dev/shm
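
On the “latest version” point above: a quick sanity check is to print the version that setup-cml actually installed, from the same deploy step. This is a minimal sketch, and it assumes the cml-runner binary exposes the standard --version flag:

  # hypothetical sanity check: confirm which cml-runner binary the deploy job
  # is using, and which version, before launching the spot runners
  which cml-runner
  cml-runner --version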

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 15 (9 by maintainers)

Most upvoted comments

Awesome, thank you!

I ran another test and now I’m a little confused: I followed your instructions again and got a similar log (log.txt).

On the EC2 console, I can see that the instance in question has indeed been terminated.

HOWEVER, the spot request that created it still has a “fulfilled” status AND the GitHub Actions job is still running…

I hope this makes more sense to you than it does to me!

Update: After about 5 minutes, the spot request was marked as terminated-by-user, but the GitHub Actions job is still running… As far as I can tell, no new spot requests have been made.
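
For reference, the spot request lifecycle can also be inspected from the command line with the AWS CLI. A sketch, assuming the CLI is configured for the same account and the eu-west-2 region used by the runners:

    # list spot instance requests with their current state/status codes, to see
    # whether the original request was closed and whether any new ones appeared
    aws ec2 describe-spot-instance-requests \
      --region eu-west-2 \
      --query 'SpotInstanceRequests[].{id:SpotInstanceRequestId,state:State,status:Status.Code,instance:InstanceId}' \
      --output table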

This is what I got from the logs: log.txt

Interestingly though, the GitHub Actions job didn’t crash! It’s still running as far as I can tell. Is this the expected behaviour?
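
One way to check whether the runner itself survived (rather than just the job) is to list the self-hosted runners registered on the repository. A sketch using the GitHub CLI, with OWNER/REPO as placeholders:

    # list registered self-hosted runners and whether GitHub still sees them
    # as online and busy
    gh api repos/OWNER/REPO/actions/runners \
      --jq '.runners[] | {name, status, busy}'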

Graceful shutdown issues are really fun to debug; see https://github.com/iterative/terraform-provider-iterative/issues/90 for an example. 🙃 It would be awesome if you could reproduce this issue when spawning a single runner and follow the instructions below to see what’s failing.

  1. Generate a new RSA PEM private key on your local system for debugging purposes:

    $ ssh-keygen -t rsa -m pem -b 4096 -f key.pem
    
  2. Store the contents of key.pem as a repository secret named INSTANCE_PRIVATE_SSH_KEY, as per the workflow below (a command-line shortcut for this step is sketched after these instructions).

  3. Run the following workflow and retrieve the instance IP address from the logs of the deploy step:

    jobs:
      deploy:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2
          - uses: iterative/setup-cml@v1
          - run: >-
              cml-runner
              --labels=debug
              --cloud=aws
              --cloud-type=m
              --cloud-hdd-size=100
              --cloud-region=eu-west-2
              --cloud-ssh-private="$INSTANCE_PRIVATE_SSH_KEY"
              --cloud-spot
            env:
              REPO_TOKEN: ${{ secrets.ACCESS_TOKEN_CML }}
              AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_TESTING }}
              AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_TESTING }}
              INSTANCE_PRIVATE_SSH_KEY: ${{ secrets.INSTANCE_PRIVATE_SSH_KEY }}
      run:
        needs: deploy
        runs-on: [self-hosted, debug]
        steps:
          - run: cat
    
  4. Monitor the instance logs from your local system by using the generated key as an identity file (see also the metadata check sketched after these steps):

    $ ssh -i key.pem ubuntu@IP_ADDRESS journalctl --follow --unit cml
    
  5. Once the spot instance gets evicted, take a look at the logs and attach them to this issue.
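
Regarding step 2: the secret can also be created from the command line with the GitHub CLI instead of the web UI. A sketch, assuming gh is authenticated against the repository:

    # read key.pem from stdin and store it as the INSTANCE_PRIVATE_SSH_KEY secret
    gh secret set INSTANCE_PRIVATE_SSH_KEY < key.pem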
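
And regarding steps 4 and 5: while journalctl is running, the spot interruption notice can be watched from the same SSH session via the instance metadata service. A sketch assuming IMDSv1 is reachable from the instance; the endpoint returns 404 until AWS schedules the eviction, then a small JSON payload with the planned action and time:

    # poll the spot interruption notice endpoint; the status code flips from
    # 404 to 200 once termination has been scheduled
    while true; do
      curl -s -o /dev/null -w '%{http_code}\n' \
        http://169.254.169.254/latest/meta-data/spot/instance-action
      sleep 5
    done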