cml: cml-py3 doesn't see GPU resources

When I try to run a train task on our own AWS infrastructure, the following error is raised on GitHub Actions:

...
cfed3d9d6c7f: Pull complete
b5f3fa781593: Pull complete
53448e1579d7: Pull complete
c17eb7b4b5ac: Pull complete
25af3821284d: Pull complete
ea9f7c675b08: Pull complete
6522e7c5ced1: Pull complete
5fb2b6b033bf: Pull complete
1d90b6421d53: Pull complete
5d8a82854f4e: Pull complete
6fa3b0a92e5c: Pull complete
Digest: sha256:2e99adfe066a4383e3d391e5d4f1fbebc37b2c3d8f33ab883e810b35dd771965
Status: Downloaded newer image for dvcorg/cml-py3:latest
dfae88d60614134c3aeb2dc9095356b8cd545e1ad521f7db6575b518fe3ad679
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
About to remove cml1596929922
WARNING: This action will delete both local reference and remote instance.
Successfully removed cml1596929922
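
The "could not select device driver "" with capabilities: [[gpu]]" line is the message Docker prints when the daemon handling the docker run --gpus all call has no NVIDIA runtime available to it. A quick way to check this on whichever host the Docker CLI is currently pointed at is sketched below; the CUDA image tag is only an example and not part of the original workflow:

# Should list an "nvidia" entry among the registered runtimes
docker info --format '{{json .Runtimes}}'

# Smoke test for GPU passthrough (example image tag; any CUDA base image works)
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi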

The workflow looks like this:

name: train-model

on: [push]

jobs:
  deploy-cloud-runner:
    runs-on: [service-catalog, linux, x64]
    container: docker://dvcorg/cml

    steps:
      - name: deploy
        env:
          repo_token: ${{ secrets.REPO_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_EC2 }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_EC2 }}
        run: |
          echo "Deploying..."
          distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
          curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
          curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
          apt-get update && apt-get install -y nvidia-container-toolkit
          RUNNER_LABELS="cml,aws"
          RUNNER_REPO="https://github.com/${GITHUB_REPOSITORY}"
          MACHINE="cml$(date +%s)"
          docker-machine create \
            --driver amazonec2 \
            --amazonec2-instance-type g3s.xlarge \
            --amazonec2-vpc-id vpc-xxxxxxxx \
            --amazonec2-region eu-west-1 \
            --amazonec2-zone "a" \
            --amazonec2-ssh-user ubuntu \
            --amazonec2-ami ami-089cc16f7f08c4457 \
            --amazonec2-root-size 10 \
            $MACHINE
          eval "$(docker-machine env --shell sh $MACHINE)"
          (
          docker-machine ssh $MACHINE "sudo mkdir -p \
            /docker_machine && \
          sudo chmod 777 /docker_machine" && \
          docker-machine scp -r -q ~/.docker/machine/ \
            $MACHINE:/docker_machine && \
          docker run --name runner --gpus all -d \
            -v /docker_machine/machine:/root/.docker/machine \
            -e DOCKER_MACHINE=$MACHINE \
            -e repo_token=$repo_token \
            -e RUNNER_LABELS=$RUNNER_LABELS \
            -e RUNNER_REPO=$RUNNER_REPO \
            -e RUNNER_IDLE_TIMEOUT=120 \
            dvcorg/cml-py3:latest && \
          sleep 20 && echo "Deployed $MACHINE"
          ) || (docker-machine rm -y -f $MACHINE && exit 1)
  train:
# ....

We run the tests on a self-hosted runner.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 21 (12 by maintainers)

Most upvoted comments

Thanks a lot @DavidGOrtega, the issue was indeed related to the token (its name no longer matched the workflow script). Now it's fixed and the workflow works like a charm.

@pommedeterresautee (love your nick) checking

@pommedeterresautee

Your runner is being created to listen for the tags RUNNER_LABELS="cml,aws", and your job runs on runs-on: [self-hosted,cml], so it's missing a tag and there is no runner to pick it up.

All of the labels must match.
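
In other words, every label listed in the job's runs-on must also be present on the registered runner. A minimal sketch of a matching train job, assuming the runner keeps RUNNER_LABELS="cml,aws" from the workflow above (the exact label set and job contents here are illustrative, not from the original workflow):

  train:
    needs: deploy-cloud-runner
    runs-on: [self-hosted, cml, aws]
    # ... training steps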