cml: cml-py3 doesn't see GPU resources
When I try to run a train task on our own AWS infrastructure, the following error is raised in GitHub Actions:
```
...
cfed3d9d6c7f: Pull complete
b5f3fa781593: Pull complete
53448e1579d7: Pull complete
c17eb7b4b5ac: Pull complete
25af3821284d: Pull complete
ea9f7c675b08: Pull complete
6522e7c5ced1: Pull complete
5fb2b6b033bf: Pull complete
1d90b6421d53: Pull complete
5d8a82854f4e: Pull complete
6fa3b0a92e5c: Pull complete
Digest: sha256:2e99adfe066a4383e3d391e5d4f1fbebc37b2c3d8f33ab883e810b35dd771965
Status: Downloaded newer image for dvcorg/cml-py3:latest
dfae88d60614134c3aeb2dc9095356b8cd545e1ad679
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
About to remove cml1596929922
WARNING: This action will delete both local reference and remote instance.
Successfully removed cml1596929922
```
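The `could not select device driver "" with capabilities: [[gpu]]` message comes from the Docker daemon that actually starts the container (here, the EC2 daemon selected via `docker-machine env`), and it usually means that daemon has no NVIDIA container runtime registered. A quick diagnostic sketch, to run against the same daemon the workflow targets (the `nvidia/cuda` image tag is only an example, not something from this thread):

```shell
# List the runtimes the targeted daemon knows about;
# a working setup shows an "nvidia" entry here.
docker info --format '{{json .Runtimes}}'

# If the NVIDIA Container Toolkit is installed on that host,
# a throwaway CUDA container should be able to see the GPU.
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```

Note that in the workflow below the toolkit is installed inside the GitHub Actions container, while `--gpus all` is interpreted by the remote daemon on the `g3s.xlarge` instance, so the check above is only meaningful after `eval "$(docker-machine env ...)"`.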
The workflow looks like this:
```yaml
name: train-model
on: [push]
jobs:
  deploy-cloud-runner:
    runs-on: [service-catalog, linux, x64]
    container: docker://dvcorg/cml
    steps:
      - name: deploy
        env:
          repo_token: ${{ secrets.REPO_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_EC2 }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_EC2 }}
        run: |
          echo "Deploying..."
          distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
          curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
          curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
          apt-get update && apt-get install -y nvidia-container-toolkit
          RUNNER_LABELS="cml,aws"
          RUNNER_REPO="https://github.com/${GITHUB_REPOSITORY}"
          MACHINE="cml$(date +%s)"
          docker-machine create \
            --driver amazonec2 \
            --amazonec2-instance-type g3s.xlarge \
            --amazonec2-vpc-id vpc-xxxxxxxx \
            --amazonec2-region eu-west-1 \
            --amazonec2-zone "a" \
            --amazonec2-ssh-user ubuntu \
            --amazonec2-ami ami-089cc16f7f08c4457 \
            --amazonec2-root-size 10 \
            $MACHINE
          eval "$(docker-machine env --shell sh $MACHINE)"
          (
            docker-machine ssh $MACHINE "sudo mkdir -p /docker_machine && \
              sudo chmod 777 /docker_machine" && \
            docker-machine scp -r -q ~/.docker/machine/ $MACHINE:/docker_machine && \
            docker run --name runner --gpus all -d \
              -v /docker_machine/machine:/root/.docker/machine \
              -e DOCKER_MACHINE=$MACHINE \
              -e repo_token=$repo_token \
              -e RUNNER_LABELS=$RUNNER_LABELS \
              -e RUNNER_REPO=$RUNNER_REPO \
              -e RUNNER_IDLE_TIMEOUT=120 \
              dvcorg/cml-py3:latest && \
            sleep 20 && echo "Deployed $MACHINE"
          ) || (docker-machine rm -y -f $MACHINE && exit 1)
  train:
    # ....
```
We run the tests on a self-hosted runner.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 21 (12 by maintainers)
Thank you a lot @DavidGOrtega, the issue was indeed related to the token (its name no longer matched the one used in the workflow script). Now it's fixed and the workflow works like a charm.
@pommedeterresautee (love your nick) checking
@pommedeterresautee
Your runner is being created to listen for the labels `RUNNER_LABELS="cml,aws"`, but your job runs on `runs-on: [self-hosted, cml]`, so it is missing a label and there is no runner for it. All the labels must match.
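A minimal sketch of what that matching rule implies for the elided `train` job (an assumption on my part; the thread does not show the actual job). `cml` and `aws` come from the workflow's `RUNNER_LABELS`, and `self-hosted` is the label GitHub adds to every self-hosted runner, so every entry in `runs-on` is present on the runner:

```yaml
  train:
    # Each label listed here must exist on the runner; the runner
    # registered with RUNNER_LABELS="cml,aws" carries the labels
    # self-hosted, cml, and aws, so this job can be picked up.
    runs-on: [self-hosted, cml, aws]
```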