cml: Can't use AWS instance GPU on GitLab CI with cml-runner

I have this .gitlab-ci.yml:

stages:
  - test
  - deploy
  - train

sast:
  stage: test
include:
- template: Security/SAST.gitlab-ci.yml

deploy_job:
  stage: deploy
  when: always
  image: iterativeai/cml:0-dvc2-base1
  script:
    - cml-runner
      --cloud aws
      --cloud-region us-east-1
      --cloud-type g3.4xlarge
      --cloud-hdd-size 64
      --cloud-aws-security-group="cml-runners-sg"
      --labels=cml-runner-gpu
      --idle-timeout=120

train_job:
  stage: train
  when: on_success
  image: iterativeai/cml:0-dvc2-base1-gpu
  tags:
    - cml-runner-gpu
  before_script:
    - pip install poetry
    - poetry --version
    - poetry config virtualenvs.create false
    - poetry install -vv
    - nvdia-smi
  script:
    # DVC Stuff
    - dvc pull
    - dvc repro -m
    - dvc push
    # Report metrics
    - echo "## Metrics" >> report.md
    - echo "\`\`\`json" >> report.md
    - cat metrics/best-meta.json >> report.md
    - echo "\`\`\`" >> report.md
    # Report GPU details
    - echo "## GPU info" >> report.md
    - cat gpu_info.txt >> report.md
    # Send comment
    - cml-send-comment report.md

But the container can’t recognize the driver or the GPU; on the nvidia-smi command I get the following error:

/usr/bin/bash: line 133: nvdia-smi: command not found

I realized that iterativeai/cml:0-dvc2-base1-gpu can’t use the instance GPU. How could I install the NVIDIA drivers and nvidia-docker, and enable the --gpus option for this container?

Thank you

Most upvoted comments

Just adding the job log on CI of the deploy_job step: deploy_job.txt

and the train_job step: job_log.txt

I see nvdia-smi at bash line 125? There looks to be a typo in your job?

If nvidia-smi works, these lines won’t run at all.
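
In other words, the driver-install step is guarded by a check along these lines (a rough sketch assuming an Ubuntu AMI, not CML’s actual code):

# Illustrative sketch: only install a driver when nvidia-smi is absent or broken
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
  echo "NVIDIA driver already working; skipping installation"
else
  sudo apt-get update
  sudo ubuntu-drivers autoinstall   # hypothetical install step for Ubuntu images
fi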

I connected to the deployed instance and managed to execute the nvidia-smi command:

Having seen that nvidia-smi works, cml should have set up the runner with the nvidia executor automatically:

https://github.com/iterative/cml/blob/e3382668396674d22390d8cfc3403ef1e67dd8eb/src/drivers/gitlab.js#L204
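
If you want to double-check from the deployed instance itself, you can try handing the GPU to a throwaway container (this assumes the NVIDIA container toolkit is installed on the instance; the CUDA image tag below is only an example):

# If this prints the usual nvidia-smi table, Docker can pass the GPU to containers
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi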

@leoitcode My guess is that it is having trouble parsing the output from cat as an argument with the spaces/dashes, and the like…

I’m thinking there might be a bug in handling that option; this is the first time I’ve seen it used. In the meantime, you can probably use the AWS web console to connect to the instance instead of trying to pass your private key.
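
For illustration, this is the kind of word-splitting I mean (the --cloud-ssh-private-key option name here is from memory; check the options your cml-runner version actually supports):

# Unquoted: the multi-line key gets split on spaces/newlines into many arguments
cml-runner --cloud-ssh-private-key=$(cat key.pem) --cloud aws ...

# Quoted: the whole key is passed as a single argument
cml-runner --cloud-ssh-private-key="$(cat key.pem)" --cloud aws ...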

@dacbd I managed to make it work by adding EOF to my pem file:

<< EOF
-----BEGIN RSA PRIVATE KEY-----
MY PRIVATE KEY HERE
-----END RSA PRIVATE KEY-----
EOF

To be honest I have no idea how this works; I just imagined it could work after looking at what you did here: https://github.com/iterative/terraform-provider-iterative/pull/232#issuecomment-952375277

Maybe there is a more elegant way of doing this 😆
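
For anyone else reading: a heredoc simply feeds the lines between << EOF and the closing EOF to the command’s standard input, so the multi-line key never has to survive word-splitting as a command-line argument. A tiny standalone example, unrelated to CML itself:

# Everything between "<< EOF" and the terminating EOF becomes stdin for cat
cat << EOF
-----BEGIN RSA PRIVATE KEY-----
MY PRIVATE KEY HERE
-----END RSA PRIVATE KEY-----
EOF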

What about this one, @dacbd? Can I somehow contribute to this?

That would be amazing! You could create a PR in TPI (terraform-provider-iterative).

> Just adding the job log on CI of the deploy_job step: deploy_job.txt and the train_job step: job_log.txt
>
> I see nvdia-smi at bash line 125? There looks to be a typo in your job?

O.o’’ @dacbd I thank you so much… I can’t believe we couldn’t see it…

We still can’t make this work. Is there anything else we can try, or any other information, logs, etc. that we can provide?