cml: Can't use AWS instance GPU on GitLab CI with cml-runner

I have this .gitlab-ci.yml:

stages:
  - test
  - deploy
  - train

sast:
  stage: test
include:
- template: Security/SAST.gitlab-ci.yml

deploy_job:
  stage: deploy
  when: always
  image: iterativeai/cml:0-dvc2-base1
  script:
    - cml-runner
      --cloud aws
      --cloud-region us-east-1
      --cloud-type g3.4xlarge
      --cloud-hdd-size 64
      --cloud-aws-security-group="cml-runners-sg"
      --labels=cml-runner-gpu
      --idle-timeout=120

train_job:
  stage: train
  when: on_success
  image: iterativeai/cml:0-dvc2-base1-gpu
  tags:
    - cml-runner-gpu
  before_script:
    - pip install poetry
    - poetry --version
    - poetry config virtualenvs.create false
    - poetry install -vv
    - nvdia-smi
  script:
    # DVC Stuff
    - dvc pull
    - dvc repro -m
    - dvc push
    # Report metrics
    - echo "## Metrics" >> report.md
    - echo "\`\`\`json" >> report.md
    - cat metrics/best-meta.json >> report.md
    - echo "\`\`\`" >> report.md
    # Report GPU details
    - echo "## GPU info" >> report.md
    - cat gpu_info.txt >> report.md
    # Send comment
    - cml-send-comment report.md

But the container can’t recognize the driver or the GPU; on the nvidia-smi command I get the following error:

/usr/bin/bash: line 133: nvdia-smi: command not found

I realized that iterativeai/cml:0-dvc2-base1-gpu can’t use the instance GPU. How could I install the NVIDIA drivers and nvidia-docker, and enable the --gpus option for this container?

Thank you

Most upvoted comments

Just adding the job log on CI of the deploy_job step: deploy_job.txt

and the train_job step: job_log.txt

I see nvdia-smi at bash line 125? There looks to be a typo in your job?

If nvidia-smi works, these lines won’t run at all.
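
In other words, the driver-install step is guarded by a check along these lines (a rough sketch assuming an Ubuntu AMI, not CML’s actual code):

# Illustrative sketch: only install a driver when nvidia-smi is absent or broken
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
  echo "NVIDIA driver already working; skipping installation"
else
  sudo apt-get update
  sudo ubuntu-drivers autoinstall   # hypothetical install step for Ubuntu images
fi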

I connected to the deployed instance and managed to execute the nvidia-smi command:

Having seen that nvidia-smi works, cml should have set up the runner with the nvidia executor automatically:

https://github.com/iterative/cml/blob/e3382668396674d22390d8cfc3403ef1e67dd8eb/src/drivers/gitlab.js#L204
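
If you want to double-check from the deployed instance itself, you can try handing the GPU to a throwaway container (this assumes the NVIDIA container toolkit is installed on the instance; the CUDA image tag below is only an example):

# If this prints the usual nvidia-smi table, Docker can pass the GPU to containers
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi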

@leoitcode My guess is that it is having trouble parsing the output from cat as an argument with the spaces/dashes, and the like…

I’m thinking there might be a bug in handling that option; this is the first time I’ve seen it used. In the meantime, you can probably use the AWS web console to connect to the instance instead of trying to pass your private key.
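
For illustration, this is the kind of word-splitting I mean (the --cloud-ssh-private-key option name here is from memory; check the options your cml-runner version actually supports):

# Unquoted: the multi-line key gets split on spaces/newlines into many arguments
cml-runner --cloud-ssh-private-key=$(cat key.pem) --cloud aws ...

# Quoted: the whole key is passed as a single argument
cml-runner --cloud-ssh-private-key="$(cat key.pem)" --cloud aws ...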

@dacbd I managed to make it work by adding EOF to my pem file:

<< EOF
-----BEGIN RSA PRIVATE KEY-----
MY PRIVATE KEY HERE
-----END RSA PRIVATE KEY-----
EOF

To be honest I have no idea how this works; I just imagined it could work after looking at what you did here: https://github.com/iterative/terraform-provider-iterative/pull/232#issuecomment-952375277

Maybe there is a more elegant way of doing this 😆
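
For anyone else reading: a heredoc simply feeds the lines between << EOF and the closing EOF to the command’s standard input, so the multi-line key never has to survive word-splitting as a command-line argument. A tiny standalone example, unrelated to CML itself:

# Everything between "<< EOF" and the terminating EOF becomes stdin for cat
cat << EOF
-----BEGIN RSA PRIVATE KEY-----
MY PRIVATE KEY HERE
-----END RSA PRIVATE KEY-----
EOF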

What about this one, @dacbd? Can I somehow contribute to this?

That would be amazing! You could create a PR in TPI (terraform-provider-iterative).

> Just adding the job log on CI of the deploy_job step: deploy_job.txt and the train_job step: job_log.txt
>
> I see nvdia-smi at bash line 125? There looks to be a typo in your job?

O.o’’ @dacbd I thank you so much… I can’t believe we couldn’t see it…

We still can’t make this work. Is there anything else we can try, or any other information, logs, etc. that we can provide?