cml: cml runner fails to provision VM on Azure

I used the following as a simple “hello world” for cml runner on GitLab Community Edition [14.10.1]:

deploy-runner:
  image: iterativeai/cml:latest
  script:
    - |
      cml runner \
          --cloud=azure \
          --cloud-region=eu-west \
          --cloud-type=s \
          --cloud-spot \
          --labels=cml-vm

train-model:
  needs: [deploy-runner]
  tags:
    - cml-vm
  image: ubuntu:latest
  script:
    - echo "hello"

I set up

Unfortunately this results in:

$ cml runner \ # collapsed multi-line command
{"level":"info","message":"Preparing workdir /home/runner..."}
{"level":"info","message":"Deploying cloud runner plan..."}
{"level":"info","message":"Terraform apply..."}
{"level":"error","message":"terraform -chdir='/home/runner' apply -auto-approve\n\t\nTerraform used the selected providers to generate the following execution\nplan. Resource actions are indicated with the following symbols:\n  + create\n\nTerraform will perform the following actions:\n\n  # iterative_cml_runner.runner will be created\n  + resource \"iterative_cml_runner\" \"runner\" {\n      + cloud                = \"azure\"\n      + cml_version          = \"0.15.2\"\n      + docker_volumes       = []\n      + driver               = \"gitlab\"\n      + id                   = (known after apply)\n      + idle_timeout         = 300\n      + instance_hdd_size    = 35\n      + instance_ip          = (known after apply)\n      + instance_launch_time = (known after apply)\n      + instance_type        = \"s\"\n      + labels               = \"cml-vm\"\n      + name                 = \"cml-elde6fnyv0\"\n      + region               = \"eu-west\"\n      + repo                 = \"https://git.eon-cds.de/F18771/cml-runner-blobfuse\"\n      + single               = false\n      + spot                 = true\n      + spot_price           = -1\n      + ssh_public           = (known after apply)\n      + token                = (sensitive value)\n    }\n\nPlan: 1 to add, 0 to change, 0 to destroy.\niterative_cml_runner.runner: Creating...\niterative_cml_runner.runner: Still creating... [10s elapsed]\niterative_cml_runner.runner: Still creating... [20s elapsed]\niterative_cml_runner.runner: Still creating... [30s elapsed]\niterative_cml_runner.runner: Still creating... [40s elapsed]\niterative_cml_runner.runner: Still creating... [50s elapsed]\niterative_cml_runner.runner: Still creating... [1m0s elapsed]\niterative_cml_runner.runner: Still creating... [1m10s elapsed]\niterative_cml_runner.runner: Still creating... [1m20s elapsed]\niterative_cml_runner.runner: Still creating... [1m30s elapsed]\niterative_cml_runner.runner: Still creating... 
[1m40s elapsed]\niterative_cml_runner.runner: Still creating... [1m50s elapsed]\niterative_cml_runner.runner: Still creating... [2m0s elapsed]\niterative_cml_runner.runner: Still creating... [2m10s elapsed]\niterative_cml_runner.runner: Still creating... [2m20s elapsed]\niterative_cml_runner.runner: Still creating... [2m30s elapsed]\niterative_cml_runner.runner: Still creating... [2m40s elapsed]\niterative_cml_runner.runner: Still creating... [2m50s elapsed]\niterative_cml_runner.runner: Still creating... [3m0s elapsed]\niterative_cml_runner.runner: Still creating... [3m10s elapsed]\niterative_cml_runner.runner: Still creating... [3m20s elapsed]\niterative_cml_runner.runner: Still creating... [3m30s elapsed]\niterative_cml_runner.runner: Still creating... [3m40s elapsed]\n\n\t╷\n│ Error: Failed creating the machine: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/358ccc6d-ab83-4b18-a484-e992f284b7cc/resourcegroups/iterative-37d31qzqeb13b?api-version=2020-06-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. 
Response body: <?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n│ <!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n│ \t\t \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n│ <html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n│  <head>\n│   <title>404 - Not Found</title>\n│  </head>\n│  <body>\n│   <h1>404 - Not Found</h1>\n│  </body>\n│ </html>\n│  Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=[MASKED]&resource=https%3A%2F%2Fmanagement.azure.com%2F\n│ \n│   with iterative_cml_runner.runner,\n│   on main.tf line 8, in resource \"iterative_cml_runner\" \"runner\":\n│    8: resource \"iterative_cml_runner\" \"runner\" {\n│ \n╵\n","stack":"Error: terraform -chdir='/home/runner' apply -auto-approve\n\t\nTerraform used the selected providers to generate the following execution\nplan. Resource actions are indicated with the following symbols:\n  + create\n\nTerraform will perform the following actions:\n\n  # iterative_cml_runner.runner will be created\n  + resource \"iterative_cml_runner\" \"runner\" {\n      + cloud                = \"azure\"\n      + cml_version          = \"0.15.2\"\n      + docker_volumes       = []\n      + driver               = \"gitlab\"\n      + id                   = (known after apply)\n      + idle_timeout         = 300\n      + instance_hdd_size    = 35\n      + instance_ip          = (known after apply)\n      + instance_launch_time = (known after apply)\n      + instance_type        = \"s\"\n      + labels               = \"cml-vm\"\n      + name                 = \"cml-elde6fnyv0\"\n      + region               = \"eu-west\"\n      + repo                 = \"https://git.eon-cds.de/F18771/cml-runner-blobfuse\"\n      + single               = false\n      + spot                 = true\n      + spot_price           = -1\n      + ssh_public           = (known after apply)\n      + token                = (sensitive value)\n    }\n\nPlan: 1 to 
add, 0 to change, 0 to destroy.\niterative_cml_runner.runner: Creating...\niterative_cml_runner.runner: Still creating... [10s elapsed]\niterative_cml_runner.runner: Still creating... [20s elapsed]\niterative_cml_runner.runner: Still creating... [30s elapsed]\niterative_cml_runner.runner: Still creating... [40s elapsed]\niterative_cml_runner.runner: Still creating... [50s elapsed]\niterative_cml_runner.runner: Still creating... [1m0s elapsed]\niterative_cml_runner.runner: Still creating... [1m10s elapsed]\niterative_cml_runner.runner: Still creating... [1m20s elapsed]\niterative_cml_runner.runner: Still creating... [1m30s elapsed]\niterative_cml_runner.runner: Still creating... [1m40s elapsed]\niterative_cml_runner.runner: Still creating... [1m50s elapsed]\niterative_cml_runner.runner: Still creating... [2m0s elapsed]\niterative_cml_runner.runner: Still creating... [2m10s elapsed]\niterative_cml_runner.runner: Still creating... [2m20s elapsed]\niterative_cml_runner.runner: Still creating... [2m30s elapsed]\niterative_cml_runner.runner: Still creating... [2m40s elapsed]\niterative_cml_runner.runner: Still creating... [2m50s elapsed]\niterative_cml_runner.runner: Still creating... [3m0s elapsed]\niterative_cml_runner.runner: Still creating... [3m10s elapsed]\niterative_cml_runner.runner: Still creating... [3m20s elapsed]\niterative_cml_runner.runner: Still creating... [3m30s elapsed]\niterative_cml_runner.runner: Still creating... [3m40s elapsed]\n\n\t╷\n│ Error: Failed creating the machine: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/358ccc6d-ab83-4b18-a484-e992f284b7cc/resourcegroups/iterative-37d31qzqeb13b?api-version=2020-06-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'. 
Response body: <?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n│ <!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n│ \t\t \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n│ <html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n│  <head>\n│   <title>404 - Not Found</title>\n│  </head>\n│  <body>\n│   <h1>404 - Not Found</h1>\n│  </body>\n│ </html>\n│  Endpoint http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=[MASKED]&resource=https%3A%2F%2Fmanagement.azure.com%2F\n│ \n│   with iterative_cml_runner.runner,\n│   on main.tf line 8, in resource \"iterative_cml_runner\" \"runner\":\n│    8: resource \"iterative_cml_runner\" \"runner\" {\n│ \n╵\n\n    at /usr/lib/node_modules/@dvcorg/cml/src/utils.js:20:27\n    at ChildProcess.exithandler (node:child_process:406:5)\n    at ChildProcess.emit (node:events:527:28)\n    at maybeClose (node:internal/child_process:1092:16)\n    at Process.ChildProcess._handle.onexit (node:internal/child_process:302:5)","status":"terminated"}
{"level":"info","message":"waiting 10 seconds before exiting..."}
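
The actual failure is buried inside the "message" field of the JSON record with level "error", behind the Terraform plan and the progress lines. As a rough sketch (the function name terraform_errors and the abbreviated SAMPLE_LOG are illustrative assumptions, not part of cml), the Terraform "Error:" lines can be pulled out of such a log like this:

```python
import json


def terraform_errors(log_text):
    """Return the Terraform 'Error:' lines found in JSON log records."""
    errors = []
    for raw in log_text.splitlines():
        raw = raw.strip()
        if not raw.startswith("{"):
            continue  # skip the echoed shell command and blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip records that were truncated or wrapped
        if record.get("level") != "error":
            continue
        for line in record.get("message", "").splitlines():
            # Terraform frames diagnostics with box characters; strip them.
            text = line.lstrip("│╷╵ \t")
            if text.startswith("Error:"):
                errors.append(text)
    return errors


# Abbreviated sample in the shape of the log above (not the full output).
SAMPLE_LOG = "\n".join([
    '$ cml runner \\ # collapsed multi-line command',
    json.dumps({"level": "info", "message": "Terraform apply..."}),
    json.dumps({
        "level": "error",
        "message": "iterative_cml_runner.runner: Creating...\n\n\t╷\n"
                   "│ Error: Failed creating the machine: "
                   "azure.BearerAuthorizer#WithAuthorization: "
                   "Failed to refresh the Token\n│ \n╵\n",
        "status": "terminated",
    }),
])

for error in terraform_errors(SAMPLE_LOG):
    print(error)
```

This assumes each JSON record sits on a single line, as cml prints it; a log that has been re-wrapped by a viewer would need to be rejoined first.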

My team and I could not work out what the problem is.

Additional info:

  • I also tried cml with the iterativeai/cml:0-dvc2-base1 Docker image
  • I tried Azure-specific values for the instance type and region, without success

Any help would be very much appreciated.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 17 (16 by maintainers)

Most upvoted comments

Is this new option for GCP and AWS only? no Azure?

Yes, still not supported on Azure, but please upvote & consider watching the following issues:

Note that --cloud-permission-set is not related to your issue, though: it’s just to use managed identities inside your workflows.

@iterative/cml, any objections to “wontfix” for now?

Possible solutions

  • Allow cml runner to create and delete resource groups at will, as long as their name matches a pattern[^1]
  • Use a subscription to isolate runners instead of a resource group, as suggested on cml#1019 (comment)
  • Use a fixed deployment (i.e. the officially recommended solution); hard to implement

[^1]: Azure role assignment conditions are still in preview 🙃
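
To make the first option more concrete, here is a minimal sketch of a custom Azure role that only covers the resource group lifecycle. The function name, the role name and the placeholder subscription ID are hypothetical, and the list of actions is an assumption limited to resource groups: it is not the full set of permissions the runner needs for virtual machines, networking and so on. It also does not express the name-pattern restriction, which would depend on the role assignment conditions mentioned in the footnote.

```python
import json


def resource_group_role(subscription_id, role_name="cml-runner-resource-groups"):
    """Build a custom role definition limited to resource group lifecycle actions."""
    return {
        "Name": role_name,  # hypothetical role name
        "IsCustom": True,
        "Description": "Create, read and delete resource groups (sketch).",
        "Actions": [
            "Microsoft.Resources/subscriptions/resourceGroups/read",
            "Microsoft.Resources/subscriptions/resourceGroups/write",
            "Microsoft.Resources/subscriptions/resourceGroups/delete",
        ],
        "NotActions": [],
        # Assignable at subscription scope, because the groups do not exist yet.
        "AssignableScopes": [f"/subscriptions/{subscription_id}"],
    }


# Placeholder subscription ID; replace it with a real one before use.
role = resource_group_role("00000000-0000-0000-0000-000000000000")
print(json.dumps(role, indent=2))
```

The printed JSON follows the layout used for custom role definitions, so it could presumably be saved to a file and passed to az role definition create; treat that step as untested here.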

To recap, @0x2b3bfa0 / @francesco086: the issue is that Azure does not support nested resource groups, and we use a dedicated resource group so that all resources can be cleaned up with a single API call; here, however, the credentials are only valid for a predefined resource group.

@0x2b3bfa0 thank you! Then I guess that is the source of the problem, I think we can close the issue.

For us (@lleiding is a colleague of mine) this may be a bit problematic. In our team we are trying to set things up so that each project has its own resource group. So what @lleiding asked could become a feature request: “Is there any way to customize this behavior (i.e. supply an existing resource group)?”

Closing as per https://github.com/iterative/cml/issues/1019#issuecomment-1139778698; @francesco086, feel free to reopen this issue if you deem it opportune.

In our team we are trying to set things up so that each project has its own resource group.

Do you have the possibility of having a separate subscription for every team instead?

So what @lleiding asked could become a feature request: “Is there any way to customize this behavior (i.e. supply an existing resource group)?”

Feel free to open a follow-up issue, although it’s unlikely that we will implement it anytime soon. The current functionality relies heavily on being able to delete a whole resource group with a single API call. 😅

Is this new option for GCP and AWS only? no Azure?

See this guide and the permissions/az directory in the provider repository for a list of required permissions.

Am I right in assuming that cml tries to create a resource group dedicated to the VM used for the runner? Is there any way to customize this behavior (i.e. supply an existing resource group)? Because I’m not convinced incorrect credentials are the whole story. Rather I suspect that CML tries to create the resource group, but fails (because the SP doesn’t have the necessary role), and then tries to obtain an access token to a resource group (iterative-37d31qzqeb13b in our example) that doesn’t exist (status code is 404 after all).
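
One way to examine this hypothesis is to separate the two URLs that the error message mentions: the management API request for the resource group, and the "Endpoint" that the token refresh was sent to. The sketch below uses an abbreviated copy of the error text from the log in this issue; split_error is a hypothetical helper, and the regular expressions assume the message layout shown in that log.

```python
import re
from urllib.parse import urlsplit

# Abbreviated copy of the error text from the log in this issue.
ERROR_TEXT = (
    "Failed to refresh the Token for request to "
    "https://management.azure.com/subscriptions/358ccc6d-ab83-4b18-a484-e992f284b7cc"
    "/resourcegroups/iterative-37d31qzqeb13b?api-version=2020-06-01: StatusCode=404 "
    "-- Original Error: adal: Refresh request failed. Status Code = '404'. "
    "Endpoint http://169.254.169.254/metadata/identity/oauth2/token"
    "?api-version=2018-02-01&client_id=[MASKED]"
)


def split_error(text):
    """Separate the management API request from the token endpoint in the error."""
    request = re.search(r"request to (\S+?): StatusCode", text)
    endpoint = re.search(r"Endpoint (\S+)", text)
    group = re.search(r"/resourcegroups/([^/?]+)", text, re.IGNORECASE)
    return {
        "request_host": urlsplit(request.group(1)).hostname if request else None,
        "resource_group": group.group(1) if group else None,
        "token_endpoint_host": urlsplit(endpoint.group(1)).hostname if endpoint else None,
    }


result = split_error(ERROR_TEXT)
for key, value in result.items():
    print(f"{key}: {value}")
```

The message names 169.254.169.254, the instance metadata address used for managed identities, as the token endpoint, so the 404 may come from the token refresh rather than from the resource group lookup itself. The sketch only splits the fields; it does not settle which of the two explanations is correct.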