azure-cli: azure-cli v2 - SweepJob fails when input is a registered dataset
**Describe the bug**

When running a SweepJob (or any other job type, e.g. a commandJob) whose inputs reference a registered dataset within the same workspace, the job gets submitted by the CLI, but in the UI I get the following error:
```
{'code': data-capability.data-capability.FailedToInitializeCapabilitySession.start.UserErrorException, 'message': , 'target': , 'category': UserError, 'error_details': [{'key': NonCompliantReason, 'value': UserErrorException:
    Message: only eval_mount or eval_download modes are supported for v1 legacy dataset for mltable
    InnerException None
    ErrorResponse
    {
        "error": {
            "code": "UserError",
            "message": "only eval_mount or eval_download modes are supported for v1 legacy dataset for mltable"
        }
    }}, {'key': StackTrace, 'value': File "/opt/miniconda/envs/data-capability/lib/python3.7/site-packages/data_capability/capability_session.py", line 128, in from_cr_config
```
Furthermore, the dataset is not recognized as an input in the job summary in the UI. I assume the schema that is valid for the CLI does not match the one used in the workspace? The registered dataset is a file dataset.
Here is the yaml file I am using:
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
display_name: credit-data-hyper-parameter-sweep-job
experiment_name: credit-data-hyper-parameter-sweep-job
description: Run a hyperparameter sweep job for Sklearn on german credit dataset
sampling_algorithm: random
type: sweep
inputs:
  credit:
    path: azureml:german-credit-dataset-via-jupyter:1
    type: uri_folder
search_space:
  n_estimators:
    type: choice
    values: [50, 100, 200, 300]
  criterion:
    type: choice
    values: [gini, entropy]
  class_weight:
    type: choice
    values: [None, balanced, balanced_subsample]
objective:
  primary_metric: f1-score-test
  goal: maximize
# invoke completions (Ctrl+Space, Cmd+Space) to see the list of compute targets available
compute: azureml:cpu-cluster3
trial:
  code: ./src
  command: >-
    python train.py
    --data-path ${{inputs.credit}}
    --data-name german_credit_data.csv
    --n_estimators ${{search_space.n_estimators}}
    --criterion ${{search_space.criterion}}
    --class_weight ${{search_space.class_weight}}
  # invoke completions (Ctrl+Space, Cmd+Space) to see the list of environments available
  environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:15
limits:
  max_total_trials: 1
  max_concurrent_trials: 1
  timeout: 10000
```
The command to submit the job is:

```
az ml job create --subscription <MASKED> --resource-group <MASKED> --workspace-name <MASKED> --file path/to/file/hyperparameter_tuning.yml --stream
```
**To Reproduce**

Try to use any registered dataset within a sweepJob configuration.
**Expected behavior**

I expect that HyperDrive detects the input dataset correctly.
**Environment summary**

`az version`:

```json
{
  "azure-cli": "2.35.0",
  "azure-cli-core": "2.35.0",
  "azure-cli-telemetry": "1.0.6",
  "extensions": {
    "ml": "2.2.3"
  }
}
```
**Additional context**

I have the same issue with commandJobs. While the job failed when submitted via the CLI, I was able to get it up and running by clicking "Edit and submit" in the UI and adding the input dataset manually there. The final yml of this working solution used the old inputs syntax, where datasets were referenced as `datasets: azureml:` rather than `path: azureml:`. But this cannot be used when submitting a job via the CLI, since it is invalid against the schema.
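For illustration, the UI-generated input had roughly this shape (a sketch built from the keys quoted above; the exact legacy schema is not shown in this issue):

```yaml
inputs:
  credit:
    datasets: azureml:german-credit-dataset-via-jupyter:1  # legacy key; rejected by the CLI v2 job schema
```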
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 21 (9 by maintainers)
Before the March Public Preview release (v2.2.1), registered dataset assets in V2 - `uri_folder` and `uri_file` - were in fact represented as a V1 `FileDataset` asset. In the March CLI release (v2.2.1):

- The `az ml dataset` subgroup is deprecated; please use `az ml data` instead.
- `uri_folder` and `uri_file` are now first-class data V2 entities, and they no longer use the V1 `FileDataset` entity under the covers.
- Registered V1 dataset assets (`FileDataset` and `TabularDataset`) are cast to an `mltable`. Please see the context section below for more details.

These changes introduced a breaking change, and existing jobs consuming registered dataset assets will error with the message:
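```
only eval_mount or eval_download modes are supported for v1 legacy dataset for mltable
```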
To mitigate this breaking change for registered assets, you have two options articulated below.
**Option 1: Re-create dataset assets as data assets (preferred method)**
The yaml definition of your data asset will be unchanged.
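For example, a sketch using the dataset name from this issue (the `path` value is a placeholder assumption):

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: german-credit-dataset-via-jupyter
version: 2          # assumed: bump the version so it does not clash with the V1 asset
type: uri_folder    # or uri_file for a single file
path: azureml://datastores/workspaceblobstore/paths/credit-data/  # placeholder path
```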
Re-create the data asset using the new `az ml data` subgroup command.
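A minimal sketch, assuming the definition above is saved as `data_asset.yml`:

```bash
az ml data create --file data_asset.yml
```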
The registered data asset is now a bona fide `uri_file`/`uri_folder` asset rather than a V1 `FileDataset`.

**Option 2: Update job yaml file**

Update the `inputs` section of your job yaml from:
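```yaml
inputs:
  credit:
    type: uri_folder                                   # matches the inputs block from this issue's job yaml
    path: azureml:german-credit-dataset-via-jupyter:1
```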
to:
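```yaml
inputs:
  credit:
    type: mltable                                      # sketch: registered V1 datasets are cast to mltable
    path: azureml:german-credit-dataset-via-jupyter:1
    mode: eval_mount                                   # eval_mount or eval_download, per the error message
```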
The section below provides more context on why V1 registered assets are mapped to a new type called `mltable`.

**Context**

Prior to CLI v2.2.1, registered `uri_folder` and `uri_file` data assets in V2 were actually typed to a V1 `FileDataset` asset. In V1 of Azure Machine Learning, `FileDataset` and `TabularDataset` could make use of an accompanying data prep engine to do various data loading transformations on the data - for example, take a sample of files/records, filter files/records, etc. From CLI v2.2.1+, both `uri_folder` and `uri_file` are first-class asset types and there is no longer a dependency on the V1 `FileDataset`; these types simply map cloud storage to your compute nodes (via mount or download) and do not provide any data loading transforms such as sample or filter.

To provide data loading transforms for both files and tables of data, we are introducing a new type in V2 called `mltable`, which maintains the data prep engine. Given that all registered V1 data assets (`FileDataset` and `TabularDataset`) had a dependency on the data prep engine, we cast them to an `mltable` so they can continue to leverage the data prep engine.

Whilst backward compatibility is provided (see below), if your intention with your V1 `FileDataset` assets was to have a single path to a file or folder with no loading transforms (sample, take, filter, etc.), then we recommend that you re-create them as a `uri_file`/`uri_folder` using the V2 CLI:
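```bash
# sketch: same command shape as Option 1; data_asset.yml is an assumed filename
az ml data create --file data_asset.yml
```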
You can get backward compatibility with your registered V1 dataset in an Azure ML V2 job by using the following definition in the `inputs` section of your job yaml:
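```yaml
inputs:
  credit:
    type: mltable                                      # sketch reconstructed from the error message
    path: azureml:german-credit-dataset-via-jupyter:1  # name/version from this issue
    mode: eval_mount                                   # or eval_download
```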
My current assumption is that Azure ML CLI v2 introduced new types of datasets (`uri_folder` and `uri_file`) which can currently only be created with the new CLI v2, and thus sweepJob or commandJob are incompatible with "legacy" datasets uploaded via the UI or Python SDK.
Ok thanks @Man-MSFT. So if I have a CSV/Parquet file, I cannot mount it as long as it’s a TabularDataset. I have to recreate it as a FileDataset and it will then be mountable?
Ok thanks for the info @diondrapeck. May I ask how I can mount TabularDatasets (SQL Database) in the yaml file then?
I can confirm the backward compatibility with a registered V1 dataset in an Azure ML V2 job (ml extension version 2.3.1) using:
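```yaml
# presumably the mltable definition from the answer above (sketch)
inputs:
  credit:
    type: mltable
    path: azureml:german-credit-dataset-via-jupyter:1
    mode: eval_mount
```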
It would be great if you could reflect that in your documentation.
At least for me, this is my preferred solution. While I can understand that Option 1 (re-create dataset assets as data assets) is the preferred method, it seems like overkill when you have 200 datasets across your workspaces. I hope that you will auto-convert legacy V1 datasets to V2 datasets once CLI v2 gets out of preview.
Thank you for your support. Issue can be closed.
@jackphillipsjmu and @MarkusDressel We are still investigating this issue. Apologies for the delay!