azure-cli: azure-cli v2 - SweepJob fails when input is a registered dataset
**Describe the bug**

When running a SweepJob (or any other job type, e.g. a commandJob) whose inputs reference a registered dataset within the same workspace, the job gets submitted by the CLI, but in the UI I get the following error:
```
{'code': data-capability.data-capability.FailedToInitializeCapabilitySession.start.UserErrorException, 'message': , 'target': , 'category': UserError, 'error_details': [{'key': NonCompliantReason, 'value': UserErrorException:
    Message: only eval_mount or eval_download modes are supported for v1 legacy dataset for mltable
    InnerException None
    ErrorResponse
    {
        "error": {
            "code": "UserError",
            "message": "only eval_mount or eval_download modes are supported for v1 legacy dataset for mltable"
        }
    }}, {'key': StackTrace, 'value': File "/opt/miniconda/envs/data-capability/lib/python3.7/site-packages/data_capability/capability_session.py", line 128, in from_cr_config
```
Furthermore, the dataset is not recognized as an input in the job summary in the UI. I assume the schema that is valid for the CLI does not match the one used in the workspace? The registered dataset is a file dataset.
Here is the yaml file I am using:
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
display_name: credit-data-hyper-parameter-sweep-job
experiment_name: credit-data-hyper-parameter-sweep-job
description: Run a hyperparameter sweep job for Sklearn on german credit dataset
sampling_algorithm: random
type: sweep
inputs:
  credit:
    path: azureml:german-credit-dataset-via-jupyter:1
    type: uri_folder
search_space:
  n_estimators:
    type: choice
    values: [50, 100, 200, 300]
  criterion:
    type: choice
    values: [gini, entropy]
  class_weight:
    type: choice
    values: [None, balanced, balanced_subsample]
objective:
  primary_metric: f1-score-test
  goal: maximize
# invoke completions (Ctrl+Space, Cmd+Space) to see the list of compute targets available
compute: azureml:cpu-cluster3
trial:
  code: ./src
  command: >-
    python train.py
    --data-path ${{inputs.credit}}
    --data-name german_credit_data.csv
    --n_estimators ${{search_space.n_estimators}}
    --criterion ${{search_space.criterion}}
    --class_weight ${{search_space.class_weight}}
  # invoke completions (Ctrl+Space, Cmd+Space) to see the list of environments available
  environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:15
limits:
  max_total_trials: 1
  max_concurrent_trials: 1
  timeout: 10000
```
The command to submit the job is:

```
az ml job create --subscription <MASKED> --resource-group <MASKED> --workspace-name <MASKED> --file path/to/file/hyperparameter_tuning.yml --stream
```
**To Reproduce**

Try to use any registered dataset within a sweepJob configuration.
**Expected behavior**

I expect that HyperDrive detects the input dataset correctly.
**Environment summary**

`az version`:

```json
{
  "azure-cli": "2.35.0",
  "azure-cli-core": "2.35.0",
  "azure-cli-telemetry": "1.0.6",
  "extensions": {
    "ml": "2.2.3"
  }
}
```
**Additional context**

I have the same issue with commandJobs. While the job failed when submitted via the CLI, I was able to get it up and running by clicking "Edit and submit" in the UI and adding the input dataset manually there. The final yml of this working solution used the old inputs syntax, where datasets were referenced as `datasets: azureml:` rather than `path: azureml:`. But this cannot be used when submitting a job via the CLI, since it is invalid against the schema.
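For illustration, the UI-generated input had roughly this shape (a sketch built from the keys quoted above; the exact legacy schema is not shown in this issue):

```yaml
inputs:
  credit:
    datasets: azureml:german-credit-dataset-via-jupyter:1  # legacy key; rejected by the CLI v2 job schema
```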
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 21 (9 by maintainers)
Before the March Public Preview release (v2.2.1), registered dataset assets in V2 - `uri_folder` and `uri_file` - were in fact represented as a V1 `FileDataset` asset. In the March CLI release (v2.2.1):

- The `az ml dataset` subgroup is deprecated; please use `az ml data` instead.
- `uri_folder` and `uri_file` are now first-class data V2 entities, and they no longer use the V1 `FileDataset` entity under the covers.
- Registered V1 dataset assets (`FileDataset` and `TabularDataset`) are cast to an `mltable`. Please see the context section below for more details.

These changes introduced a breaking change, and existing jobs consuming registered dataset assets will error with the message:
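```
only eval_mount or eval_download modes are supported for v1 legacy dataset for mltable
```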
To mitigate this breaking change for registered assets, you have two options articulated below.
**Option 1: Re-create dataset assets as data assets (preferred method)**
The yaml definition of your data asset will be unchanged.
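For example, a sketch using the dataset name from this issue (the `path` value is a placeholder assumption):

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: german-credit-dataset-via-jupyter
version: 2          # assumed: bump the version so it does not clash with the V1 asset
type: uri_folder    # or uri_file for a single file
path: azureml://datastores/workspaceblobstore/paths/credit-data/  # placeholder path
```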
Re-create the data asset using the new `az ml data` subgroup command.
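A minimal sketch, assuming the definition above is saved as `data_asset.yml`:

```bash
az ml data create --file data_asset.yml
```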
The registered data asset is now a bona fide `uri_file`/`uri_folder` asset rather than a V1 `FileDataset`.

**Option 2: Update job yaml file**

Update the `inputs` section of your job yaml from:
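```yaml
inputs:
  credit:
    type: uri_folder                                   # matches the inputs block from this issue's job yaml
    path: azureml:german-credit-dataset-via-jupyter:1
```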
to:
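```yaml
inputs:
  credit:
    type: mltable                                      # sketch: registered V1 datasets are cast to mltable
    path: azureml:german-credit-dataset-via-jupyter:1
    mode: eval_mount                                   # eval_mount or eval_download, per the error message
```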
The section below provides more context on why V1 registered assets are mapped to a new type called `mltable`.

**Context**

Prior to CLI v2.2.1, registered `uri_folder` and `uri_file` data assets in V2 were actually typed to a V1 `FileDataset` asset. In V1 of Azure Machine Learning, `FileDataset` and `TabularDataset` could make use of an accompanying data prep engine to do various data loading transformations on the data - for example, take a sample of files/records, filter files/records, etc. From CLI v2.2.1+, both `uri_folder` and `uri_file` are first-class asset types and there is no longer a dependency on the V1 `FileDataset`; these types simply map cloud storage to your compute nodes (via mount or download) and do not provide any data loading transforms such as sample or filter.

To provide data loading transforms for both files and tables of data, we are introducing a new type in V2 called `mltable`, which maintains the data prep engine. Given that all registered V1 data assets (`FileDataset` and `TabularDataset`) had a dependency on the data prep engine, we cast them to an `mltable` so they can continue to leverage the data prep engine.

Whilst backward compatibility is provided (see below), if your intention with your V1 `FileDataset` assets was to have a single path to a file or folder with no loading transforms (sample, take, filter, etc.), then we recommend that you re-create them as a `uri_file`/`uri_folder` using the V2 CLI:
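```bash
# sketch: same command shape as Option 1; data_asset.yml is an assumed filename
az ml data create --file data_asset.yml
```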
You can get backward compatibility with your registered V1 dataset in an Azure ML V2 job by using the following definition in the `inputs` section of your job yaml:
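```yaml
inputs:
  credit:
    type: mltable                                      # sketch reconstructed from the error message
    path: azureml:german-credit-dataset-via-jupyter:1  # name/version from this issue
    mode: eval_mount                                   # or eval_download
```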
My current assumption is that Azure ML CLI v2 introduced new types of datasets (`uri_folder` and `uri_file`) which can currently only be created with the new CLI v2, and thus sweepJob or commandJob are incompatible with "legacy" datasets uploaded via the UI or Python SDK.
Ok thanks @Man-MSFT. So if I have a CSV/Parquet file, I cannot mount it as long as it’s a TabularDataset. I have to recreate it as a FileDataset and it will then be mountable?
Ok thanks for the info @diondrapeck. May I ask how I can mount TabularDatasets (SQL Database) in the yaml file then?
I can confirm the backward compatibility with a registered V1 dataset in an Azure ML V2 job (ml extension version 2.3.1) using:
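```yaml
# presumably the mltable definition from the answer above (sketch)
inputs:
  credit:
    type: mltable
    path: azureml:german-credit-dataset-via-jupyter:1
    mode: eval_mount
```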
It would be great if you could reflect that in your documentation.
At least for me, this is my preferred solution. While I can understand that Option 1 (re-create dataset assets as data assets) is the preferred method, it seems like overkill when you have 200 datasets across your workspaces. I hope that you will auto-convert legacy V1 datasets to V2 datasets once CLI v2 gets out of preview.
Thank you for your support. Issue can be closed.
@jackphillipsjmu and @MarkusDressel We are still investigating this issue. Apologies for the delay!