MachineLearningNotebooks: The `run.input_datasets` dictionary is empty - even after passing into the PythonScriptStep

Pipeline.ipynb

input_dataset = Dataset.get_by_name(ws, name='super_secret_data')

cleanStep = PythonScriptStep(
    script_name = "clean.py",
    inputs = [input_dataset.as_named_input('important_dataset')],
    outputs = [output_data],
    compute_target = cpu_cluster,
    source_directory = experiment_folder
)

clean.py

run = Run.get_context()
print(run.input_datasets)

input_ds = run.input_datasets['important_dataset']
input_df = input_ds.to_pandas_dataframe()

When the pipeline is run, the log for the clean.py step shows the run.input_datasets object is an empty dict and therefore the script fails with a KeyError.
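Until the underlying cause is found, a small guarded lookup can turn the bare KeyError into an actionable message. This is a plain-Python sketch (no Azure ML dependency; the helper name is my own, not part of the SDK):

```python
def get_input_dataset(input_datasets: dict, name: str):
    """Look up a named input, failing with a diagnostic message
    instead of a bare KeyError when the dict is empty."""
    if name not in input_datasets:
        raise RuntimeError(
            f"Input dataset {name!r} not found; available: {sorted(input_datasets)}. "
            "Check that the step received inputs=[...as_named_input(name)] and that "
            "the run environment includes the dataset dependencies."
        )
    return input_datasets[name]
```

In clean.py this would be called as `get_input_dataset(run.input_datasets, 'important_dataset')`.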

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 29 (8 by maintainers)

Most upvoted comments

@MayMSFT Ok, thanks!

Perhaps it would be good to add a note to the run.input_datasets documentation saying that the input_datasets attribute can remain empty?

The first thing I tried was to use this information to register the dataset to the model.

I’m also having this issue (Azure ML SDK Version: 1.6.0)

  • No errors in 70_driver_log.txt
  • On ml.azure.com the dataset is listed
  • run.get_details()['inputDatasets'] shows the datasets that I gave as inputs
  • run.input_datasets is {}
  • run.register_model() registers the model without reference to the input datasets.

The above happens regardless of whether I run locally or on a compute instance in Azure.
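As observed above, run.get_details()['inputDatasets'] still lists the inputs even when run.input_datasets is empty, so the names can be recovered from the details dict as a fallback. A sketch of that extraction (the exact entry shape is an assumption inferred from the behavior described here, not confirmed SDK documentation):

```python
def input_dataset_names(details: dict) -> list:
    """Pull input dataset names out of a run-details dict.
    Entry shape ({'dataset': {'name': ...}}) is an assumption."""
    return [
        entry.get('dataset', {}).get('name', '<unnamed>')
        for entry in details.get('inputDatasets', [])
    ]
```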

@MayMSFT apologies if this is not the correct place to raise this issue.

thanks Anders. This was caused by a bug in our code. For dataset.as_named_input(), passing a string containing capital letters causes the error. We will fix it in the Feb 17 release. The current workaround is to use lowercase letters only.
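The workaround above can be applied mechanically before calling as_named_input(). A plain-Python helper sketch (the helper is hypothetical, and the alphanumeric/underscore constraint is an assumption about valid input names):

```python
def safe_input_name(name: str) -> str:
    """Lowercase a dataset input name to avoid the capital-letter
    bug in as_named_input() on affected SDK versions."""
    safe = name.lower()
    # Assumed constraint: only letters, digits, and underscores.
    if not safe.replace('_', '').isalnum():
        raise ValueError(f"invalid dataset input name: {name!r}")
    return safe
```

Then pass `input_dataset.as_named_input(safe_input_name('Important_Dataset'))` instead of the capitalized name.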

@ezwiefel Based on the driver log, it looks like the code that was supposed to set up input_datasets is not run. Can you please paste the code that shows how you set up the conda dependencies? I don’t see you passing in a run configuration, which is where you would specify the conda dependencies, to the PythonScriptStep.
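For reference, a hedged sketch of attaching a run configuration with conda dependencies to the step (API names assumed from azureml-core 1.x; input_dataset, output_data, cpu_cluster, and experiment_folder are reused from the snippet at the top of this issue):

```python
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.pipeline.steps import PythonScriptStep

# Environment for the step, including the dataset dependencies
run_config = RunConfiguration()
run_config.environment.python.conda_dependencies = CondaDependencies.create(
    pip_packages=['azureml-sdk', 'azureml-dataprep[pandas,fuse]>=1.1.14'])

cleanStep = PythonScriptStep(
    script_name="clean.py",
    inputs=[input_dataset.as_named_input('important_dataset')],
    outputs=[output_data],
    compute_target=cpu_cluster,
    source_directory=experiment_folder,
    runconfig=run_config)
```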

Hi, could you please look into the code in the comments above? I have added the dependencies mentioned in https://github.com/Azure/MachineLearningNotebooks/issues/707#issuecomment-567585408.

Hi,

I am facing the same issue. I am using a TabularDataset, and I installed the dependencies below:

env = Environment('my_env')
cd = CondaDependencies.create(pip_packages=[
    'tensorflow==1.12.0', 'keras==2.2.4', 'azureml-sdk', 'azureml-defaults',
    'matplotlib', 'scikit-learn', 'azureml-dataprep[pandas,fuse]>=1.1.14'])
env.python.conda_dependencies = cd


est = TensorFlow(source_directory=script_folder,
                 script_params=script_params,
                 compute_target=compute_target,
                 inputs=[ds.as_named_input('my_data')],
                 entry_script='keras_lstm.py',
                 environment_definition=env)

Script: dataset = run.input_datasets["my_data"]

Error:

return super().__getitem__(key)
KeyError: 'my_data'

Could someone please share a solution, if there is one?

Thanks, SJ