azure-sdk-for-python: [Azure ML SDK v2] File is not written to output azureml datastore

  • Package Name: azure.ai.ml
  • Package Version: latest in Azure ML Notebooks (Standard)
  • Operating System: Azure ML Notebooks (Standard)
  • Python Version: Azure ML Notebooks (Standard)

Describe the bug

The Azure ML datastore tfconfigs has multiple files in its base path.

For a pipeline job, the Azure ML datastore tfconfigs is defined as an output to write data to:

update_config_component = command(
    name="tf_config_update",
    display_name="Tensorflow configuration file update",
    description="Reads the pipeline configuration file from a specific model (directory), updates it with the params, and saves the new pipeline config file to the output directory",
    inputs=dict(
        config_dir=Input(type="uri_folder"),
        config_directory_name=Input(type="string"),
        images_dir=Input(type="uri_folder"),
        labelmap_path=Input(type="string"),
        fine_tune_checkpoint_type=Input(type="string"),
        fine_tune_checkpoint=Input(type="string"),
        train_record_path=Input(type="string"),
        test_record_path=Input(type="string"),
        num_classes=Input(type="integer"),
        batch_size=Input(type="integer"),
        num_steps=Input(type="integer"),
    ),
    outputs={
        "config_directory_output": Output(
            type=AssetTypes.URI_FOLDER,
            # Note: the "workspaces" URI segment takes the workspace name,
            # not the resource group name.
            path=f"azureml://subscriptions/{ml_client.subscription_id}/resourcegroups/{ml_client.resource_group_name}/workspaces/{ml_client.workspace_name}/datastores/tfconfigs/paths/",
        )
    },
    # The source folder of the component
    code=update_config_src_dir,
    command="""pwd && ls -la ${{outputs.config_directory_output}} && python update.py \
               --config_dir ${{inputs.config_dir}} \
               --config_directory_name ${{inputs.config_directory_name}} \
               --config_directory_output ${{outputs.config_directory_output}} \
               --images_dir ${{inputs.images_dir}} \
               --labelmap_path ${{inputs.labelmap_path}} \
               --fine_tune_checkpoint_type ${{inputs.fine_tune_checkpoint_type}} \
               --fine_tune_checkpoint ${{inputs.fine_tune_checkpoint}} \
               --train_record_path ${{inputs.train_record_path}} \
               --test_record_path ${{inputs.test_record_path}} \
               --num_classes ${{inputs.num_classes}} \
               --batch_size ${{inputs.batch_size}} \
               --num_steps ${{inputs.num_steps}} \
            """,
    environment="azureml://registries/azureml/environments/AzureML-minimal-ubuntu18.04-py37-cpu-inference/versions/43",
)
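The long-form datastore URI packs several identifiers into one f-string and is easy to get wrong. As a sanity check, it can be assembled with a small helper; datastore_uri is a hypothetical function for illustration, not part of the SDK:

```python
def datastore_uri(subscription_id: str, resource_group: str,
                  workspace: str, datastore: str, path: str = "") -> str:
    # Build the long-form azureml:// URI pointing at a path on a
    # workspace datastore. Note the segment order: subscription,
    # resource group, workspace, datastore, then the blob path.
    return (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}/paths/{path}"
    )

print(datastore_uri("0000-1111", "my-rg", "my-ws", "tfconfigs"))
```

Building the URI in one place makes it harder to accidentally reuse the wrong attribute for a segment.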

The output config_directory_output is mounted on the compute during execution at the following path:

/mnt/azureml/cr/j/6a153baacc664cada4060f0b95adbf0e/cap/data-capability/wd/config_directory_output

At the beginning of the Python script the output directory is listed as follows:

import os

print("Listing path / dir: ", args.config_directory_output)
arr = os.listdir(args.config_directory_output)
print(arr)

The directory does not contain any files:

Listing path / dir:  /mnt/azureml/cr/j/6a153baacc664cada4060f0b95adbf0e/cap/data-capability/wd/config_directory_output
[]

BUG: The Azure ML datastore tfconfigs mounted as an output already contains multiple manually uploaded files, yet the mounted directory appears empty.

At the end of the Python script a config file is written to the mounted output, and the directory is listed again as follows:

import re

with open(pipeline_config_path, "r") as f:
    config = f.read()

with open(new_pipeline_config_path, 'w') as f:

  # Set labelmap path
  config = re.sub('label_map_path: ".*?"', 
             'label_map_path: "{}"'.format(images_dir_labelmap_path), config)

  # Set fine_tune_checkpoint_type
  config = re.sub('fine_tune_checkpoint_type: ".*?"',
                  'fine_tune_checkpoint_type: "{}"'.format(args.fine_tune_checkpoint_type), config)  

  # Set fine_tune_checkpoint path
  config = re.sub('fine_tune_checkpoint: ".*?"',
                  'fine_tune_checkpoint: "{}"'.format(args.fine_tune_checkpoint), config)
  
  # Set train tf-record file path
  config = re.sub('(input_path: ".*?)(PATH_TO_BE_CONFIGURED/train)(.*?")', 
                  'input_path: "{}"'.format(images_dir_train_record_path), config)
  
  # Set test tf-record file path
  config = re.sub('(input_path: ".*?)(PATH_TO_BE_CONFIGURED/val)(.*?")', 
                  'input_path: "{}"'.format(images_dir_test_record_path), config)
  
  # Set number of classes.
  config = re.sub('num_classes: [0-9]+',
                  'num_classes: {}'.format(args.num_classes), config)
  
  # Set batch size
  config = re.sub('batch_size: [0-9]+',
                  'batch_size: {}'.format(args.batch_size), config)
  
  # Set training steps
  config = re.sub('num_steps: [0-9]+',
                  'num_steps: {}'.format(int(args.num_steps)), config)
    
  f.write(config)

# List directory
print("Listing path / dir: ", args.config_directory_output)
arr = os.listdir(args.config_directory_output)
print(arr)

The directory listing of the mounted output is now as follows:

Listing path / dir:  /mnt/azureml/cr/j/6a153baacc664cada4060f0b95adbf0e/cap/data-capability/wd/config_directory_output
['ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8_steps125000_batch16.config']

BUG: The mounted output directory now includes the file, but the newly written file does not appear in the Azure ML datastore when viewed in Azure Storage Explorer / the Azure Portal GUI.

To Reproduce

Steps to reproduce the behavior:

  1. Create a new Azure ML datastore backed by a new container in the storage account
  2. Create a pipeline with a job whose output is the newly created Azure ML datastore
  3. Write a file to the output in a pipeline job
  4. Run the pipeline
  5. Confirm that the file is not created in the Azure ML datastore / Azure Storage blob container
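For step 1, the datastore can be registered in the Studio UI or with the Azure ML CLI v2; a minimal YAML sketch might look like the following (the storage account and container names are placeholders for your own resources):

```yaml
# datastore.yml - registers an existing blob container as an Azure ML datastore.
# account_name and container_name below are placeholders.
name: tfconfigs
type: azure_blob
account_name: mystorageaccount
container_name: tfconfigs
```

It can then be registered with: az ml datastore create --file datastore.yml --resource-group <rg> --workspace-name <ws>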

Expected behavior

Any file written to an output Azure ML datastore in a Python job should be written to the underlying Azure Storage blob container and be available for later use.

Additional context

Using the following tutorials as reference:

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 23 (7 by maintainers)

Most upvoted comments

Hi @D-W- ,

I am still facing this issue in the azure-ai-ml 1.7.2. Any Update on the permanent fix?

Thanks, hopefully the API can be improved soon; I found this particularly hard to discover.

You have this API already: blobstore = ml_client.datastores.get(name='nasfacemodels')

Just making it usable in the Input and Output path would be great. Or better yet, you could add a "store" parameter so I can do this:

Input(type="blob_store", store=ml_client.datastores.get(name='nasfacemodels'), path='models/Deci2')

Then it would be even more clear that you CAN create a connection between pipeline inputs and outputs and Azure datastores.
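Until a "store" parameter like the one proposed above exists, the short-form datastore URI can be built from just the datastore name; datastore_path is a hypothetical helper for illustration, not an SDK API:

```python
def datastore_path(datastore_name: str, path: str) -> str:
    # Short-form azureml:// URI; the workspace is resolved from the
    # context in which the pipeline is submitted.
    return f"azureml://datastores/{datastore_name}/paths/{path}"

print(datastore_path("nasfacemodels", "models/Deci2"))
```

The resulting string can be passed as the path argument of an Input or Output.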

Specifying the output path when defining the component will not work; the default path azureml://datastores/${{default_datastore}}/paths/azureml/${{name}}/${{output_name}}/ is still used.

However, specifying the output path when consuming the component in a pipeline is supported, with code like below:

# in a pipeline
node = component(<component-args>)
node.outputs.output = Output(
    type="uri_folder", mode="rw_mount", path=custom_path
)

Please refer to our sample on this.
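To illustrate where outputs land by default, the template azureml://datastores/${{default_datastore}}/paths/azureml/${{name}}/${{output_name}}/ can be expanded by plain string substitution. This is only a sketch of how the placeholders resolve, not actual SDK behavior:

```python
DEFAULT_OUTPUT_PATH = (
    "azureml://datastores/${{default_datastore}}/paths"
    "/azureml/${{name}}/${{output_name}}/"
)

def resolve(template: str, **values: str) -> str:
    # Substitute each ${{key}} placeholder with its value.
    for key, value in values.items():
        template = template.replace("${{%s}}" % key, value)
    return template

print(resolve(DEFAULT_OUTPUT_PATH,
              default_datastore="workspaceblobstore",
              name="tf_config_update",
              output_name="config_directory_output"))
```

Without an override at consumption time, the output is written under this default location on the workspace's default datastore rather than the datastore named in the component definition.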