azure-sdk-for-python: Dataset.download() hangs for a long time and, when done, some files are missing in the destination folder

  • Package Name: azureml-core | azureml-dataset-runtime
  • Package Version: 1.36.0.post2 | 1.36.0
  • Operating System: Windows 10, macOS
  • Python Version: 3.8

Describe the bug I tried to use the Dataset.download() method to download a registered dataset (made of multiple files) on my personal computer (Windows 10, ~50 Mbps connection). For small test datasets (a few MB), it works as expected. For bigger datasets (~3 GB), the download hangs, or it terminates after a long time with no exception raised and no errors logged. Furthermore, some of the files are missing from the target folder. The same happens on my colleague's macOS laptop.

Everything works properly in my Azure ML virtual machine (running Linux).

To Reproduce

  1. Register a dataset from a datastore.
  2. Try to download it with
from azureml.core import Workspace, Dataset
from azureml.core.authentication import InteractiveLoginAuthentication

# Fill in the workspace arguments
workspace = Workspace.get(
    name="",
    subscription_id="",
    resource_group="",
    auth=InteractiveLoginAuthentication(tenant_id="")
)

dataset = Dataset.get_by_name(workspace, "<dataset-name>")
dataset.download("your/target/path")
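
To check whether files were silently dropped, the array returned by download() can be compared against the dataset's own file listing from to_path(). This is only a rough diagnostic sketch, not an official check; comparing by base name is an assumption of mine and may miscount if different folders contain files with the same name.

import os

# download() returns the paths of the files it actually wrote
downloaded = dataset.download("your/target/path", overwrite=True)

# to_path() lists every file path the dataset is supposed to contain
expected = dataset.to_path()

print(f"expected {len(expected)} files, got {len(downloaded)}")
if len(downloaded) < len(expected):
    # Rough comparison by base name to report which files went missing
    written = {os.path.basename(p) for p in downloaded}
    missing = [p for p in expected if os.path.basename(p) not in written]
    print("missing files:", missing)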

Expected behavior The dataset is downloaded to “your/target/path”.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 18 (6 by maintainers)

Most upvoted comments

@janluke Sorry for the delay in getting back to you. This is definitely neither expected nor something we have seen before. Given that you can repro this across machines and users, I think the issue lies in your specific set of files and setup. To help us investigate, could you please share some additional info with me:

  1. What is the datastore type used in this scenario (Azure Blob, Azure Data Lake Gen 2, Azure File Share)?
  2. How is the file dataset defined? Ideally, post the output of dataset._dataflow._steps from your environment.
  3. What is the structure of the source datastore in terms of the number of files and folders?
  4. Have you configured any VPNs, custom proxies, or credential-less datastores?
  5. Last but not least, please share your telemetry session ID by running the following before a new download attempt (this would let us check telemetry on our side):
from azureml._base_sdk_common import _ClientSessionId
print(_ClientSessionId)
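
For item 2, the dataflow steps can be dumped in the same session; a minimal sketch, where workspace and "<dataset-name>" are placeholders from the repro snippet above and _dataflow._steps is the private attribute named in item 2 (for diagnostics only):

from azureml.core import Dataset

# Print how the file dataset is defined (private API, diagnostics only)
dataset = Dataset.get_by_name(workspace, "<dataset-name>")
print(dataset._dataflow._steps)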

Thanks!