azure-sdk-for-python: Dataset.download() hangs for a long time and, when done, some files are missing in the destination folder

  • Package Name: azureml-core | azureml-dataset-runtime
  • Package Version: 1.36.0.post2 | 1.36.0
  • Operating System: Windows 10, macOS
  • Python Version: 3.8

Describe the bug I tried to use the Dataset.download() method to download a registered dataset (made of multiple files) on my personal computer (Windows 10, ~50 Mbps connection). For small test datasets (a few MB), it works as expected. For bigger datasets (~3 GB), the download hangs, or it terminates after a long time with no exception raised and no errors logged. Furthermore, some of the files are missing from the target folder. The same happens on my colleague's macOS laptop.

Everything works properly in my Azure ML virtual machine (running Linux).

To Reproduce

  1. Register a dataset from a datastore.
  2. Try to download it with
from azureml.core import Workspace, Dataset
from azureml.core.authentication import InteractiveLoginAuthentication

# Fill in the workspace arguments
workspace = Workspace.get(
    name="",
    subscription_id="",
    resource_group="",
    auth=InteractiveLoginAuthentication(tenant_id="")
)

dataset = Dataset.get_by_name(workspace, "<dataset-name>")
dataset.download("your/target/path")
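
To check whether files were silently dropped, the array returned by download() can be compared against the dataset's own file listing from to_path(). This is only a rough diagnostic sketch, not an official check; comparing by base name is an assumption of mine and may miscount if different folders contain files with the same name.

import os

# download() returns the paths of the files it actually wrote
downloaded = dataset.download("your/target/path", overwrite=True)

# to_path() lists every file path the dataset is supposed to contain
expected = dataset.to_path()

print(f"expected {len(expected)} files, got {len(downloaded)}")
if len(downloaded) < len(expected):
    # Rough comparison by base name to report which files went missing
    written = {os.path.basename(p) for p in downloaded}
    missing = [p for p in expected if os.path.basename(p) not in written]
    print("missing files:", missing)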

Expected behavior The dataset is downloaded to “your/target/path”.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 18 (6 by maintainers)

Most upvoted comments

@janluke Sorry for the delay in getting back to you. This is definitely neither expected nor something we have seen before. Given that you can repro this across machines and users, I think the issue lies in your specific set of files and setup. To help us investigate, could you please share some additional info with me:

  1. What is the datastore type used in this scenario (Azure Blob, Azure Data Lake Gen 2, Azure File Share)?
  2. How is the file dataset defined? Ideally, post the output of dataset._dataflow._steps from your environment.
  3. What is the structure of the source datastore in terms of the number of files and folders?
  4. Have you configured any VPNs, custom proxies, or credential-less datastores?
  5. Last but not least, please share your telemetry session ID by running the following before a new download attempt (this would let us check telemetry on our side):
from azureml._base_sdk_common import _ClientSessionId
print(_ClientSessionId)
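
For item 2, the dataflow steps can be dumped in the same session; a minimal sketch, where workspace and "<dataset-name>" are placeholders from the repro snippet above and _dataflow._steps is the private attribute named in item 2 (for diagnostics only):

from azureml.core import Dataset

# Print how the file dataset is defined (private API, diagnostics only)
dataset = Dataset.get_by_name(workspace, "<dataset-name>")
print(dataset._dataflow._steps)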

Thanks!