azure-sdk-for-python: VERY slow large blob downloads

I am confused about how to optimize BlobClient for downloading large blobs (up to 100 GB).

For example, on a ~480 MB blob the following code takes around 4 minutes to execute:

from azure.storage.blob import BlobClient

full_path_to_file = '{}/{}'.format(staging_path, blob_name)
blob = BlobClient.from_connection_string(conn_str=connection_string, container_name=container_name, blob_name=blob_name)
with open(full_path_to_file, "wb") as my_blob:
    download_stream = blob.download_blob()
    # readall() buffers the entire blob in memory before writing it out
    result = my_blob.write(download_stream.readall())

In the previous version of the SDK I was able to specify a max_connections parameter that sped up downloads significantly. This appears to have been removed (along with progress callbacks, which is annoying). I have files upwards of 99 GB, which will take almost 13 hours to download at this rate, whereas I used to be able to download similar files in under two hours.

How can I optimize the download of large blobs?

Thank you!

Edit: I meant that it took 4 minutes to download a 480 megabyte file. Also, I am getting memory errors when trying to download larger files (~40 GB).
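A minimal sketch of a streaming approach, assuming azure-storage-blob v12: max_concurrency is the v12 counterpart of the old max_connections parameter, and readinto() writes the blob to the open file chunk by chunk instead of buffering everything in memory the way readall() does. The concurrency value of 8 is illustrative; variable names match the snippet above.

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str=connection_string,
    container_name=container_name,
    blob_name=blob_name,
)
with open(full_path_to_file, "wb") as my_blob:
    # Parallel range downloads; tune max_concurrency to your link.
    download_stream = blob.download_blob(max_concurrency=8)
    # Streams to disk chunk by chunk, avoiding the memory errors
    # that readall() causes on multi-GB blobs.
    download_stream.readinto(my_blob)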

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 23 (7 by maintainers)

Most upvoted comments

I experienced timeouts on larger downloads as well when using .readall(): >100 GB commonly failed, and >200 GB would always fail; more on that below. Of note, max_concurrency did NOT resolve this for me. It seems the Auth header timestamp got older than the accepted 25-minute age limit, so the client isn't refreshing the header automatically. I was able to work around it, in an ugly manner:

  1. Download in 1 GB range-based chunks: download_blob(offset=start, length=end).download_to_stream(MemBlob, max_concurrency=12)
  2. Override the retry settings in BlobServiceClient.from_connection_string(<here>) to fail immediately (the default retries might be the cause of the timeout to begin with)
  3. Validate that the segment received is the expected size
  4. If an exception is thrown or the segment is not the expected size (the last segment will almost always be smaller, of course), re-authenticate and retry that segment

Rinse and repeat until the download completes (a sketch of this loop follows below). Note that I build a checksum as I download, since I know the checksum of the original file, so I have high confidence in file integrity and can validate at the end. Performance-wise, on a 1 Gbps link I get ~430 Mbps / 53.75 MB/s for a single blob out of cool storage. The Azure-side cool tier limit is about 60 MB/s, so it seems to work pretty well.
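A hedged sketch of that loop, assuming azure-storage-blob v12. The 1 GiB segment size, retry count, MD5 checksum, and retry_total=0 client setting are illustrative assumptions, and recreating the client is a blunt way to force a freshly signed Auth header.

import hashlib
from azure.storage.blob import BlobClient

CHUNK_SIZE = 1024 * 1024 * 1024  # 1 GiB segments (illustrative)
MAX_RETRIES = 5

def download_in_segments(connection_string, container_name, blob_name, dest_path):
    def make_client():
        # retry_total=0 disables the SDK's built-in retries so a stalled
        # segment fails fast and we handle retry (and reauth) ourselves.
        return BlobClient.from_connection_string(
            conn_str=connection_string,
            container_name=container_name,
            blob_name=blob_name,
            retry_total=0,
        )

    blob = make_client()
    total_size = blob.get_blob_properties().size
    checksum = hashlib.md5()  # assumes the original file's checksum is MD5
    with open(dest_path, "wb") as f:
        offset = 0
        while offset < total_size:
            length = min(CHUNK_SIZE, total_size - offset)
            data = None
            for _ in range(MAX_RETRIES):
                try:
                    segment = blob.download_blob(
                        offset=offset, length=length, max_concurrency=12
                    ).readall()
                except Exception:
                    blob = make_client()  # re-sign the Auth header, retry
                    continue
                if len(segment) == length:  # validate the segment size
                    data = segment
                    break
                blob = make_client()
            if data is None:
                raise RuntimeError("segment at offset {} failed".format(offset))
            f.write(data)
            checksum.update(data)
            offset += length
    return checksum.hexdigest()  # compare against the known checksum at the end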