azure-sdk-for-python: Listing blob names is very slow

  • Package Name: azure-storage-blob
  • Package Version: 12.8.1
  • Operating System: linux (Debian 9)
  • Python Version: 3.8

Describe the bug

For containers containing many blobs, listing blob names takes a lot of time and uses a lot of CPU, whereas it was fast with azure-storage-blob 2.1.0.

To Reproduce

Steps to reproduce the behavior:

  1. Create a blob storage container with at least 5000 blobs (i.e. maxresults for a single page returned by the List Blobs API).
  2. Using azure-storage-blob 2.1.0, call list_blob_names to list the blob names for this container and write down the CPU time it takes (on my machine, it’s 376ms).
  3. Switch to azure-storage-blob 12.8.1. Unfortunately, it does not have a list_blob_names function, so I have to use list_blobs and access blob.name on each result. Again, write down the CPU time this takes (on my machine, it’s 2760ms).
  4. Compare the CPU times from steps 2 and 3 and note the large factor (more than 7 in my case).
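
The measurement in steps 2 to 4 can be sketched roughly as follows for the 12.X side. This is a minimal sketch, not the script used for the numbers above: `cpu_time_of_listing` and `names_v12` are hypothetical helper names, and the connection string and container name are placeholders you would supply yourself.

```python
import time


def cpu_time_of_listing(list_names):
    """Consume a names iterable and return (count, CPU seconds used).

    `list_names` is any zero-argument callable returning an iterable of
    blob names. time.process_time() counts CPU time of this process only,
    so time spent waiting on the network is not included.
    """
    start = time.process_time()
    count = sum(1 for _ in list_names())
    return count, time.process_time() - start


def names_v12(conn_str, container):
    """Yield blob names via list_blobs in azure-storage-blob 12.X.

    The import is local so the timing helper above can be used without
    the SDK installed. `conn_str` and `container` are placeholders.
    """
    from azure.storage.blob import ContainerClient

    client = ContainerClient.from_connection_string(
        conn_str, container_name=container
    )
    return (blob.name for blob in client.list_blobs())
```

Calling `cpu_time_of_listing(lambda: names_v12(conn_str, "my-container"))` under each SDK version (with an equivalent wrapper around the 2.1.0 `list_blob_names`) should give numbers comparable to the ones quoted in steps 2 and 3.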

Expected behavior

There is a way of listing blob names in azure-storage-blob 12.X that has similar performance as in azure-storage-blob 2.X.

Additional context

This might not be relevant for containers with a few thousand blobs. However, we have containers with a few hundred thousand to a million blobs, and bookkeeping operations that rely on listing the blobs used to consume a little more than 1 minute of CPU time with azure-storage-blob 2.X and now take almost 10 minutes, which is a significant contribution to the runtime of these tasks.

azure-storage-blob 2.X was affected by a similar problem, which was addressed in https://github.com/Azure/azure-storage-python/pull/545. See also the additional context there, in particular the use cases listed in https://github.com/Azure/azure-storage-python/pull/545#issuecomment-457504607

I would be willing to contribute a corresponding patch. However, it seems azure-storage-blob 12.X deserializes the XML response in a different way, which makes it harder to customize deserialization than it was in 2.X.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 24 (15 by maintainers)

Most upvoted comments

I’m happy to share that we have finally opened #25747 to add list_blob_names to the Track2 Blob SDK. This API, like the Track1 equivalent, will call the standard List Blobs API but only parse and return the blob names, which results in a significant speedup over the traditional list_blobs API when only names are needed.
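
For anyone who needs code that works both before and after that API ships, something along these lines might do. It is only a sketch: `iter_blob_names` is a hypothetical helper, not part of the SDK, and it assumes the released `list_blob_names` accepts the same `name_starts_with` keyword that `list_blobs` does.

```python
def iter_blob_names(container_client, prefix=None):
    """Return an iterator of blob names for a ContainerClient-like object.

    Uses list_blob_names when the installed SDK provides it, and falls
    back to list_blobs (reading blob.name) otherwise.
    Assumption: both methods accept name_starts_with as a keyword.
    """
    lister = getattr(container_client, "list_blob_names", None)
    if lister is not None:
        # Names-only path: skips building full blob property objects.
        return iter(lister(name_starts_with=prefix))
    # Slower fallback for SDK versions without list_blob_names.
    return (
        blob.name
        for blob in container_client.list_blobs(name_starts_with=prefix)
    )
```

The feature check via `getattr` avoids pinning to a specific package version, which matters here because the release that will carry the API is not known yet.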

Some initial perf testing results show this API is 1.5-14 times faster than the existing list_blobs, depending on the number of blobs in the container (the more blobs, the larger the performance increase). See the comment in #25747 for details. Specifically for the OP’s case of 5000 blobs, we expect somewhere around an 8-10x improvement.

We will work to get this merged and released ASAP. Thanks all for your patience and apologies for the long delay on getting this in.

@tasherif-msft: Is adding list_blob_names back to the SDK still under consideration?

@tasherif-msft can you sync with @annatisch about this?

Thanks for your patience @jochen-ott-by! We’ve been chatting about the best approach, and are considering tackling this by addressing both options 1 & 3 in my post above (where option 3 takes the form of a separate list_names API, similar to the v2 SDK).

I currently have a working prototype in development here that we have been running perf tests on: https://github.com/Azure/azure-sdk-for-python/pull/19814

The numbers are looking promising, with improvements to listing in general, as well as providing the “names-only” deserialization shortcut. There’s still a fair amount of work to be done to get these strategies “production-ready”, as they dig quite deep into the HTTP pipeline code, and will need thorough testing - so I cannot give you a concrete timeframe when they will land in a release at this point.
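
To illustrate the general idea of a “names-only” shortcut (this is not the code from the prototype PR), here is a standalone sketch that pulls just the `Name` of each `Blob` element out of a List Blobs XML response body using the standard library. `iter_blob_names_from_xml` is a hypothetical name, and the sketch assumes the documented response layout where each `Blob` element has a `Name` child.

```python
import io
import xml.etree.ElementTree as ET


def iter_blob_names_from_xml(xml_bytes):
    """Yield blob names from a List Blobs response body.

    Only the Name child of each Blob element is read; the Properties
    and Metadata subtrees are still parsed by the XML parser but are
    never converted into Python model objects.
    """
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "Blob":
            name = elem.findtext("Name")
            if name is not None:
                yield name
            # Drop the children to keep memory flat on large pages.
            elem.clear()
```

The saving comes from skipping per-blob object construction, which is where most of the CPU time goes when only names are wanted.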

We will keep the thread open and updated as we progress. Thanks again for the report!