azure-sdk-for-python: Listing blob names is very slow

Package Name: azure-storage-blob
Package Version: 12.8.1
Operating System: linux (Debian 9)
Python Version: 3.8

Describe the bug

For containers containing many blobs, listing blob names takes a long time and uses a lot of CPU, whereas it was fast with azure-storage-blob 2.1.0.

To Reproduce

Steps to reproduce the behavior:

1. Create a blob storage container with at least 5000 blobs (i.e. maxresults for a single page returned by the List Blobs API).
2. Using azure-storage-blob 2.1.0, call list_blob_names to list the blob names for this container and write down the CPU time it takes (on my machine, it’s 376ms).
3. Using azure-storage-blob 12.8.1, do the same. Unfortunately, it does not have a list_blob_names function, so I have to use list_blobs and access blob.name on each result. Again, write down the CPU time this takes (on my machine, it’s 2760ms).
4. Compare the CPU times from steps 2 and 3 and note the large factor (more than 7 in my case).
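
The measurement in the steps above could be scripted roughly as follows. This is only a sketch: time_name_listing and names_via_list_blobs are hypothetical helper names, not part of either SDK, and CPU time is taken with time.process_time, which is one reasonable way to get the kind of number quoted in the report.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_name_listing(list_names: Callable[[], Iterable[str]]) -> Tuple[List[str], float]:
    """Exhaust the iterable returned by list_names(); return (names, CPU seconds)."""
    start = time.process_time()  # process CPU time (user + system), not wall clock
    names = list(list_names())
    return names, time.process_time() - start


def names_via_list_blobs(container_client) -> Iterable[str]:
    """12.x route from step 3: list full blob objects and keep only .name."""
    return (blob.name for blob in container_client.list_blobs())


# Usage against a real container (needs credentials, so left as comments):
#
#   azure-storage-blob 12.x
#     from azure.storage.blob import ContainerClient
#     client = ContainerClient.from_connection_string(conn_str, container_name)
#     names, cpu = time_name_listing(lambda: names_via_list_blobs(client))
#
#   azure-storage-blob 2.1.0
#     from azure.storage.blob import BlockBlobService
#     service = BlockBlobService(connection_string=conn_str)
#     names, cpu = time_name_listing(lambda: service.list_blob_names(container_name))
```

Since the two SDK generations cannot share one environment easily, each variant would be run in its own virtualenv and the two CPU times compared by hand.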

Expected behavior

There should be a way of listing blob names in azure-storage-blob 12.X whose performance is similar to that of azure-storage-blob 2.X.

Additional context

This might not be relevant for containers with a few thousand blobs. However, we have containers with a few hundred thousand to a million blobs, and bookkeeping operations that rely on listing the blob content, which used to consume a little more than 1 minute of CPU time with azure-storage-blob 2.X, now take almost 10 minutes. That is a significant contribution to the runtime of these tasks.

azure-storage-blob 2.X was affected by a similar problem, which was addressed via https://github.com/Azure/azure-storage-python/pull/545 . See also the additional context there, in particular the use cases listed in https://github.com/Azure/azure-storage-python/pull/545#issuecomment-457504607

I would be willing to contribute a corresponding patch. However, it seems azure-storage-blob 12.X deserializes the XML response in a different way, which makes it harder to customize deserialization than it was in 2.X.
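
To illustrate what a names-only deserialization could look like, here is a minimal sketch that reads one List Blobs response page and keeps only each blob's Name plus the NextMarker. It is not the code from either SDK: blob_names_from_page is a hypothetical helper, and the element names (Blob, Name, NextMarker) are assumed from the documented List Blobs response layout.

```python
import io
import xml.etree.ElementTree as ET
from typing import List, Tuple


def blob_names_from_page(xml_bytes: bytes) -> Tuple[List[str], str]:
    """Return (blob names, next marker) for one List Blobs XML page.

    Only the <Name> child of each <Blob> is read; properties and metadata
    are never turned into Python objects.
    """
    names: List[str] = []
    next_marker = ""
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "Blob":
            # Direct child lookup, so a metadata key called "Name" is not picked up.
            names.append(elem.findtext("Name", default=""))
            elem.clear()  # drop the already-processed subtree
        elif elem.tag == "NextMarker":
            next_marker = elem.text or ""  # empty marker means this was the last page
    return names, next_marker
```

A caller would request pages until the returned marker is empty, passing each marker as the marker parameter of the next request.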

About this issue

State: closed
Created 3 years ago
Reactions: 1
Comments: 24 (15 by maintainers)

Commits related to this issue

CodeGen from PR 19755 in Azure/azure-rest-api-specs Revert "Adding status code 202 to Private endpoints PUT (#19125)" (#19755) This reverts commit 5e7603d4591ae39f9c2cedea75c8d97185e0aab2. — committed to azure-sdk/azure-sdk-for-python by deleted user 2 years ago

Most upvoted comments

I’m happy to share that we have finally opened #25747 to add list_blob_names to the Track2 Blob SDK. This API, like the Track1 equivalent, will call the standard List Blobs API but only parse and return the blob names, which results in a significant speedup over the traditional list_blobs API when only names are desired.

Some initial perf testing results show this API is 1.5-14 times faster than the existing list_blobs, depending on the number of blobs in the container (the more blobs, the larger the performance increase). See the comment in #25747 for details. Specifically for the OP’s case of 5000 blobs, we expect somewhere around 8-10x improvement.
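
For code that has to run against both older and newer 12.x releases, one possible pattern is to prefer list_blob_names when the installed ContainerClient has it and otherwise fall back to list_blobs. This is a sketch under that assumption: iter_blob_names is a hypothetical helper, and which release first ships list_blob_names should be checked in the package changelog.

```python
from typing import Iterator, Optional


def iter_blob_names(container_client, prefix: Optional[str] = None) -> Iterator[str]:
    """Yield blob names, using the names-only API when the client provides it."""
    if hasattr(container_client, "list_blob_names"):
        # Names-only listing: the service call is the same, but only names are parsed.
        yield from container_client.list_blob_names(name_starts_with=prefix)
    else:
        # Older 12.x releases: full BlobProperties objects, keep only .name.
        for blob in container_client.list_blobs(name_starts_with=prefix):
            yield blob.name


# Usage (needs credentials, so left as comments):
#   from azure.storage.blob import ContainerClient
#   client = ContainerClient.from_connection_string(conn_str, container_name)
#   for name in iter_blob_names(client, prefix="logs/"):
#       print(name)
```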

We will work to get this merged and released ASAP. Thanks all for your patience and apologies for the long delay on getting this in.

jalauzon-msft on Aug 19, 2022

@tasherif-msft: Is adding list_blob_names back to the SDK still under consideration?

mikeharder on Mar 1, 2022

@tasherif-msft can you sync with @annatisch about this

amishra-dev on Dec 1, 2021

Thanks for your patience @jochen-ott-by! We’ve been chatting about the best approach, and are considering tackling this by addressing both options 1 & 3 in my post above (where option 3 takes the form of a separate list_names API, similar to the v2 SDK).

I currently have a working prototype in development here that we have been running perf tests on: https://github.com/Azure/azure-sdk-for-python/pull/19814

The numbers are looking promising, with improvements to listing in general, as well as providing the “names-only” deserialization shortcut. There’s still a fair amount of work to be done to get these strategies “production-ready”, as they dig quite deep into the HTTP pipeline code, and will need thorough testing - so I cannot give you a concrete timeframe when they will land in a release at this point.

We will keep the thread open and updated as we progress. Thanks again for the report!

annatisch on Jul 26, 2021