azure-sdk-for-python: Listing blob names is very slow
- Package Name: azure-storage-blob
- Package Version: 12.8.1
- Operating System: linux (Debian 9)
- Python Version: 3.8
Describe the bug
For containers containing many blobs, listing blob names takes a lot of time and uses a lot of CPU, while it was fast for azure-storage-blob 2.1.0
To Reproduce
Steps to reproduce the behavior:
1. Create a blob storage container with at least 5000 blobs (i.e. maxresults for a single page returned by the List Blobs API).
2. Use azure-storage-blob 2.1.0 and its list_blob_names function to list the blob names for this container, and write down the CPU time it takes (on my machine, it’s 376 ms).
3. Use azure-storage-blob 12.8.1. Unfortunately, it does not have a list_blob_names function, so I have to use list_blobs and access blob.name on each result. Again, write down the CPU time this takes (on my machine, it’s 2760 ms).
4. Compare the CPU times from steps 2 and 3 and note the large factor (more than 7 in my case).
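The measurement in steps 2 and 3 can be sketched roughly as follows. This is only an illustration, not the script used for the numbers above: measure_cpu_time and names_via_list_blobs are hypothetical helper names, and the container client is assumed to be an azure-storage-blob 12.x ContainerClient (any object with a compatible list_blobs method works).

```python
import time
from typing import Callable, Iterable, List, Tuple


def measure_cpu_time(
    produce_names: Callable[[], Iterable[str]]
) -> Tuple[float, List[str]]:
    """Return (CPU seconds, names) for fully consuming the listing.

    time.process_time() counts CPU time of this process only, so time
    spent waiting on the network is not included.
    """
    start = time.process_time()
    names = list(produce_names())
    return time.process_time() - start, names


def names_via_list_blobs(container_client) -> Iterable[str]:
    """Yield blob names through the 12.x list_blobs API (step 3)."""
    for blob in container_client.list_blobs():
        yield blob.name


# Hypothetical usage with a real container (placeholders, not from the issue):
#   from azure.storage.blob import ContainerClient
#   client = ContainerClient.from_connection_string("<conn-str>", "<container>")
#   seconds, names = measure_cpu_time(lambda: names_via_list_blobs(client))
#   print(f"{len(names)} names in {seconds * 1000:.0f} ms CPU time")
```

Measuring CPU time rather than wall-clock time keeps network latency out of the comparison, which matches how the numbers in the steps are reported.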
Expected behavior
There is a way of listing blob names in azure-storage-blob 12.X that has similar performance as in azure-storage-blob 2.X.
Additional context
This might not be relevant for containers with a few thousand blobs. However, we have containers with a few hundred thousand to a million blobs, and bookkeeping operations that rely on listing the blob content, which used to consume a little more than 1 minute of CPU time with azure-storage-blob 2.X, now take almost 10 minutes. That is a significant contribution to the runtime of these tasks.
azure-storage-blob 2.X was affected by a similar problem, which has been addressed via https://github.com/Azure/azure-storage-python/pull/545. See also the additional context there, in particular the use cases listed in https://github.com/Azure/azure-storage-python/pull/545#issuecomment-457504607
I would be willing to contribute a corresponding patch. However, it seems azure-storage-blob 12.X deserializes the XML response in a different way, which makes it harder to customize deserialization than it was in 2.X.
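To illustrate what a names-only deserialization shortcut could look like, here is a minimal sketch that pulls only the Name of each Blob element out of a List Blobs XML response and ignores all other properties. It is not the SDK's implementation in either 2.X or 12.X; iter_blob_names and the SAMPLE payload are made up for illustration, and the sample contains only a small subset of the fields a real response carries.

```python
import io
import xml.etree.ElementTree as ET
from typing import Iterator

# Hypothetical, trimmed-down List Blobs response body.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<EnumerationResults ContainerName="demo">
  <Blobs>
    <Blob>
      <Name>a.txt</Name>
      <Properties><Content-Length>1</Content-Length></Properties>
    </Blob>
    <Blob>
      <Name>dir/b.txt</Name>
      <Properties><Content-Length>2</Content-Length></Properties>
    </Blob>
  </Blobs>
  <NextMarker />
</EnumerationResults>"""


def iter_blob_names(xml_bytes: bytes) -> Iterator[str]:
    """Yield only the blob names from one page of List Blobs XML."""
    # iterparse emits each element once its closing tag has been read.
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "Blob":
            name = elem.findtext("Name")
            if name is not None:
                yield name
            # Drop the children so per-blob data is not kept around.
            elem.clear()


print(list(iter_blob_names(SAMPLE)))
```

The point of the sketch is that no per-blob properties object is constructed: everything under Properties is parsed by the XML reader but then discarded.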
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 24 (15 by maintainers)
I’m happy to share that we finally have opened #25747 to add list_blob_names to the Track2 Blob SDK. This API, like the Track1 equivalent, will call the standard List Blobs API but only parse and return the blob names, which results in a significant speedup over the traditional list_blobs API when only names are desired.
Some initial perf testing results show this API is 1.5-14 times faster than the existing list_blobs, depending on the number of blobs in the container (the more blobs, the larger the performance increase). See the comment in #25747 for details. Specifically for the OP’s case of 5000 blobs, we expect somewhere around 8-10x improvement.
We will work to get this merged and released ASAP. Thanks all for your patience, and apologies for the long delay on getting this in.
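For callers that need to run against SDK versions both with and without the new API, a fallback along these lines may be useful. This is a sketch under assumptions: it assumes the method is exposed on the container client as list_blob_names and accepts the same name_starts_with keyword as list_blobs; iter_names is a hypothetical helper, not part of the SDK.

```python
from typing import Iterator, Optional


def iter_names(container_client, prefix: Optional[str] = None) -> Iterator[str]:
    """Yield blob names, preferring the names-only API when it exists."""
    lister = getattr(container_client, "list_blob_names", None)
    if lister is not None:
        # Names-only path: the client yields plain strings.
        yield from lister(name_starts_with=prefix)
    else:
        # Older 12.x releases: full listing, then keep only the name.
        for blob in container_client.list_blobs(name_starts_with=prefix):
            yield blob.name
```

With this shape, the calling code stays the same across versions and picks up the faster path once the installed package provides it.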
@tasherif-msft: Is adding list_blob_names back to the SDK still under consideration?
@tasherif-msft can you sync with @annatisch about this
Thanks for your patience @jochen-ott-by! We’ve been chatting about the best approach, and are considering tackling this by addressing both options 1 & 3 in my post above (where option 3 takes the form of a separate list_names API, similar to the v2 SDK).
I currently have a working prototype in development here that we have been running perf tests on: https://github.com/Azure/azure-sdk-for-python/pull/19814
The numbers are looking promising, with improvements to listing in general, as well as providing the “names-only” deserialization shortcut. There’s still a fair amount of work to be done to get these strategies “production-ready”, as they dig quite deep into the HTTP pipeline code, and will need thorough testing - so I cannot give you a concrete timeframe when they will land in a release at this point.
We will keep the thread open and updated as we progress. Thanks again for the report!