aws-sdk-pandas: Slow performance when using read_parquet from S3
Hi,
I would like to open an issue as we have seen quite unsatisfying performance using the read_parquet
function. This is our setup and data below:
- data is in S3: there are 1,164 individual date-time prefixes under the main folder, and the total size of all files is barely 25.6 MB. So it is a large number of small individual files, organized into individual date-time prefixes
- the way we gather these files is by passing the path `s3://.../` and using the `partition_filter` argument. The function call looks like this:
```python
wr.s3.read_parquet(
    path,
    dataset=True,
    partition_filter=filter(),
)
```
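For reference, `partition_filter` expects a callable that receives each partition's values as a `dict` of strings and returns `True` for partitions to keep. A minimal sketch, assuming a hypothetical `dt` partition column and cutoff date:

```python
from datetime import datetime

# Hypothetical predicate: keep only partitions on or after 2022-01-01.
# awswrangler calls this once per partition, passing its values as strings.
def keep_recent(partition: dict) -> bool:
    return datetime.strptime(partition["dt"], "%Y-%m-%d") >= datetime(2022, 1, 1)

# Usage (requires AWS credentials; the bucket/prefix below are illustrative):
# import awswrangler as wr
# df = wr.s3.read_parquet(
#     "s3://my-bucket/my-dataset/",
#     dataset=True,
#     partition_filter=keep_recent,
# )
```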
I’ve run a couple of tests to verify whether there would be any speed improvement if I passed a list of prefixes for the function to combine instead of using the `partition_filter`, but the gain was marginal. Enabling `use_threads=True` gave no improvement. Overall it takes around 13 minutes to collect all files… this is just too long. Downloading them with `aws sync` takes a few seconds.
Our main use case for operating on streams is in AWS Batch. We have some data loaders that use the data wrangler when we train our ML model in AWS Batch. We realized after some time that the main contributor to the extended training time is the part where the data is collected from AWS using the data wrangler (primarily `wr.s3.read_parquet`). Please also note that we’re not talking about big data here; most of our use cases look like the one described above.
At the moment we’re wondering whether this can be optimized, or if we should move away from the streaming approach and simply download the data onto the container for model training. Could you give some advice? What’s your take on that?
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 16 (8 by maintainers)
@konradsemsch Happy to hear that 🙂 I’ve now also created a PR addressing the issue. It is now also possible to directly set the number of used threads via the `use_threads` parameter, which might increase your speed even further 😉

Even when using `use_threads=True`, loading and writing data with awswrangler is extraordinarily slow. I have data partitioned by day, and awswrangler takes at least 10x longer to read the data than directly loading the parquet files.
FWIW, I ran this against a fairly large dataset as well and saw a ~20% speedup and a 6% increase in memory usage 🎉
This was for a single partition that’s part of a parquet dataset, with chunked compressed files (12 chunks within the partition) on a machine with 48 CPUs and a 10Gbps network interface (r5.12xl).
Before:
After:
In which setup/environment are you using `awswrangler`? If it’s something like a small EC2 instance, Lambda, etc., which might only have 1 or 2 CPUs, it might be worth setting the number of threads directly.

@maxispeicher tnx, works 😃 A general question: I don’t see an improvement in terms of times. I suspect it is because we have a large number of small files (~5 KB) and call `wr.s3.read_parquet` with a list of 1-50 paths from a loop (code attached). Can you suggest additional tunings for the wrangler?
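For the small-instance case, a sketch of pinning the thread count to an explicit integer instead of relying on `use_threads=True` (the cap of 8 is an arbitrary illustration, not a recommended value):

```python
import os

# use_threads=True lets awswrangler size its pool from os.cpu_count();
# on a small instance it can help to pass an explicit integer instead.
n_threads = min(8, os.cpu_count() or 1)

# Usage (requires AWS credentials; `path` is illustrative):
# import awswrangler as wr
# df = wr.s3.read_parquet(path, dataset=True, use_threads=n_threads)
```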
@maxspeicher, I’ve slotted time to test this this week. I’ll report back.