arrow: [Parquet][Python] Potential regression in Parquet parallel reading
Describe the enhancement requested
UPDATE: on closer look, this seems more like a bug. What happens when calling to_table() on a FileSystemDataset in Python using pyarrow.fs.S3FileSystem:
- Using https://github.com/apache/arrow/commit/02de3c1789460304e958936b78d60f824921c250, one HEAD request and two GET requests are made for each file. Also the requests are made concurrently.
- With current main, there are two HEAD requests and three GET requests for each file. Also, the first HEAD request is made from the main thread, so the downloads are started sequentially. I would expect to see only one HEAD request; I'm not sure whether the three GET requests are expected due to some change.
Here’s an example using https://github.com/apache/arrow/commit/02de3c1789460304e958936b78d60f824921c250, reading a FileSystemDataset with fragment_readahead = 100 and I/O concurrency set to 100. The Y-axis represents files, the X-axis is time in seconds, and each point is the relative start time of a request (HEAD or GET):
With the current main (https://github.com/apache/arrow/commit/fc8c6b7dc8287c672b62c62f3a2bd724b3835063), it seems that the first request for each file is made from the same thread (blue), and notably there are five requests per file.
See comment below for reproducible example.
I’m running on macOS 14.1.
Component(s)
Parquet
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Comments: 15 (15 by maintainers)
Commits related to this issue
- GH-38591: [Parquet][C++] Remove redundant open calls in `ParquetFileFormat::GetReaderAsync` (#38621) ### Rationale for this change There were duplicate method calls causing extra I/O operations, a... — committed to apache/arrow by eeroel 8 months ago
- GH-38591: [Parquet][C++] Remove redundant open calls in `ParquetFileFormat::GetReaderAsync` (#38621) ### Rationale for this change There were duplicate method calls causing extra I/O operations, a... — committed to JerAguilon/arrow by eeroel 8 months ago
- GH-38591: [Parquet][C++] Remove redundant open calls in `ParquetFileFormat::GetReaderAsync` (#38621) ### Rationale for this change There were duplicate method calls causing extra I/O operations, a... — committed to loicalleyne/arrow by eeroel 8 months ago
- GH-38591: [Parquet][C++] Remove redundant open calls in `ParquetFileFormat::GetReaderAsync` (#38621) ### Rationale for this change There were duplicate method calls causing extra I/O operations, a... — committed to dgreiss/arrow by eeroel 8 months ago
I think we can provide a MockInputStream (with readAsync and read counting) and hardcode an expected IO count here. Any change that alters the IO count would then be reported here.
Also cc @pitrou for any more ideas…
But you can go ahead with a quick fix for this issue.
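The counting-stream idea suggested above could look roughly like the following. This is only a minimal Python sketch of the testing pattern, not Arrow's actual test code: `CountingStream`, `read_with_footer`, `make_file` and `EXPECTED_READS` are hypothetical names, the "file format" is a toy one (body followed by an 8-byte length footer), and the async read path (`readAsync`) is not modelled.

```python
import io


class CountingStream(io.BytesIO):
    """In-memory stream that counts read() and seek() calls (hypothetical helper)."""

    def __init__(self, data: bytes):
        super().__init__(data)
        self.read_calls = 0
        self.seek_calls = 0

    def read(self, size=-1):
        self.read_calls += 1
        return super().read(size)

    def seek(self, offset, whence=io.SEEK_SET):
        self.seek_calls += 1
        return super().seek(offset, whence)


def make_file(body: bytes) -> bytes:
    """Toy format: the body followed by an 8-byte little-endian body length."""
    return body + len(body).to_bytes(8, "little")


def read_with_footer(stream, footer_size: int = 8) -> bytes:
    """Toy reader: one read for the footer, then one read for the body."""
    stream.seek(-footer_size, io.SEEK_END)
    footer = stream.read(footer_size)
    body_len = int.from_bytes(footer, "little")
    stream.seek(0)
    return stream.read(body_len)


# Hardcoded expectation: any change that adds or removes a read fails the check.
EXPECTED_READS = 2

stream = CountingStream(make_file(b"column data"))
assert read_with_footer(stream) == b"column data"
assert stream.read_calls == EXPECTED_READS
```

If a later change made the reader fetch the footer twice, `read_calls` would become 3 and the assertion would fail, which is the kind of signal that would have caught the extra HEAD/GET requests described in this issue.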