arrow: [Parquet][Python] Potential regression in Parquet parallel reading

Describe the enhancement requested

UPDATE: on closer inspection, this looks more like a bug. What happens:

When calling to_table() on a FileSystemDataset in Python using pyarrow.fs.S3FileSystem,

  • Using https://github.com/apache/arrow/commit/02de3c1789460304e958936b78d60f824921c250, one HEAD request and two GET requests are made for each file, and the requests are made concurrently.
  • With current main, there are two HEAD requests and three GET requests for each file. The first HEAD request is also made from the main thread, so the downloads start sequentially. I would expect to see only one HEAD request; I am not sure whether the three GETs are expected due to some change.

Here’s an example using https://github.com/apache/arrow/commit/02de3c1789460304e958936b78d60f824921c250, reading a FileSystemDataset using fragment_readahead = 100 and io concurrency set to 100. The Y-axis represents files, the X-axis is time in seconds, and each point is the relative start time of a request (HEAD or GET):

[Screenshot 2023-11-05 at 18 43 54: request start times per file]
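For readers who want to try the read path described above, here is a minimal sketch. It is not the reproducer from the comment referenced below. The bucket path and region are placeholders, and using `pyarrow.set_io_thread_count` for "io concurrency" is an assumption about how that setting was applied.

```python
def read_dataset_concurrently(path: str, region: str, concurrency: int = 100):
    """Read a Parquet dataset from S3 with high fragment readahead.

    `path` (e.g. "my-bucket/some/prefix") and `region` are placeholders
    supplied by the caller; they do not come from the original report.
    """
    # Imports are kept inside the function so the sketch can be pasted
    # into a module without requiring pyarrow at import time.
    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.fs as pafs

    # Assumption: "io concurrency" refers to the size of the IO thread pool.
    pa.set_io_thread_count(concurrency)

    s3 = pafs.S3FileSystem(region=region)
    dataset = ds.dataset(path, format="parquet", filesystem=s3)

    # fragment_readahead controls how many files are read ahead concurrently.
    return dataset.to_table(fragment_readahead=concurrency)
```

Calling this against a bucket with request logging enabled should make it possible to count the HEAD and GET requests per file on each commit.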

With the current main (https://github.com/apache/arrow/commit/fc8c6b7dc8287c672b62c62f3a2bd724b3835063), it seems that the first request for each file is made from the same thread (blue), and notably there are five requests per file.

[Screenshot 2023-11-05 at 18 43 44: request start times per file with current main]

See comment below for reproducible example.

I’m running on macOS 14.1.

Component(s)

Parquet

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

I think we can provide a MockInputStream (with readAsync and read counting) and hardcode the expected IO count in a test. Any change that alters the IO count would then show up as a failure there.
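To illustrate the idea, here is a rough Python sketch of a counting stream with a hardcoded IO count. Everything in it is hypothetical: `CountingReader`, `read_metadata` and the toy file layout (payload, metadata, 4-byte little-endian length, magic) are simplified stand-ins, not Arrow's actual MockInputStream or a real Parquet file.

```python
import io
import struct

MAGIC = b"PAR1"


class CountingReader:
    """Hypothetical in-memory stream that counts positional reads."""

    def __init__(self, data: bytes) -> None:
        self._buf = io.BytesIO(data)
        self.size = len(data)
        self.read_calls = 0

    def read_at(self, offset: int, nbytes: int) -> bytes:
        # Every positional read stands in for one GET request.
        self.read_calls += 1
        self._buf.seek(offset)
        return self._buf.read(nbytes)


def make_toy_file(metadata: bytes) -> bytes:
    # Toy layout: payload, metadata, metadata length, trailing magic.
    return b"data-pages" + metadata + struct.pack("<I", len(metadata)) + MAGIC


def read_metadata(reader: CountingReader) -> bytes:
    # First read: the 8-byte tail holding the metadata length and magic.
    tail = reader.read_at(reader.size - 8, 8)
    (meta_len,) = struct.unpack("<I", tail[:4])
    if tail[4:] != MAGIC:
        raise ValueError("bad magic")
    # Second read: the metadata itself.
    return reader.read_at(reader.size - 8 - meta_len, meta_len)


def test_metadata_io_count() -> None:
    reader = CountingReader(make_toy_file(b"schema-and-row-groups"))
    assert read_metadata(reader) == b"schema-and-row-groups"
    # Hardcoded expectation: a change that adds a read fails here.
    assert reader.read_calls == 2
```

A real test would wrap the actual reader and would probably count synchronous and asynchronous reads separately, but the hardcoded assertion is the part that catches regressions like this one.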

Also cc @pitrou for any more ideas…

But you can go ahead with a quick fix for this issue.