arrow: [Python] Writing to Cloudflare R2 fails for multipart upload

Describe the bug, including details regarding any error messages, version, and platform.

When I try to write a pyarrow.Table to the Cloudflare R2 object store, I get an error when files are larger than some threshold (I do not know the exact value) and pyarrow internally switches to multipart uploading.

I've used s3fs.S3FileSystem (fsspec) and also tried pyarrow.fs.S3FileSystem. Here is some example code:

import pyarrow.fs as pafs
import s3fs
import pyarrow.parquet as pq
import pyarrow as pa

table_small = pq.read_table("small_data.parquet")
table_large = pq.read_table("large_data.parquet")

fs1 = s3fs.S3FileSystem(
  key="some_key", 
  secret="some_secret", 
  client_kwargs=dict(endpoint_url="https://123456.r2.cloudflarestorage.com"), 
  s3_additional_kwargs=dict(ACL="private") # <- this is necessary for writing.
) 

fs2 = pafs.S3FileSystem(
  access_key="some_key", 
  secret_key="some_secret", 
  endpoint_override="https://123456.r2.cloudflarestorage.com"
)


pq.write_table(table_small, "test/test.parquet", filesystem=fs1) # <- works 
pq.write_table(table_small, "test/test.parquet", filesystem=fs2) # <- works 

#  failed with OSError: [Errno 22] There was a problem with the multipart upload. 
pq.write_table(table_large, "test/test.parquet", filesystem=fs1) 

# failed with OSError: When initiating multiple part upload for key 'test.parquet' in bucket 'test': AWS Error NETWORK_CONNECTION during CreateMultipartUpload operation: curlCode: 28, Timeout was reached
pq.write_table(table_large, "test/test.parquet", filesystem=fs2) 
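
For readers without the original parquet files, a table large enough to trigger the multipart path can be generated in memory. This is only a sketch; the exact threshold depends on the filesystem's internal part size:

import numpy as np
import pyarrow as pa

# Assumption: ~400 MB of poorly compressible float64 data is well above any
# multipart threshold used by s3fs or pyarrow.fs.S3FileSystem.
n_rows = 50_000_000
table_large = pa.table({"x": np.random.default_rng(0).random(n_rows)})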

Platform

Linux x86

Versions

pyarrow 11.0.0
s3fs 2023.1.0

Component(s)

Parquet, Python


Most upvoted comments

This seems like a legitimate request and pretty workable. We are pretty close already. The code in ObjectOutputStream is roughly…

if size(request) > part_limit:   # large write: upload it directly as its own part
  submit_request(request)
  return
buffer.append(request)           # small write: accumulate it in the buffer
if size(buffer) > part_limit:    # buffer has grown past the part limit: flush it as one part
  submit_request(buffer)
  buffer.reset()

Given we are already talking about cloud upload and I/O, I think we can just directly implement the equal-parts approach (instead of trying to maintain both) without too much of a performance hit (though there will be some, since this introduces a mandatory extra copy of the data in some cases). This would change the above logic to:

buffer.append(request)                                    # always accumulate the write
for chunk in slice_off_whole_chunks(buffer, part_limit):  # yields only full, equal-sized parts
  submit_request(chunk)
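
The slice_off_whole_chunks helper above is only named, not defined; a minimal Python sketch of what it might do, assuming the buffer is a mutable bytes-like accumulator, could look like this (hypothetical, not the actual Arrow C++ implementation):

def slice_off_whole_chunks(buffer: bytearray, part_limit: int):
  """Yield full part_limit-sized chunks, removing them from the front of the buffer."""
  while len(buffer) >= part_limit:
    chunk = bytes(buffer[:part_limit])  # the extra copy mentioned above
    del buffer[:part_limit]             # keep only the incomplete remainder buffered
    yield chunk

On close, the remaining partial buffer would presumably be uploaded as the final (possibly smaller) part.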

Does anyone want to create a PR?

I’m looking into this now.

(For context: I work at Cloudflare)

Here’s the latest:

  • R2 currently requires that all parts be of equal size, with a maximum of 10K parts, a per-part limit of 5 GB, and a total file size of up to 5 TB (a part-size sketch follows below).
  • DuckDB updated their custom http-fs implementation to work with this
  • We don’t yet have any concrete plans to relax this in the medium term: s3fs (or a client using it) would need to support working within R2’s requirements.

This doesn’t mean we’re not open to relaxing this, but it’s a non-trivial change.
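
As a rough illustration of those constraints (a hypothetical helper, not part of s3fs or pyarrow): a client uploading in equal-sized parts could pick the part size roughly like this, using the limits quoted above:

import math

MAX_PARTS = 10_000            # maximum number of parts, quoted above
MAX_PART_SIZE = 5 * 1024**3   # 5 GB per-part limit, quoted above
MIN_PART_SIZE = 5 * 1024**2   # 5 MB; assumed S3-style minimum part size

def choose_part_size(total_size: int) -> int:
  """Pick one equal part size that keeps the part count within R2's limits."""
  part_size = max(MIN_PART_SIZE, math.ceil(total_size / MAX_PARTS))
  if part_size > MAX_PART_SIZE:
    raise ValueError("file too large for a single multipart upload")
  return part_size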

I’d like to find out what causes this error. Is it possible to run pyarrow commands in a “debugging mode” to get more details?

Try running this before you do anything (before you import pyarrow.fs):

import pyarrow._s3fs
pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Trace)

You can also try log levels Debug, Info, Warn. I think it logs to stdout or stderr.
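
Putting that together with the report above (a sketch; credentials, endpoint, and bucket are the placeholders from the original code):

import pyarrow._s3fs
pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Trace)  # enable trace logging first

import pyarrow.fs as pafs
import pyarrow.parquet as pq

table_large = pq.read_table("large_data.parquet")

fs2 = pafs.S3FileSystem(
  access_key="some_key",
  secret_key="some_secret",
  endpoint_override="https://123456.r2.cloudflarestorage.com"
)

# The failing multipart write; the S3 trace output should show the
# CreateMultipartUpload / UploadPart requests and the resulting errors.
pq.write_table(table_large, "test/test.parquet", filesystem=fs2)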