arrow: [Python] Writing to Cloudflare R2 fails for multipart upload
Describe the bug, including details regarding any error messages, version, and platform.
When I try to write a pyarrow.Table to a Cloudflare R2 object store, I get an error once files are larger than x MB (I do not know the exact threshold) and pyarrow internally switches to multipart uploading.
I've tried both s3fs.S3FileSystem (fsspec) and pyarrow.fs.S3FileSystem. Here is some example code:
import pyarrow.fs as pafs
import s3fs
import pyarrow.parquet as pq
import pyarrow as pa
table_small = pq.read_table("small_data.parquet")
table_large = pq.read_table("large_data.parquet")
fs1 = s3fs.S3FileSystem(
    key="some_key",
    secret="some_secret",
    client_kwargs=dict(endpoint_url="https://123456.r2.cloudflarestorage.com"),
    s3_additional_kwargs=dict(ACL="private"),  # <- this is necessary for writing
)
fs2 = pafs.S3FileSystem(
    access_key="some_key",
    secret_key="some_secret",
    endpoint_override="https://123456.r2.cloudflarestorage.com",
)
pq.write_table(table_small, "test/test.parquet", filesystem=fs1) # <- works
pq.write_table(table_small, "test/test.parquet", filesystem=fs2) # <- works
# failed with OSError: [Errno 22] There was a problem with the multipart upload.
pq.write_table(table_large, "test/test.parquet", filesystem=fs1)
# failed with OSError: When initiating multiple part upload for key 'test.parquet' in bucket 'test': AWS Error NETWORK_CONNECTION during CreateMultipartUpload operation: curlCode: 28, Timeout was reached
pq.write_table(table_large, "test/test.parquet", filesystem=fs2)
Platform
Linux x86
Versions
pyarrow 11.0.0
s3fs 2023.1.0
Component(s)
Parquet, Python
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 17 (10 by maintainers)
This seems like a legitimate request and pretty workable. We are pretty close already. The code in ObjectOutputStream is roughly…
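(For readers without the C++ source at hand, here is a simplified Python paraphrase of that growing-part-size scheme; the constants are illustrative, not Arrow's exact values. The buffer is flushed as a part whenever it reaches a threshold, and the threshold grows over the course of the upload so that very large files stay under S3's 10,000-part limit. The unequal part sizes this produces appear to be what R2's multipart API rejects.)

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3's minimum multipart part size


def split_into_parts(total_size, min_part=MIN_PART_SIZE, growth_every=100):
    """Return the part sizes a growing-threshold scheme would upload.

    Every `growth_every` parts, the flush threshold is bumped up so the
    total part count stays bounded -- meaning early and late parts end up
    with different sizes.
    """
    parts = []
    threshold = min_part
    remaining = total_size
    while remaining > 0:
        size = min(threshold, remaining)
        parts.append(size)
        remaining -= size
        if len(parts) % growth_every == 0:
            threshold += min_part  # later parts get larger
    return parts
```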
Given that we are already talking about cloud upload and I/O, I think we can just directly implement the equal-parts approach (instead of trying to maintain both) without too much of a performance hit (though there will be some, since this introduces a mandatory extra copy of the data in some cases). This would change the above logic to:
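The equal-parts variant isn't spelled out here; one way it could look (a sketch under my own assumptions, not the merged implementation) is to fix a single part size up front from a target maximum part count, so that every part except possibly the last has the same size:

```python
import math

S3_MAX_PARTS = 10_000            # S3 multipart hard limit on part count
MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum part size (except the last part)


def equal_parts(total_size, part_size=None):
    """Split `total_size` into equal-size parts; only the last may differ.

    If no part size is given, pick the smallest legal size that keeps the
    upload under the part-count limit.
    """
    if part_size is None:
        part_size = max(MIN_PART_SIZE, math.ceil(total_size / S3_MAX_PARTS))
    n_full, tail = divmod(total_size, part_size)
    return [part_size] * n_full + ([tail] if tail else [])
```

Buffering to a fixed part size is what introduces the extra copy mentioned above: bytes must be staged until a full part has accumulated before it can be sent.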
Does anyone want to create a PR?
I’m looking into this now.
(For context: I work at Cloudflare)
Here’s the latest:
This doesn’t mean we’re not open to relaxing this, but it’s a non-trivial change.
Try running this before you do anything (before you import pyarrow.fs):
You can also try log levels Debug, Info, Warn. I think it logs to stdout or stderr.