arrow: [Python][GcsFileSystem][Parquet] fails to create ParquetFile from GCS after a few hundred files
Describe the bug, including details regarding any error messages, version, and platform.
I posted the question on SO: https://stackoverflow.com/questions/76012391/pyarrow-fails-to-create-parquetfile-from-blob-in-google-cloud-storage
My guess is that the issue lies either in GcsFileSystem or in its interaction with GCS. I don’t have a code snippet to reproduce it. For me it happens after looping through 300+ files, and once it starts, it seems to persist.
The gist of it is using biglist.ParquetFileReader.load_file:
- if lazy=False, it works fine.
- if lazy=True, after 300+ files, it starts to fail with:
File "/usr/local/lib/python3.10/dist-packages/pyarrow/parquet/core.py", line 319, in __init__
source = filesystem.open_input_file(source)
File "pyarrow/_fs.pyx", line 770, in pyarrow._fs.FileSystem.open_input_file
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 138, in pyarrow.lib.check_status
pyarrow.lib.ArrowException: Unknown error: google::cloud::Status(UNKNOWN: Permanent error GetObjectMetadata: WaitForHandles(): unexpected error code in curl_multi_*, [12]=Unrecoverable error in select/poll)
Component(s)
Parquet, Python
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 22 (12 by maintainers)
Opened https://github.com/apache/arrow/issues/35879
A new google-cloud-cpp version has been released with the fix: https://github.com/googleapis/google-cloud-cpp/releases/tag/v2.11.0