arrow: [Python][GcsFileSystem][Parquet] fails to create ParquetFile from GCS after a few hundred files

Describe the bug, including details regarding any error messages, version, and platform.

I posted the question on SO https://stackoverflow.com/questions/76012391/pyarrow-fails-to-create-parquetfile-from-blob-in-google-cloud-storage

My guess is that the issue lies in GcsFileSystem or in its interaction with GCS. I don’t have a code snippet that reproduces the issue; for me it happens after looping through 300+ files, and once it starts, it persists.

The gist of it is calling biglist.ParquetFileReader.load_file (a rough sketch of the access pattern is shown after the traceback below):

  • if lazy=False, it works fine.
  • if lazy=True, after 300+ files it starts to fail with:
    File "/usr/local/lib/python3.10/dist-packages/pyarrow/parquet/core.py", line 319, in __init__
      source = filesystem.open_input_file(source)
    File "pyarrow/_fs.pyx", line 770, in pyarrow._fs.FileSystem.open_input_file
    File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
    File "pyarrow/error.pxi", line 138, in pyarrow.lib.check_status
  pyarrow.lib.ArrowException: Unknown error: google::cloud::Status(UNKNOWN: Permanent error GetObjectMetadata: WaitForHandles(): unexpected error code in curl_multi_*, [12]=Unrecoverable error in select/poll)
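
For reference, here is a rough sketch of what the lazy access pattern looks like with plain pyarrow. The bucket and paths are hypothetical, and the assumption that lazy=True amounts to keeping a ParquetFile handle open per blob is mine, not confirmed from biglist internals:

    from pyarrow import fs
    import pyarrow.parquet as pq

    gcs = fs.GcsFileSystem()

    # Hypothetical blobs; the real job loops over several hundred Parquet files.
    paths = [f"my-bucket/data/part-{i:05d}.parquet" for i in range(400)]

    readers = []
    for path in paths:
        # Roughly what lazy=True does: open the blob and keep a ParquetFile
        # handle around instead of reading the table eagerly.
        handle = gcs.open_input_file(path)   # the call that raises after ~300 files
        readers.append(pq.ParquetFile(handle))

With lazy=False the table would instead be read and the handle released before moving on to the next file, which matches the observation that only the lazy path fails.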

Component(s)

Parquet, Python
