s3fs: Deadlock in the interaction between `pyarrow.filesystem.S3FSWrapper` and `s3fs.core.S3FileSystem`


What happened: Some interaction between s3fs, pyarrow, and petastorm causes a deadlock

What you expected to happen: s3fs to be thread-safe, given that pyarrow uses it from multiple threads

Minimal Complete Verifiable Example:

import pyarrow.parquet as pq
from petastorm.fs_utils import get_filesystem_and_path_or_paths, normalize_dir_url

dataset_url = 's3://<redacted>'

# Repeat basic steps that make_reader or make_batch_reader normally does
dataset_url = normalize_dir_url(dataset_url)
fs, path = get_filesystem_and_path_or_paths(dataset_url)

# Finished in seconds
dataset = pq.ParquetDataset(path, filesystem=fs, metadata_nthreads=1)
# Hung all night
dataset = pq.ParquetDataset(path, filesystem=fs, metadata_nthreads=10)

# pyarrow's wrapper ("their" code)
>>> type(fs)
<class 'pyarrow.filesystem.S3FSWrapper'>
# the underlying s3fs filesystem ("your" code)
>>> type(fs.fs)
<class 's3fs.core.S3FileSystem'>
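
To isolate the s3fs question, here is a minimal sketch (not part of the original report; the bucket prefix is a placeholder) that exercises s3fs.core.S3FileSystem from ten threads with no pyarrow or petastorm involved, roughly what ParquetDataset's metadata pool does:

from concurrent.futures import ThreadPoolExecutor

import s3fs

fs = s3fs.S3FileSystem()
dataset_prefix = 's3://my-bucket/dataset'  # hypothetical prefix, not the redacted URL

def read_header(key):
    # Open each object and read a few bytes, roughly one parquet
    # metadata read per file
    with fs.open(key, 'rb') as f:
        return f.read(8)

keys = fs.ls(dataset_prefix)
with ThreadPoolExecutor(max_workers=10) as pool:
    print(len(list(pool.map(read_header, keys))))

If this hangs with ten workers but not with one, the deadlock is in s3fs itself rather than in pyarrow's wrapper.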

Anything else we need to know?:

If s3fs is not thread-safe, that would appear to be news to pyarrow, which calls it from a thread pool whenever metadata_nthreads > 1. Also reported to Petastorm; will be reported to PyArrow.
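
Until this is resolved, a possible workaround sketch, assuming the deadlock only appears with the threaded metadata scan (the bucket path below is a placeholder): pass the s3fs filesystem to pyarrow directly and keep the scan single-threaded, which finished in seconds above.

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
# metadata_nthreads=1 keeps dataset discovery on one thread and avoids
# the concurrent s3fs calls that hang
dataset = pq.ParquetDataset('my-bucket/dataset', filesystem=fs,
                            metadata_nthreads=1)
table = dataset.read()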

Environment:

  • s3fs version: 0.4.2
  • Python version: 3.7.8
  • Operating System: Mac OS 10.15.6
  • Install method (conda, pip, source): pip install s3fs==0.4.2

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 22 (5 by maintainers)

Most upvoted comments

Reported as ARROW-10029.