s3fs: Deadlock in the interaction between `pyarrow.filesystem.S3FSWrapper` and `s3fs.core.S3FileSystem`
What happened: Some interaction between s3fs, pyarrow, and petastorm causes a deadlock.
What you expected to happen: s3fs to be thread-safe, given that pyarrow uses it from multiple threads.
Minimal Complete Verifiable Example:
import pyarrow.parquet as pq
from petastorm.fs_utils import get_filesystem_and_path_or_paths, normalize_dir_url
dataset_url = 's3://<redacted>'
# Repeat basic steps that make_reader or make_batch_reader normally does
dataset_url = normalize_dir_url(dataset_url)
fs, path = get_filesystem_and_path_or_paths(dataset_url)
# Finished in seconds
dataset = pq.ParquetDataset(path, filesystem=fs, metadata_nthreads=1)
# Hung all night
dataset = pq.ParquetDataset(path, filesystem=fs, metadata_nthreads=10)
# Their code (pyarrow's wrapper)
>>> type(fs)
<class 'pyarrow.filesystem.S3FSWrapper'>
# Your code (the underlying s3fs filesystem)
>>> type(fs.fs)
<class 's3fs.core.S3FileSystem'>
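For comparison, the hang should be reproducible without petastorm, since petastorm only constructs the wrapped filesystem. A minimal sketch, assuming ambient AWS credentials and a placeholder bucket path (untested here):

import s3fs
import pyarrow.parquet as pq

# Passing a raw s3fs filesystem to ParquetDataset exercises the same
# code path: pyarrow wraps it in pyarrow.filesystem.S3FSWrapper before
# spawning metadata_nthreads threads to read footer metadata.
fs = s3fs.S3FileSystem()
dataset = pq.ParquetDataset('bucket/path/to/dataset', filesystem=fs,
                            metadata_nthreads=10)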
Anything else we need to know?:
If your code is not thread-safe, that would appear to be news to pyarrow. Also reported to Petastorm; will be reported to PyArrow.
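For anyone trying to pin down where the threads block, one option is to dump all thread stacks while the process is hung. A sketch using only the standard library (the SIGUSR1 choice is arbitrary; Unix only):

import faulthandler
import signal

# Before triggering the hang: dump every thread's traceback to stderr
# whenever the process receives SIGUSR1 (e.g. `kill -USR1 <pid>`),
# which shows which lock each metadata thread is waiting on.
faulthandler.register(signal.SIGUSR1, all_threads=True)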
Environment:
- Dask version: 0.4.2
- Python version: 3.7.8
- Operating System: macOS 10.15.6
- Install method (conda, pip, source): pip install s3fs==0.4.2
Reported as ARROW-10029.