gcsfs: gcsfs==0.6.1|0.6.2 'walk()' method breaking dask dataframe
What happened:
While walking the root of a parquet folder initially created by pyspark, the fs.walk method returns an empty string '' in the files list.
('path/to/parquet/folder',
['Year=2019', 'Year=2020'],
['', '_SUCCESS'])
This behavior breaks dask.dataframe.read_parquet('gs://...') in several cases (let me know if you want the error messages), which is how I tracked the problem down to fs.walk.
What you expected to happen:
The correct output should be
('path/to/parquet/folder',
['Year=2019', 'Year=2020'],
['_SUCCESS'])
Minimal Complete Verifiable Example:
import gcsfs
next(gcsfs.GCSFileSystem().walk('gs://path/to/parquet/folder/'))
Anything else we need to know?:
Reverting to gcsfs==0.6.0 seems to solve the problem. As far as I have tested, it occurs with versions 0.6.1 and 0.6.2.
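For code that calls walk directly, a possible stopgap is to drop the empty names from each result. The sketch below is only an illustration: drop_empty_names is a hypothetical helper, not part of gcsfs or dask, and it assumes the stray '' entry is the only defect in the output. It is demonstrated on the tuple shown above rather than on a live bucket.

```python
def drop_empty_names(walk_results):
    """Yield (root, dirs, files) triples with empty-string names removed.

    Hypothetical helper for illustration; not part of gcsfs or dask.
    """
    for root, dirs, files in walk_results:
        yield root, [d for d in dirs if d], [f for f in files if f]


# Sample triple copied from the output shown in this report.
sample = [("path/to/parquet/folder", ["Year=2019", "Year=2020"], ["", "_SUCCESS"])]
cleaned = list(drop_empty_names(sample))
```

With a real filesystem object this would be used as drop_empty_names(fs.walk('gs://...')). It does not help dask.dataframe.read_parquet, which does its own file listing internally, so pinning gcsfs==0.6.0 remains the workaround there.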
Environment:
- Dask version: 2.21.0
- GCSFS version: 0.6.2
- Python version: 3.7.6
- Operating System: Ubuntu 18.04
- Install method (conda, pip, source): pip
About this issue
- State: closed
- Created 4 years ago
- Comments: 16 (9 by maintainers)
Can you check on gcsfs master, please?