gcsfs: gcsfs==0.6.1|0.6.2 'walk()' method breaking dask dataframe

What happened: While walking the root of a parquet folder initially created by pyspark, the fs.walk method returns an empty string '' in the files list.

('path/to/parquet/folder',
 ['Year=2019', 'Year=2020'],
 ['', '_SUCCESS'])

This behavior breaks dask.dataframe.read_parquet('gs://...') in several cases (let me know if you want the error output). Those failures are what led me to track the problem down to fs.walk.

What you expected to happen:

The correct output should be:

('path/to/parquet/folder',
 ['Year=2019', 'Year=2020'],
 ['_SUCCESS'])

Minimal Complete Verifiable Example:

import gcsfs
next(gcsfs.GCSFileSystem().walk('gs://path/to/parquet/folder/'))
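
To make the check explicit, the walk output can be scanned for empty names. This is only a minimal sketch: `empty_file_entries` is a hypothetical helper, not part of gcsfs, and it assumes nothing beyond walk yielding (root, dirs, files) tuples in the style of os.walk. The sample data is the output reported above rather than a live bucket.

```python
def empty_file_entries(walk_iter):
    """Return the roots whose file list contains an empty name.

    `walk_iter` is any iterable of (root, dirs, files) tuples, such as the
    result of fs.walk(...). Hypothetical helper, not part of gcsfs.
    """
    return [root for root, _dirs, files in walk_iter if "" in files]


# Output copied from the report above, used here instead of a live bucket.
reported = [
    ("path/to/parquet/folder", ["Year=2019", "Year=2020"], ["", "_SUCCESS"]),
]
print(empty_file_entries(reported))
```

Against a real bucket, the same helper could be called as empty_file_entries(gcsfs.GCSFileSystem().walk('gs://path/to/parquet/folder/')). A non-empty result would indicate the behavior described here.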

Anything else we need to know?:

Reverting to gcsfs==0.6.0 seemed to solve the problem. As far as I have tested, it occurs with versions 0.6.1 and 0.6.2.
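
Where pinning to 0.6.0 is not an option, one possible stopgap is to drop the empty names before using the walk results. The sketch below assumes walk yields (root, dirs, files) tuples. `walk_without_empty` is a hypothetical wrapper, not a gcsfs API, and it does not change what dask.dataframe.read_parquet sees internally, so it only helps code that calls walk directly.

```python
def walk_without_empty(walk_iter):
    """Yield (root, dirs, files) tuples with empty file names removed.

    Hypothetical wrapper around the result of fs.walk(...); not part of gcsfs.
    """
    for root, dirs, files in walk_iter:
        yield root, dirs, [name for name in files if name != ""]


# Intended use (requires GCS access, so it is left commented out here):
# import gcsfs
# fs = gcsfs.GCSFileSystem()
# for root, dirs, files in walk_without_empty(fs.walk('gs://path/to/parquet/folder/')):
#     ...
```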

Environment:

  • Dask version: 2.21.0
  • GCSFS version: 0.6.2
  • Python version: 3.7.6
  • Operating System: Ubuntu 18.04
  • Install method (conda, pip, source): pip

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 16 (9 by maintainers)

Most upvoted comments

Can you check on gcsfs master, please?