dask: Error importing Parquet from HDFS using PyArrow

Hi,

I get the following error when attempting to read a Parquet file stored on HDFS:

from dask import dataframe

df = dataframe.read_parquet('hdfs://hdfsnn:8020/user/hdfs/ndsparq/Test.parquet', engine='arrow')
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-3-22d8573a1d6f> in <module>()
----> 1 df = dataframe.read_parquet('hdfs://hdfsnn:8020/user/hdfs/ndsparq/Test.parquet', engine='arrow')

/opt/anaconda3/envs/env0/lib/python3.6/site-packages/dask/dataframe/io/parquet.py in read_parquet(path, columns, filters, categories, index, storage_options, engine)
    293         return _read_pyarrow(fs, paths, file_opener, columns=columns,
    294                              filters=filters,
--> 295                              categories=categories, index=index)
    296 
    297 

/opt/anaconda3/envs/env0/lib/python3.6/site-packages/dask/dataframe/io/parquet.py in _read_pyarrow(fs, paths, file_opener, columns, filters, categories, index)
    164         columns = list(columns)
    165 
--> 166     dataset = api.ParquetDataset(paths)
    167     schema = dataset.schema.to_arrow_schema()
    168     task_name = 'read-parquet-' + tokenize(dataset, columns)

/opt/anaconda3/envs/env0/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema)
    619 
    620         (self.pieces, self.partitions,
--> 621          self.metadata_path) = _make_manifest(path_or_paths, self.fs)
    622 
    623         if self.metadata_path is not None:

/opt/anaconda3/envs/env0/lib/python3.6/site-packages/pyarrow/parquet.py in _make_manifest(path_or_paths, fs, pathsep)
    779             if not fs.isfile(path):
    780                 raise IOError('Passed non-file path: {0}'
--> 781                               .format(path))
    782             piece = ParquetDatasetPiece(path)
    783             pieces.append(piece)

OSError: Passed non-file path: hdfs://hdfsnn:8020/user/hdfs/ndsparq/Test.parquet

The same Parquet file can be read directly with PyArrow without issue:

import pyarrow as pa
import pyarrow.parquet as pq

hdfs = pa.hdfs.connect('hdfsnn', 8020)  # same NameNode as in the URI above
with hdfs.open('/user/hdfs/ndsparq/Test.parquet', 'rb') as f:
    table = pq.read_table(f)
df = table.to_pandas()

Are you able to help? It looks like the filesystem object is never passed to api.ParquetDataset in _read_pyarrow(), so pyarrow falls back to treating the hdfs:// URI as a local path.
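
For reference, the ParquetDataset constructor shown in the traceback does accept a filesystem argument, so I would have expected something along these lines in _read_pyarrow (just a sketch; I'm not sure the fs object dask builds is directly compatible with what pyarrow expects here):

# sketch of the suspected fix in dask/dataframe/io/parquet.py, _read_pyarrow();
# 'fs' is the filesystem object dask has already resolved from the hdfs:// URI
dataset = api.ParquetDataset(paths, filesystem=fs)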

Thanks very much!

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 27 (20 by maintainers)

Most upvoted comments

We could create a pure Python library to define an abstract public API for these sorts of things. Then we wouldn't be relying on duck typing.
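
For illustration only, a minimal sketch of what such an abstract filesystem API might look like (all names hypothetical, not an existing package):

from abc import ABC, abstractmethod

class AbstractFileSystem(ABC):
    """Hypothetical base class a dask/pyarrow-compatible filesystem would implement."""

    @abstractmethod
    def open(self, path, mode='rb'):
        """Return a file-like object for path."""

    @abstractmethod
    def isfile(self, path):
        """Return True if path refers to a single file (the check that fails in the traceback above)."""

    @abstractmethod
    def ls(self, path):
        """List the entries under a directory path."""

Libraries like dask and pyarrow could then target this interface explicitly instead of probing for methods at runtime.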