dask: Error importing Parquet from HDFS using PyArrow
Hi,
I get the following error when attempting to read a Parquet file stored on HDFS:
df = dataframe.read_parquet('hdfs://hdfsnn:8020/user/hdfs/ndsparq/Test.parquet', engine='arrow')
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-3-22d8573a1d6f> in <module>()
----> 1 df = dataframe.read_parquet('hdfs://hdfsnn:8020/user/hdfs/ndsparq/Test.parquet', engine='arrow')
/opt/anaconda3/envs/env0/lib/python3.6/site-packages/dask/dataframe/io/parquet.py in read_parquet(path, columns, filters, categories, index, storage_options, engine)
293 return _read_pyarrow(fs, paths, file_opener, columns=columns,
294 filters=filters,
--> 295 categories=categories, index=index)
296
297
/opt/anaconda3/envs/env0/lib/python3.6/site-packages/dask/dataframe/io/parquet.py in _read_pyarrow(fs, paths, file_opener, columns, filters, categories, index)
164 columns = list(columns)
165
--> 166 dataset = api.ParquetDataset(paths)
167 schema = dataset.schema.to_arrow_schema()
168 task_name = 'read-parquet-' + tokenize(dataset, columns)
/opt/anaconda3/envs/env0/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema)
619
620 (self.pieces, self.partitions,
--> 621 self.metadata_path) = _make_manifest(path_or_paths, self.fs)
622
623 if self.metadata_path is not None:
/opt/anaconda3/envs/env0/lib/python3.6/site-packages/pyarrow/parquet.py in _make_manifest(path_or_paths, fs, pathsep)
779 if not fs.isfile(path):
780 raise IOError('Passed non-file path: {0}'
--> 781 .format(path))
782 piece = ParquetDatasetPiece(path)
783 pieces.append(piece)
OSError: Passed non-file path: hdfs://hdfsnn:8020/user/hdfs/ndsparq/Test.parquet
The same Parquet file can be read using PyArrow without issue:
with hdfs.open('/user/hdfs/ndsparq/Test.parquet', 'rb') as f:
    table = pq.read_table(f)
df = table.to_pandas()
Are you able to help? It looks like the filesystem is not passed to api.ParquetDataset in _read_pyarrow().
Thanks very much!
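The suspicion above can be illustrated without an HDFS cluster. The following is a minimal sketch, not pyarrow's actual code: `make_manifest`, `LocalFS`, and `FakeHDFS` are hypothetical stand-ins that imitate the `fs.isfile(path)` check visible in the traceback, and the fallback to a local filesystem when `fs` is omitted is an assumption. Under that assumption, the path check runs against the local disk and produces the same "Passed non-file path" error.

```python
import os


class LocalFS:
    """Hypothetical stand-in for a default local filesystem."""

    def isfile(self, path):
        return os.path.isfile(path)


class FakeHDFS:
    """Hypothetical stand-in for a remote filesystem client."""

    def __init__(self, files):
        self._files = set(files)

    def isfile(self, path):
        return path in self._files


def make_manifest(paths, fs=None):
    """Simplified imitation of the check shown in the traceback."""
    # Assumed behaviour: fall back to the local filesystem if none is given.
    fs = fs if fs is not None else LocalFS()
    for path in paths:
        if not fs.isfile(path):
            raise OSError('Passed non-file path: {0}'.format(path))
    return list(paths)


path = 'hdfs://hdfsnn:8020/user/hdfs/ndsparq/Test.parquet'
remote = FakeHDFS([path])

# Passing the filesystem explicitly lets the check succeed.
pieces = make_manifest([path], fs=remote)

# Omitting it checks the local disk, where the HDFS URL is not a file.
try:
    make_manifest([path])
    error = None
except OSError as exc:
    error = str(exc)
```

If this is indeed the cause, the fix in `_read_pyarrow()` would presumably be to forward the filesystem object through the `filesystem` argument that appears in the `ParquetDataset.__init__` signature in the traceback.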
About this issue
- State: closed
- Created 7 years ago
- Comments: 27 (20 by maintainers)
We could create a pure-Python library that defines an abstract public API for these sorts of things. Then we wouldn't be relying on duck typing.
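As a rough illustration of that idea, here is a minimal sketch built on the standard-library `abc` module. The names `BaseFileSystem`, `InMemoryFS`, and `Incomplete`, and the choice of `isfile` and `open` as the required methods, are hypothetical and chosen only for this example; a real interface would need to be agreed on by the projects involved.

```python
import io
from abc import ABC, abstractmethod


class BaseFileSystem(ABC):
    """Hypothetical explicit interface that filesystem backends implement."""

    @abstractmethod
    def isfile(self, path):
        """Return True if path refers to an existing file."""

    @abstractmethod
    def open(self, path, mode='rb'):
        """Return a file-like object for path."""


class InMemoryFS(BaseFileSystem):
    """Toy backend used only to exercise the interface."""

    def __init__(self, files):
        self._files = dict(files)  # maps path -> bytes

    def isfile(self, path):
        return path in self._files

    def open(self, path, mode='rb'):
        return io.BytesIO(self._files[path])


class Incomplete(BaseFileSystem):
    """Backend that forgets to implement open()."""

    def isfile(self, path):
        return False


fs = InMemoryFS({'/data/Test.parquet': b'PAR1'})
with fs.open('/data/Test.parquet') as f:
    header = f.read()

# A backend missing a required method fails at construction time,
# rather than later with an AttributeError deep inside a reader.
try:
    Incomplete()
    rejected = False
except TypeError:
    rejected = True
```

With an explicit base class, a library could check `isinstance(fs, BaseFileSystem)` up front instead of discovering a missing method partway through a read.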