cudf: [BUG] Read_parquet not working with spark timestamp type in parquet file

Describe the bug
Reading a Parquet file that contains a Spark timestamp column results in an error.

Steps/Code to reproduce bug

import cudf
cudf.io.read_parquet('/path/to/file.snappy.parquet')

Output:

RuntimeError                              Traceback (most recent call last)
/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.7.0.dev0+834.g3985c79.dirty-py3.7-linux-x86_64.egg/cudf/io/parquet.py in read_parquet(path, engine, columns, *args, **kwargs)
     17         df = cpp_read_parquet(
     18             path,
---> 19             columns
     20         )
     21     else:

cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()

cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()

RuntimeError: Invalid gdf_dtype in type_dispatcher

Also, when trying to read a nonexistent file:

import cudf
cudf.io.read_parquet('/path/to/nonexistent/file')

Output:

/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.7.0.dev0+834.g3985c79.dirty-py3.7-linux-x86_64.egg/cudf/io/parquet.py in read_parquet(path, engine, columns, *args, **kwargs)
     17         df = cpp_read_parquet(
     18             path,
---> 19             columns
     20         )
     21     else:

cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()

cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()

NameError: name 'errno' is not defined

Expected behavior
The file gets read if it is valid. If the filename is invalid, the proper error gets propagated.

Environment details (please complete the following information):

  • Method of cuDF install: From source

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

Ok, PR #1532 should resolve this. It’s converted to 1ms DATE64 type, to match pandas behavior.

Yeah. PyArrow also had to add a special option, use_deprecated_int96_timestamps, to write Spark-compatible Parquet files; it's enabled automatically when writing with the spark flavor.

I think in the Parquet file it is stored simply as the INT96 physical type, with no logical type. Since cuDF doesn't have an INT96 or larger dtype, we can either always translate to a timestamp, or only do so when the Spark/pandas key-value metadata in the file indicates that the column type is a timestamp.

@j-ieong The data I have, when read by pandas, consists of strings, datetime64[ns], and bools. I'm working on reproducing it with dummy data and will share an example soon.