cudf: [BUG] Read_parquet not working with spark timestamp type in parquet file
Describe the bug Reading a parquet file results in error.
Steps/Code to reproduce bug
import cudf
cudf.io.read_parquet('/path/to/file.snappy.parquet')
Output:
RuntimeError Traceback (most recent call last)
/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.7.0.dev0+834.g3985c79.dirty-py3.7-linux-x86_64.egg/cudf/io/parquet.py in read_parquet(path, engine, columns, *args, **kwargs)
17 df = cpp_read_parquet(
18 path,
---> 19 columns
20 )
21 else:
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
RuntimeError: Invalid gdf_dtype in type_dispatcher
Also, when trying to read a non-existent file:
import cudf
cudf.io.read_parquet('/path/to/nonexistent/file')
Output:
/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.7.0.dev0+834.g3985c79.dirty-py3.7-linux-x86_64.egg/cudf/io/parquet.py in read_parquet(path, engine, columns, *args, **kwargs)
17 df = cpp_read_parquet(
18 path,
---> 19 columns
20 )
21 else:
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
cudf/bindings/parquet.pyx in cudf.bindings.parquet.cpp_read_parquet()
NameError: name 'errno' is not defined
Expected behavior
The file gets read if valid. If the filename is invalid, the proper error is propagated.
Environment details (please complete the following information):
- Method of cuDF install: From source
About this issue
- State: closed
- Created 5 years ago
- Comments: 15 (15 by maintainers)
OK, PR #1532 should resolve this. The timestamp is converted to the 1 ms DATE64 type, to match pandas behavior.
Yeah. PyArrow also had to add a special option, use_deprecated_int96_timestamps, to write Spark-compatible Parquet files; it is enabled automatically when writing with the spark flavor.
I think in the Parquet file it should be stored as simply the INT96 physical type, with no logical type. Since cuDF doesn't have an INT96 or larger dtype, we can either always translate to timestamp, or only do so when the Spark/pandas key-value metadata in the file indicates the column type is timestamp.
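As a sketch of that translation: an INT96 timestamp stores an 8-byte nanoseconds-of-day value followed by a 4-byte Julian day number, both little-endian. Converting one to milliseconds since the Unix epoch (as the DATE64 path would) looks roughly like this; the function name is illustrative, not cuDF's actual API:

```python
import struct

JULIAN_EPOCH_DAY = 2440588  # Julian day number of 1970-01-01

def int96_to_epoch_ms(raw: bytes) -> int:
    # INT96 timestamp layout (little-endian): 8-byte nanoseconds-of-day
    # followed by a 4-byte Julian day number.
    nanos, julian_day = struct.unpack("<qi", raw)
    return (julian_day - JULIAN_EPOCH_DAY) * 86_400_000 + nanos // 1_000_000

# 1970-01-02 01:00:00 UTC: Julian day 2440589, one hour of nanoseconds
raw = struct.pack("<qi", 3_600_000_000_000, 2440589)
print(int96_to_epoch_ms(raw))  # 90000000
```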
@j-ieong The data I have, when read by pandas, consists of strings, datetime64[ns], and bools. I'm working on reproducing it with dummy data and will share an example soon.