cudf: [BUG] ORC writer produces wrong timestamp statistics, which causes Spark not to do predicate push down
Describe the bug PR #13848 added minimum/maximum and minimumNanos/maximumNanos to the ORC writer's timestamp statistics. It was intended to fix #13899, where Spark does not do predicate push down on GPU-generated timestamp files. However, the predicate push down test still fails after the above PR was merged, see https://github.com/NVIDIA/spark-rapids/issues/9075.
When inspecting the metadata of the affected files with orc-tools, it throws Exception in thread "main" java.lang.IllegalArgumentException: nanos > 999999999 or < 0. The min/max values also do not match those of a CPU-generated file with the same data. I think this causes Spark to fail to do the pushdown.
Steps/Code to reproduce bug
spark-shell with spark-rapids:
scala> import java.sql.{Date, Timestamp}
import java.sql.{Date, Timestamp}
scala> val timeString = "2015-08-20 14:57:00"
timeString: String = 2015-08-20 14:57:00
scala> val data = (0 until 10).map { i =>
| val milliseconds = Timestamp.valueOf(timeString).getTime + i * 3600
| Tuple1(new Timestamp(milliseconds))
| }
data: scala.collection.immutable.IndexedSeq[(java.sql.Timestamp,)] = Vector((2015-08-20 14:57:00.0,), (2015-08-20 14:57:03.6,), (2015-08-20 14:57:07.2,), (2015-08-20 14:57:10.8,), (2015-08-20 14:57:14.4,), (2015-08-20 14:57:18.0,), (2015-08-20 14:57:21.6,), (2015-08-20 14:57:25.2,), (2015-08-20 14:57:28.8,), (2015-08-20 14:57:32.4,))
scala> val df = spark.createDataFrame(data).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: timestamp]
scala> df.write.orc("ORC_PPD_GPU")
orc-tools:
java -jar orc-tools-1.9.1-uber.jar meta ORC_PPD_GPU/
[main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file file:/home/haoyangl/ORC_PPD_GPU/part-00007-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc [length: 304]
Structure for file:/home/haoyangl/ORC_PPD_GPU/part-00007-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_GPU/part-00007-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 2
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>
Stripe Statistics:
Stripe 1:
Column 0: count: 2 hasNull: true
Column 1: count: 2 hasNull: false min: 2015-08-20 14:57:28.799999999 max: 2015-08-20 14:57:32.399999999
File Statistics:
Column 0: count: 2 hasNull: true
Column 1: count: 2 hasNull: false min: 2015-08-20 14:57:28.799999999 max: 2015-08-20 14:57:32.399999999
Stripes:
Stripe: offset: 3 data: 25 rows: 2 tail: 56 index: 64
Stream: column 0 section ROW_INDEX start: 3 length 7
Stream: column 1 section ROW_INDEX start: 10 length 57
Stream: column 1 section PRESENT start: 67 length 5
Stream: column 1 section DATA start: 72 length 13
Stream: column 1 section SECONDARY start: 85 length 7
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 304 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________
Processing data file file:/home/haoyangl/ORC_PPD_GPU/part-00005-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc [length: 300]
Structure for file:/home/haoyangl/ORC_PPD_GPU/part-00005-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_GPU/part-00005-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: true
Column 1: count: 1 hasNull: false min: 2015-08-20 14:57:21.599999999 max: 2015-08-20 14:57:21.599999999
File Statistics:
Column 0: count: 1 hasNull: true
Column 1: count: 1 hasNull: false min: 2015-08-20 14:57:21.599999999 max: 2015-08-20 14:57:21.599999999
Stripes:
Stripe: offset: 3 data: 21 rows: 1 tail: 56 index: 64
Stream: column 0 section ROW_INDEX start: 3 length 7
Stream: column 1 section ROW_INDEX start: 10 length 57
Stream: column 1 section PRESENT start: 67 length 5
Stream: column 1 section DATA start: 72 length 10
Stream: column 1 section SECONDARY start: 82 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 300 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________
Processing data file file:/home/haoyangl/ORC_PPD_GPU/part-00004-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc [length: 300]
Structure for file:/home/haoyangl/ORC_PPD_GPU/part-00004-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_GPU/part-00004-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: true
Exception in thread "main" java.lang.IllegalArgumentException: nanos > 999999999 or < 0
at java.sql/java.sql.Timestamp.setNanos(Timestamp.java:336)
at org.apache.orc.impl.ColumnStatisticsImpl$TimestampStatisticsImpl.getMinimum(ColumnStatisticsImpl.java:1764)
at org.apache.orc.impl.ColumnStatisticsImpl$TimestampStatisticsImpl.toString(ColumnStatisticsImpl.java:1808)
at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:363)
at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)
at org.apache.orc.tools.FileDump.main(FileDump.java:137)
at org.apache.orc.tools.Driver.main(Driver.java:124)
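The exception above comes from java.sql.Timestamp.setNanos, which only accepts values in the range [0, 999999999]. A minimal sketch reproducing the same error message (the class name is mine, for illustration):

```java
import java.sql.Timestamp;

public class SetNanosRange {
    public static void main(String[] args) {
        Timestamp ts = Timestamp.valueOf("2015-08-20 14:57:00");

        // The upper bound of the valid range is accepted.
        ts.setNanos(999999999);
        System.out.println(ts); // 2015-08-20 14:57:00.999999999

        // Anything outside [0, 999999999] throws IllegalArgumentException,
        // which is exactly what orc-tools hits while decoding the
        // timestamp statistics of the GPU-written file.
        try {
            ts.setNanos(-1);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // nanos > 999999999 or < 0
        }
    }
}
```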
Related test cases in spark-rapids: Support for pushing down filters for timestamp types
Expected behavior The statistics for ORC files should be correct, and Spark should be able to do predicate push down on GPU-generated ORC files.
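Predicate push down relies on the per-file and per-stripe min/max statistics shown above: a reader can skip any unit whose [min, max] range cannot contain the predicate value. A minimal sketch of that pruning decision (names are mine, not Spark's or ORC's API):

```java
public class StatSkip {
    // A file or stripe can be skipped for an equality predicate when the
    // predicate value lies outside the recorded [min, max] range.
    static boolean canSkip(long min, long max, long value) {
        return value < min || value > max;
    }

    public static void main(String[] args) {
        // With correct statistics, units that cannot match are pruned...
        System.out.println(canSkip(100, 200, 300)); // true
        // ...but with wrong min/max (as written by the buggy GPU writer),
        // this decision becomes unreliable, so pushdown cannot work.
        System.out.println(canSkip(100, 200, 150)); // false
    }
}
```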
Environment overview (please complete the following information)
- Environment location: Bare-metal
- Method of cuDF install: from source
About this issue
- State: closed
- Created 8 months ago
- Comments: 25 (25 by maintainers)
Commits related to this issue
- Fix and disable encoding for nanosecond statistics in ORC writer (#14367) Issue https://github.com/rapidsai/cudf/issues/14325 Use uint when reading/writing nano stats because nanoseconds have int3... — committed to rapidsai/cudf by vuule 8 months ago
- Include writer code and writerVersion in ORC files (#14458) Closes https://github.com/rapidsai/cudf/issues/14325 Changes some of the metadata written to ORC file: - Include the (cuDF) writer code... — committed to rapidsai/cudf by vuule 7 months ago
Finally! Thank you for running these tests again and again!
@thirtiseven would you mind running the tests again with the latest branch? I was working off of incorrect specs. Sorry to pull you into this so many times.
Found that the nanoseconds are encoded as value + 1; that's why the CPU reader complained about the range - zero would become -1. Pushed a fix for the off-by-one to https://github.com/rapidsai/cudf/pull/14367, @thirtiseven please verify whether it fixes the issue. Statistics should now be correct for any nanosecond value.
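To illustrate the off-by-one described here (my simplification for illustration, not cuDF's actual code): if the format expects the sub-second nanos to be stored as value + 1, a writer that stores the raw value makes the reader decode zero as -1, which then fails the setNanos range check:

```java
public class NanosOffByOne {
    // The convention the comment describes: stored = nanos + 1,
    // so the reader always decodes stored - 1.
    static int encode(int nanos)  { return nanos + 1; }
    static int decode(int stored) { return stored - 1; }

    public static void main(String[] args) {
        // Correct round trip: 0 nanos -> stored as 1 -> decoded back to 0.
        System.out.println(decode(encode(0))); // 0

        // A writer that skips the +1 stores 0 directly; the reader then
        // decodes -1, tripping java.sql.Timestamp.setNanos' range check
        // ("nanos > 999999999 or < 0").
        System.out.println(decode(0)); // -1
    }
}
```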
Hi @vuule , I can still reproduce this with the branch, for both orc-tools and Spark PPD.
~@thirtiseven do you have the CPU version of the invalid file? I'd like to debug the writer as it creates the invalid statistics.~ Disregard, I can read the GPU file and write it back out.
@sameerz ok, a sample file: ORC_PPD_FAILED_GPU.zip
run:
will get: