cudf: [BUG] ORC writer produces wrong timestamp statistics, which causes Spark not to do predicate push down
Describe the bug PR #13848 added minimum/maximum and minimumNanos/maximumNanos to the ORC writer's timestamp statistics. It was intended to fix #13899, where Spark does not do predicate push down on GPU-generated timestamp files. However, the predicate push down test still fails after the above PR was merged, see https://github.com/NVIDIA/spark-rapids/issues/9075.
When inspecting the metadata of the affected files with orc-tools, it throws Exception in thread "main" java.lang.IllegalArgumentException: nanos > 999999999 or < 0. The min/max values also do not match those of a CPU-generated file with the same data. I think this causes Spark to fail to do the pushdown.
Steps/Code to reproduce bug
spark-shell with spark-rapids:
scala> import java.sql.{Date, Timestamp}
import java.sql.{Date, Timestamp}
scala> val timeString = "2015-08-20 14:57:00"
timeString: String = 2015-08-20 14:57:00
scala> val data = (0 until 10).map { i =>
| val milliseconds = Timestamp.valueOf(timeString).getTime + i * 3600
| Tuple1(new Timestamp(milliseconds))
| }
data: scala.collection.immutable.IndexedSeq[(java.sql.Timestamp,)] = Vector((2015-08-20 14:57:00.0,), (2015-08-20 14:57:03.6,), (2015-08-20 14:57:07.2,), (2015-08-20 14:57:10.8,), (2015-08-20 14:57:14.4,), (2015-08-20 14:57:18.0,), (2015-08-20 14:57:21.6,), (2015-08-20 14:57:25.2,), (2015-08-20 14:57:28.8,), (2015-08-20 14:57:32.4,))
scala> val df = spark.createDataFrame(data).toDF("a")
df: org.apache.spark.sql.DataFrame = [a: timestamp]
scala> df.write.orc("ORC_PPD_GPU")
orc-tools:
java -jar orc-tools-1.9.1-uber.jar meta ORC_PPD_GPU/
[main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file file:/home/haoyangl/ORC_PPD_GPU/part-00007-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc [length: 304]
Structure for file:/home/haoyangl/ORC_PPD_GPU/part-00007-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_GPU/part-00007-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 2
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>
Stripe Statistics:
Stripe 1:
Column 0: count: 2 hasNull: true
Column 1: count: 2 hasNull: false min: 2015-08-20 14:57:28.799999999 max: 2015-08-20 14:57:32.399999999
File Statistics:
Column 0: count: 2 hasNull: true
Column 1: count: 2 hasNull: false min: 2015-08-20 14:57:28.799999999 max: 2015-08-20 14:57:32.399999999
Stripes:
Stripe: offset: 3 data: 25 rows: 2 tail: 56 index: 64
Stream: column 0 section ROW_INDEX start: 3 length 7
Stream: column 1 section ROW_INDEX start: 10 length 57
Stream: column 1 section PRESENT start: 67 length 5
Stream: column 1 section DATA start: 72 length 13
Stream: column 1 section SECONDARY start: 85 length 7
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 304 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________
Processing data file file:/home/haoyangl/ORC_PPD_GPU/part-00005-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc [length: 300]
Structure for file:/home/haoyangl/ORC_PPD_GPU/part-00005-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_GPU/part-00005-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: true
Column 1: count: 1 hasNull: false min: 2015-08-20 14:57:21.599999999 max: 2015-08-20 14:57:21.599999999
File Statistics:
Column 0: count: 1 hasNull: true
Column 1: count: 1 hasNull: false min: 2015-08-20 14:57:21.599999999 max: 2015-08-20 14:57:21.599999999
Stripes:
Stripe: offset: 3 data: 21 rows: 1 tail: 56 index: 64
Stream: column 0 section ROW_INDEX start: 3 length 7
Stream: column 1 section ROW_INDEX start: 10 length 57
Stream: column 1 section PRESENT start: 67 length 5
Stream: column 1 section DATA start: 72 length 10
Stream: column 1 section SECONDARY start: 82 length 6
Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
File length: 300 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________
Processing data file file:/home/haoyangl/ORC_PPD_GPU/part-00004-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc [length: 300]
Structure for file:/home/haoyangl/ORC_PPD_GPU/part-00004-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc
File Version: 0.12 with ORIGINAL by ORC Java
[main] INFO org.apache.orc.impl.ReaderImpl - Reading ORC rows from file:/home/haoyangl/ORC_PPD_GPU/part-00004-087a3f38-7d6f-4ddc-9cba-66fd28b24ae0-c000.snappy.orc with {include: null, offset: 0, length: 9223372036854775807, includeAcidColumns: true, allowSARGToFilter: false, useSelected: false}
[main] INFO org.apache.orc.impl.RecordReaderImpl - Reader schema not provided -- using file schema struct<a:timestamp>
Rows: 1
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<a:timestamp>
Stripe Statistics:
Stripe 1:
Column 0: count: 1 hasNull: true
Exception in thread "main" java.lang.IllegalArgumentException: nanos > 999999999 or < 0
at java.sql/java.sql.Timestamp.setNanos(Timestamp.java:336)
at org.apache.orc.impl.ColumnStatisticsImpl$TimestampStatisticsImpl.getMinimum(ColumnStatisticsImpl.java:1764)
at org.apache.orc.impl.ColumnStatisticsImpl$TimestampStatisticsImpl.toString(ColumnStatisticsImpl.java:1808)
at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:363)
at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)
at org.apache.orc.tools.FileDump.main(FileDump.java:137)
at org.apache.orc.tools.Driver.main(Driver.java:124)
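The exception above comes from java.sql.Timestamp.setNanos, which only accepts values in the range [0, 999999999]. A minimal sketch reproducing the same error message (the class name is mine, for illustration):

```java
import java.sql.Timestamp;

public class SetNanosRange {
    public static void main(String[] args) {
        Timestamp ts = Timestamp.valueOf("2015-08-20 14:57:00");

        // The upper bound of the valid range is accepted.
        ts.setNanos(999999999);
        System.out.println(ts); // 2015-08-20 14:57:00.999999999

        // Anything outside [0, 999999999] throws IllegalArgumentException,
        // which is exactly what orc-tools hits while decoding the
        // timestamp statistics of the GPU-written file.
        try {
            ts.setNanos(-1);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // nanos > 999999999 or < 0
        }
    }
}
```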
Related test cases in spark-rapids: Support for pushing down filters for timestamp types
Expected behavior The statistics for ORC files should be correct, and Spark should be able to do predicate push down on GPU-generated ORC files.
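Predicate push down relies on the per-file and per-stripe min/max statistics shown above: a reader can skip any unit whose [min, max] range cannot contain the predicate value. A minimal sketch of that pruning decision (names are mine, not Spark's or ORC's API):

```java
public class StatSkip {
    // A file or stripe can be skipped for an equality predicate when the
    // predicate value lies outside the recorded [min, max] range.
    static boolean canSkip(long min, long max, long value) {
        return value < min || value > max;
    }

    public static void main(String[] args) {
        // With correct statistics, units that cannot match are pruned...
        System.out.println(canSkip(100, 200, 300)); // true
        // ...but with wrong min/max (as written by the buggy GPU writer),
        // this decision becomes unreliable, so pushdown cannot work.
        System.out.println(canSkip(100, 200, 150)); // false
    }
}
```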
Environment overview (please complete the following information)
- Environment location: Bare-metal
- Method of cuDF install: from source
About this issue
- State: closed
- Created 8 months ago
- Comments: 25 (25 by maintainers)
Commits related to this issue
- Fix and disable encoding for nanosecond statistics in ORC writer (#14367) Issue https://github.com/rapidsai/cudf/issues/14325 Use uint when reading/writing nano stats because nanoseconds have int3... — committed to rapidsai/cudf by vuule 8 months ago
- Include writer code and writerVersion in ORC files (#14458) Closes https://github.com/rapidsai/cudf/issues/14325 Changes some of the metadata written to ORC file: - Include the (cuDF) writer code... — committed to rapidsai/cudf by vuule 7 months ago
Finally! Thank you for running these tests again and again!
@thirtiseven would you mind running the tests again with the latest branch? I was working off of incorrect specs. Sorry to pull you into this so many times.
Found that the nanoseconds are encoded as value + 1; that's why the CPU reader complained about the range - zero would become -1. Pushed a fix for the off-by-one to https://github.com/rapidsai/cudf/pull/14367, @thirtiseven please verify whether it fixes the issue. Statistics should now be correct for any nanosecond value.
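To illustrate the off-by-one described here (my simplification for illustration, not cuDF's actual code): if the format expects the sub-second nanos to be stored as value + 1, a writer that stores the raw value makes the reader decode zero as -1, which then fails the setNanos range check:

```java
public class NanosOffByOne {
    // The convention the comment describes: stored = nanos + 1,
    // so the reader always decodes stored - 1.
    static int encode(int nanos)  { return nanos + 1; }
    static int decode(int stored) { return stored - 1; }

    public static void main(String[] args) {
        // Correct round trip: 0 nanos -> stored as 1 -> decoded back to 0.
        System.out.println(decode(encode(0))); // 0

        // A writer that skips the +1 stores 0 directly; the reader then
        // decodes -1, tripping java.sql.Timestamp.setNanos' range check
        // ("nanos > 999999999 or < 0").
        System.out.println(decode(0)); // -1
    }
}
```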
Hi @vuule , I can still reproduce this with the branch, for both orc-tools and Spark PPD.
~@thirtiseven do you have the CPU version of the invalid file? I'd like to debug the writer as it creates the invalid statistics.~ Disregard, I can read the GPU file and write it back out.
@sameerz ok, a sample file: ORC_PPD_FAILED_GPU.zip
run:
will get: