hudi: [SUPPORT] Exception on snapshot query on MOR table (hudi 0.6.0)


Describe the problem you faced

A snapshot query on a merge-on-read (MOR) table fails with an ArrayIndexOutOfBoundsException from the Parquet dictionary decoder after the Spark job running compaction on the table was killed mid-run.

To Reproduce

Steps to reproduce the behavior:

  1. Have a table with ~100 GB of data while compaction is running (a sketch of such a write job follows these steps).
  2. Kill the Spark job mid-run.
  3. Try to read the data with a snapshot query:

    val df = spark.read.format("org.apache.hudi")
      .option("hoodie.datasource.query.type", "snapshot")
      .load("s3://path_to_data/*")

Expected behavior

The snapshot query returns the latest committed data from the MOR table without throwing an exception.

Environment Description

  • Hudi version : 0.6.0

  • Spark version : 2.4.4

  • Hive version : not used

  • Hadoop version : 3.2.1

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : no


Stacktrace

Exception: Task failed while writing rows.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:257)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 4191
	at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary.decodeToDouble(PlainValuesDictionary.java:208)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToDouble(ParquetDictionary.java:46)
	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getDouble(OnHeapColumnVector.java:460)
	at org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getDouble(MutableColumnarRow.java:126)
	at org.apache.spark.sql.execution.vectorized.MutableColumnarRow.get(MutableColumnarRow.java:178)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$2.$anonfun$createRowWithRequiredSchema$1(HoodieMergeOnReadRDD.scala:239)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$2.$anonfun$createRowWithRequiredSchema$1$adapted(HoodieMergeOnReadRDD.scala:237)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$2.createRowWithRequiredSchema(HoodieMergeOnReadRDD.scala:237)
	at org.apache.hudi.HoodieMergeOnReadRDD$$anon$2.hasNext(HoodieMergeOnReadRDD.scala:197)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:244)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:242)
	... 9 more

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 18 (17 by maintainers)

Most upvoted comments

I have the same sporadic issue using the standard Spark 2.4.7 distribution and Hudi 0.6:

$ ls -l /opt/spark-2.4.7-bin-without-hadoop/jars/parquet-*
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-column-1.10.1.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-common-1.10.1.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-encoding-1.10.1.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-format-2.4.0.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-hadoop-1.10.1.jar
/opt/spark-2.4.7-bin-without-hadoop/jars/parquet-jackson-1.10.1.jar

The only workaround we found is to disable the vectorized Parquet reader:

    rc.spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
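The same conf can also be applied when the session is built or at submit time (`rc` above appears to be the commenter's own wrapper around the SparkSession); a minimal equivalent sketch:

    // Equivalent ways to apply the workaround on a plain SparkSession
    // (the `rc` object in the comment above is the commenter's own wrapper).
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hudi-snapshot-read")                                 // hypothetical app name
      .config("spark.sql.parquet.enableVectorizedReader", "false")   // fall back to the row-based Parquet reader
      .getOrCreate()

    // Or at submit time:
    //   spark-submit --conf spark.sql.parquet.enableVectorizedReader=false ...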