cudf: [FEA] Improve exception message when unknown Parquet page encoding detected

Is your feature request related to a problem? Please describe.

A user of the RAPIDS Accelerator for Apache Spark reported the following exception:

ai.rapids.cudf.CudfException: CUDF failure at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-424-cuda11/thirdparty/cudf/cpp/src/io/parquet/reader_impl_preprocess.cu:346: Unsupported page encoding detected
	at ai.rapids.cudf.ParquetChunkedReader.readChunk(Native Method)
	at ai.rapids.cudf.ParquetChunkedReader.readChunk(ParquetChunkedReader.java:111)
	at com.nvidia.spark.rapids.ParquetTableReader.$anonfun$next$1(GpuParquetScan.scala:2639)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.ParquetTableReader.next(GpuParquetScan.scala:2638)
	at com.nvidia.spark.rapids.ParquetTableReader.next(GpuParquetScan.scala:2615)
	at com.nvidia.spark.rapids.CachedGpuBatchIterator$.$anonfun$apply$1(GpuDataProducer.scala:161)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.CachedGpuBatchIterator$.apply(GpuDataProducer.scala:158

The exception message from libcudf is not very helpful: it says that an unsupported page encoding was detected, but not which encoding it was (i.e., the enum value). Without this information, we’re left guessing what encoding was found in the file, and we usually have to ask users to share a sample file to find out. Not all users are willing to share sample files.

Describe the solution you’d like

Exception messages for an unexpected/unsupported value should show the value as part of the exception message.

About this issue

  • State: closed
  • Created 9 months ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

Ah, I see. I was fixated on the old CUDF_EXPECTS and std::all_of code (which is no longer there), so I only skimmed through the new code. But now I see that there is still work to be done. I apologize; I end up having to do a lot of context switching, so I missed this.

Thanks

None of the above are trivial, and many open issues are still open for a reason 😅 If you want to dig deeper into the code (but maybe not too deep), I can suggest #13837. You can also explore https://github.com/rapidsai/cudf/issues?page=1&q=is%3Aissue+is%3Aopen++label%3Acuio

Thank you so much! I assumed that the good first issue label was for easier work. In that case, I’ll do https://github.com/rapidsai/cudf/issues/14661 and depending on how that goes, hopefully I can pick a few more from the backlog. Thanks again 😃

I was talking about the std::all_of call 😉 That’s my old detection code from #12754.

This has not yet been fully addressed. Right now the error code for unsupported option is set during header parsing, and available on the host side in decode_page_headers. All that needs to be done is to test the returned error code for the correct bit, then do a transform_reduce on the page encoding field (after turning the encoding into a mask), and then check the set bits to find the offending encoding(s). Soon the only unsupported encoding will be byte_stream_split, so this hasn’t been a high priority for me. Good first issue.
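
The steps described above can be sketched on the host side roughly as follows. This is a minimal illustration, not libcudf code: `PageInfo`, `UNSUPPORTED_ENCODING_BIT`, `SUPPORTED_MASK`, and `unsupported_encoding_message` are hypothetical names, the bit position of the error flag is made up, and the sketch assumes that byte_stream_split is the only unsupported encoding. It uses the C++17 `std::transform_reduce` where the real reader would presumably run an equivalent reduction over device data. The enum values follow the Parquet format definition.

```cpp
#include <cassert>  // only needed for the asserts in the usage example
#include <cstdint>
#include <functional>
#include <numeric>
#include <string>
#include <vector>

// Encoding values as defined by the Parquet format (parquet.thrift).
enum class Encoding : std::uint8_t {
  PLAIN                   = 0,
  PLAIN_DICTIONARY        = 2,
  RLE                     = 3,
  BIT_PACKED              = 4,
  DELTA_BINARY_PACKED     = 5,
  DELTA_LENGTH_BYTE_ARRAY = 6,
  DELTA_BYTE_ARRAY        = 7,
  RLE_DICTIONARY          = 8,
  BYTE_STREAM_SPLIT       = 9,
};

// Hypothetical stand-in for the per-page metadata; only the encoding matters here.
struct PageInfo {
  Encoding encoding;
};

// Hypothetical error bit reported by header parsing (the position is an assumption).
constexpr std::uint32_t UNSUPPORTED_ENCODING_BIT = 1u << 2;

// Turn an encoding into a one-bit mask. Values that do not fit are folded
// into the top bit so that unknown encodings are still reported.
inline std::uint32_t encoding_to_mask(Encoding e)
{
  auto const v = static_cast<unsigned>(e);
  return v < 31 ? (1u << v) : (1u << 31);
}

// Assumption for this sketch: everything except BYTE_STREAM_SPLIT is supported.
inline std::uint32_t supported_mask()
{
  std::uint32_t mask = 0;
  for (auto e : {Encoding::PLAIN, Encoding::PLAIN_DICTIONARY, Encoding::RLE,
                 Encoding::BIT_PACKED, Encoding::DELTA_BINARY_PACKED,
                 Encoding::DELTA_LENGTH_BYTE_ARRAY, Encoding::DELTA_BYTE_ARRAY,
                 Encoding::RLE_DICTIONARY}) {
    mask |= encoding_to_mask(e);
  }
  return mask;
}

// Returns an empty string when the error bit is not set; otherwise a message
// listing the numeric value of every offending encoding.
inline std::string unsupported_encoding_message(std::uint32_t error_code,
                                                std::vector<PageInfo> const& pages)
{
  if ((error_code & UNSUPPORTED_ENCODING_BIT) == 0) { return ""; }

  // OR together the masks of all encodings seen across the pages.
  std::uint32_t const seen = std::transform_reduce(
    pages.begin(), pages.end(), std::uint32_t{0}, std::bit_or<std::uint32_t>{},
    [](PageInfo const& p) { return encoding_to_mask(p.encoding); });

  // Keep only the bits that correspond to unsupported encodings.
  std::uint32_t const bad = seen & ~supported_mask();

  std::string msg = "Unsupported page encoding detected:";
  for (unsigned bit = 0; bit < 32; ++bit) {
    if (bad & (1u << bit)) { msg += " " + std::to_string(bit); }
  }
  return msg;
}
```

For example, given pages encoded as `PLAIN`, `BYTE_STREAM_SPLIT`, and `RLE_DICTIONARY` and an error code with the hypothetical bit set, the function reports only the value 9. Reducing to a bit mask keeps the result at a single integer no matter how many pages there are, which is why the mask is built before the reduction.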