presto: Incorrect comparison of parquet binary statistics for accented characters
On version 0.216, Presto incorrectly assumes that a binary column statistic is corrupt because of how it orders accented values. The root cause is probably the naive comparison made by the slice library here: https://github.com/prestodb/presto/blob/master/presto-parquet/src/main/java/com/facebook/presto/parquet/predicate/TupleDomainParquetPredicate.java#L201
I have added a simple test case to TupleDomainParquetPredicateTest that should not fail:
@Test
public void testAccentedString() throws ParquetCorruptionException {
    String column = "StringColumn";
    assertEquals(
            getDomain(createUnboundedVarcharType(), 10, stringColumnStats("Áncash", "china"), ID, column, true),
            create(ValueSet.ofRanges(range(createUnboundedVarcharType(), utf8Slice("Áncash"), true, utf8Slice("china"), true)), false));
}
but it fails with
com.facebook.presto.parquet.ParquetCorruptionException: Corrupted statistics for column "StringColumn" in Parquet file "testFile": [min: Áncash, max: china, num_nulls: 0]
Áncash comes before China, but Presto flags the statistics as corrupt since it does not use natural ordering to sort binary statistics.
As additional information, the files that led me to this error were generated by Spark.
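To make the phrase "natural ordering" concrete, here is a minimal sketch in plain JDK Java that compares the two values from the report in two ways: by UTF-16 code unit (what String.compareTo does) and with a locale-aware java.text.Collator. The class name, the method names, and the choice of the Spanish locale are assumptions made for this illustration; none of this is Presto or Parquet code.

```java
import java.text.Collator;
import java.util.Locale;

public class AccentOrderingDemo {
    // Code-unit ordering: String.compareTo compares UTF-16 code units,
    // so 'Á' (U+00C1) is greater than 'C' (U+0043).
    public static int codeUnitOrder(String a, String b) {
        return a.compareTo(b);
    }

    // Locale-aware ordering. The Spanish locale is an assumption chosen for
    // this example; the issue does not say which collation the author expects.
    public static int collatorOrder(String a, String b) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("es"));
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        String min = "\u00C1ncash"; // "Áncash"
        String max = "China";
        System.out.println("code-unit sign: " + Integer.signum(codeUnitOrder(min, max)));
        System.out.println("collator sign:  " + Integer.signum(collatorOrder(min, max)));
    }
}
```

The code-unit comparison is positive ("Áncash" after "China"), while the collator is expected to treat Á as a variant of A and return a negative value ("Áncash" before "China"). Which of the two a Parquet writer or reader applies to binary statistics is a separate question that this sketch does not answer.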
About this issue
- State: open
- Created 5 years ago
- Reactions: 3
- Comments: 23 (8 by maintainers)
ping @findepi – can someone from the Presto team take a look at this?
We hit this problem as well, also with an irregular string as the minimum in the Parquet statistics.
If the reason for this is that the comparison strategies of Parquet and Presto are different, that sounds like pretty bad news to me.
Say we have strings A, B, C, and the order for Parquet is A, B, C but for Presto it is A, C, B. (I don’t mean to imply that Parquet’s order is correct while Presto’s isn’t; this is just an example.)
In the best case, say we have a Parquet page of Bs and Cs, the stats will show min=B and max=C and Presto will raise the corrupt stats exception.
In the worst case, say we have a Parquet page of As, Bs, and Cs, the stats will show min=A, max=C. Presto will not raise an exception as it also finds that A < C. But if Presto is looking for Bs, it will completely skip the page because it believes that B > C. No exception will be raised and Presto will return the wrong result.
Am I missing something?
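The A/B/C scenario above can be simulated with a short sketch. Everything here is hypothetical: the two comparators stand in for a writer and a reader that disagree on ordering, and the lookup method is a simplified model of min/max pruning, not Presto's actual predicate code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class MismatchedOrderingDemo {
    // Hypothetical writer ordering: A < B < C.
    static final Comparator<String> WRITER = Comparator.comparingInt(s -> "ABC".indexOf(s));
    // Hypothetical reader ordering: A < C < B.
    static final Comparator<String> READER = Comparator.comparingInt(s -> "ACB".indexOf(s));

    public enum Outcome { CORRUPT_STATS, PAGE_SKIPPED, PAGE_READ }

    // The writer computes min/max with its own ordering; the reader then
    // validates the stats and decides whether to read the page using its ordering.
    public static Outcome lookup(List<String> page, String wanted) {
        String min = Collections.min(page, WRITER);
        String max = Collections.max(page, WRITER);
        if (READER.compare(min, max) > 0) {
            return Outcome.CORRUPT_STATS; // reader sees min > max
        }
        boolean inRange = READER.compare(wanted, min) >= 0
                && READER.compare(wanted, max) <= 0;
        return inRange ? Outcome.PAGE_READ : Outcome.PAGE_SKIPPED;
    }

    public static void main(String[] args) {
        // Best case: stats are min=B, max=C, which the reader rejects as corrupt.
        System.out.println(lookup(Arrays.asList("B", "C"), "B"));
        // Worst case: stats are min=A, max=C, which look valid to the reader,
        // yet the page is skipped even though it contains B.
        System.out.println(lookup(Arrays.asList("A", "B", "C"), "B"));
    }
}
```

In this model the first call reports corrupt statistics and the second silently skips a page that holds the wanted value, which matches the two cases described above.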
Just to make sure you and others have a workaround, these stats checks can be turned off with the hive.parquet.fail-on-corrupted-statistics config property and the parquet_fail_with_corrupted_statistics session property.
@igorcalabria I took a quick look, and if I do "Áncash".compareTo("China") or SliceUtf8.compareUtf16BE(minSlice, maxSlice) > 0 I still get a positive number (so the test still fails), which means “Áncash” lexicographically follows “China”. Can you please double check whether your test case is valid?
I think we should replace minSlice.compareTo(maxSlice) > 0 with SliceUtf8.compareUtf16BE(minSlice, maxSlice) > 0 anyway. @zhenxiao what do you think?
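The thread does not establish which ordering the writer used when it produced min=Áncash. As one data point, the sketch below compares the UTF-8 bytes of the two strings both as unsigned and as signed values, using only the JDK. The class and method names are made up for this example, and the idea that the Spark-written files used a signed byte comparison is an assumption, not something confirmed here.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ByteOrderDemo {
    // Lexicographic comparison of the UTF-8 encodings of a and b.
    // With unsigned=true each byte is treated as 0..255; otherwise as -128..127.
    public static int compareBytes(String a, String b, boolean unsigned) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int left = unsigned ? (x[i] & 0xFF) : x[i];
            int right = unsigned ? (y[i] & 0xFF) : y[i];
            if (left != right) {
                return Integer.compare(left, right);
            }
        }
        // All shared bytes are equal, so the shorter array sorts first.
        return Integer.compare(x.length, y.length);
    }

    public static void main(String[] args) {
        String min = "\u00C1ncash"; // "Áncash"; 'Á' encodes as 0xC3 0x81 in UTF-8
        String max = "China";       // 'C' encodes as 0x43
        System.out.println("unsigned: " + compareBytes(min, max, true));
        System.out.println("signed:   " + compareBytes(min, max, false));
    }
}
```

The first byte decides both comparisons: 0xC3 is 195 when read as unsigned, which is greater than 0x43 (67), but -61 when read as signed, which is smaller. So "Áncash" sorts after "China" under unsigned bytes and before it under signed bytes, which would be consistent with statistics of min=Áncash, max=China if the writer compared signed bytes.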