presto: Incorrect comparison of Parquet binary statistics for accented characters

On version 0.216, Presto incorrectly treats the statistics of a binary column as corrupt because of how it orders accented values. The root cause is probably the naive comparison made by the slice library here: https://github.com/prestodb/presto/blob/master/presto-parquet/src/main/java/com/facebook/presto/parquet/predicate/TupleDomainParquetPredicate.java#L201

I added a simple test case to TupleDomainParquetPredicateTest that should pass:

    @Test
    public void testAccentedString() throws ParquetCorruptionException {
        String column = "StringColumn";

        assertEquals(getDomain(createUnboundedVarcharType(), 10, stringColumnStats("Áncash", "china"), ID, column,
                true), create(ValueSet.ofRanges(range(createUnboundedVarcharType(), utf8Slice("Áncash"), true, utf8Slice("china"), true)), false));
    }

but it fails with

com.facebook.presto.parquet.ParquetCorruptionException: Corrupted statistics for column "StringColumn" in Parquet file "testFile": [min: Áncash, max: china, num_nulls: 0]

In natural (alphabetical) ordering, Áncash comes before China, but Presto flags the statistics as corrupt because it does not use natural ordering when comparing binary statistics. For additional context, the files that led me to this error were generated by Spark.
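To make the ordering question concrete, here is a minimal sketch that compares the two values from the test case under four orderings, using only the JDK. The class and method names are hypothetical and none of this is Presto or Parquet code. Whether the writer of the files actually used any of these orderings is an assumption that the report does not confirm.

```java
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper class; illustrates orderings only, not Presto's implementation.
final class AccentedOrdering {
    // UTF-16 code unit order, as used by String.compareTo.
    static int compareUtf16(String a, String b) {
        return a.compareTo(b);
    }

    // Lexicographic order over UTF-8 bytes, treating each byte as unsigned (Java 9+).
    static int compareUtf8Unsigned(String a, String b) {
        return Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8));
    }

    // Lexicographic order over UTF-8 bytes, treating each byte as signed (Java 9+).
    static int compareUtf8Signed(String a, String b) {
        return Arrays.compare(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8));
    }

    // Linguistic ("natural") order from the JDK's root-locale collator.
    static int compareCollated(String a, String b) {
        return Collator.getInstance(Locale.ROOT).compare(a, b);
    }

    public static void main(String[] args) {
        String min = "\u00C1ncash"; // "Áncash", escaped to avoid source-encoding issues
        String max = "china";

        // 'Á' is U+00C1 (UTF-8 bytes 0xC3 0x81); 'c' is U+0063 (byte 0x63).
        System.out.println("UTF-16 code units:  " + Integer.signum(compareUtf16(min, max)));
        System.out.println("UTF-8 unsigned:     " + Integer.signum(compareUtf8Unsigned(min, max)));
        System.out.println("UTF-8 signed:       " + Integer.signum(compareUtf8Signed(min, max)));
        System.out.println("Collator (root):    " + Integer.signum(compareCollated(min, max)));
    }
}
```

The code-unit and unsigned-byte comparisons place "Áncash" after "china", while the signed-byte and collator comparisons place it before. A reader and a writer that pick different rows of this table will disagree about which value is the minimum.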

About this issue

State: open
Created 5 years ago
Reactions: 3
Comments: 23 (8 by maintainers)

Most upvoted comments

ping @findepi – can someone from the Presto team take a look at this?

thoralf-gutierrez on Sep 4, 2019

We hit this problem as well, also with an irregular string as the minimum in the Parquet statistics.

If the reason for this is that the comparison strategies of Parquet and Presto differ, that sounds like pretty bad news to me.

Say we have strings A, B, and C, where the order for Parquet is A, B, C but the order for Presto is A, C, B. (I don’t mean to imply that Parquet’s order is correct while Presto’s isn’t; this is just an example.)

In the best case, say we have a Parquet page of Bs and Cs, the stats will show min=B and max=C and Presto will raise the corrupt stats exception.

In the worst case, say we have a Parquet page of As, Bs, and Cs, the stats will show min=A, max=C. Presto will not raise an exception as it also finds that A < C. But if Presto is looking for Bs, it will completely skip the page because it believes that B > C. No exception will be raised and Presto will return the wrong result.
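The two cases above can be sketched with a pair of comparators. This is only an illustration: the class, the `WRITER` and `READER` orderings, and the helper methods are all hypothetical stand-ins and do not reflect how Parquet or Presto actually compute or check statistics.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of a writer and a reader that disagree on string order.
final class MismatchedOrdering {
    // Assumed writer order: A < B < C.
    static final Comparator<String> WRITER = Comparator.naturalOrder();

    // Assumed reader order: A < C < B.
    static final List<String> READER_RANK = Arrays.asList("A", "C", "B");
    static final Comparator<String> READER = Comparator.comparingInt(READER_RANK::indexOf);

    // Min/max statistics as the writer would compute them: {min, max}.
    static String[] stats(List<String> values) {
        return new String[] {Collections.min(values, WRITER), Collections.max(values, WRITER)};
    }

    // The reader's sanity check: min must not sort after max in the reader's order.
    static boolean looksCorrupt(String[] stats) {
        return READER.compare(stats[0], stats[1]) > 0;
    }

    // The reader's pruning decision: could the page contain the value?
    static boolean mayContain(String[] stats, String value) {
        return READER.compare(value, stats[0]) >= 0 && READER.compare(value, stats[1]) <= 0;
    }

    public static void main(String[] args) {
        // Best case: page of Bs and Cs, stats are min=B, max=C, reader reports corruption.
        String[] bc = stats(Arrays.asList("B", "C", "B"));
        System.out.println("B/C page flagged corrupt: " + looksCorrupt(bc));

        // Worst case: page of As, Bs, and Cs, stats are min=A, max=C and look valid,
        // yet the reader decides the page cannot contain B and skips it.
        String[] abc = stats(Arrays.asList("A", "B", "C"));
        System.out.println("A/B/C page flagged corrupt: " + looksCorrupt(abc));
        System.out.println("A/B/C page may contain B: " + mayContain(abc, "B"));
    }
}
```

In this model the second page passes the corruption check and is then pruned for a value it actually holds, which is the silent wrong-result scenario described above.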

Am I missing something?

thoralf-gutierrez on Jun 15, 2019

Just to make sure you and others have a workaround, these stats checks can be turned off with the hive.parquet.fail-on-corrupted-statistics config property and the parquet_fail_with_corrupted_statistics session property.

nezihyigitbasi on Feb 13, 2019

@igorcalabria I took a quick look, and if I do "Áncash".compareTo("China") or SliceUtf8.compareUtf16BE(minSlice, maxSlice) > 0 I still get a positive number (so the test still fails), which means “Áncash” lexicographically follows “China”. Can you please double check whether your test case is valid?

I think we should replace minSlice.compareTo(maxSlice) > 0 with SliceUtf8.compareUtf16BE(minSlice, maxSlice) > 0 anyways. @zhenxiao what do you think?

nezihyigitbasi on Feb 13, 2019