presto: Query Errors out on Hive Bucket table

We have a table with Hive partitions and then “bucketing” on the second column. The queries work fine in Hive, but when we try to access it in Presto, it errors out saying the “hive table is corrupt”. This is the sample of what we are trying to do -

create table TEST_HIVE_BUCKETS (
COL1 string,
COL2 string
)
PARTITIONED BY (
  eff_dt timestamp)
  CLUSTERED BY( COL2) INTO 50 BUCKETS;

Load data for atleast 2-3 days. When we run the queries

presto:default> select * from test_hive_buckets;
Query 20170720_145352_00039_m57j6 failed: Hive table is corrupt. It is declared as being bucketed, but the files do not match the bucketing declaration. The number of files in the directory (50) does not match the declared bucket count (5) for partition: eff_dt=2016-12-02 00%3A00%3A00

Looks like the number of files under the folder should exactly match with the number of buckets on the table definition. Is this the case?

About this issue

  • Original URL
  • State: open
  • Created 7 years ago
  • Reactions: 6
  • Comments: 22 (5 by maintainers)

Most upvoted comments

In case it helps, the latest version from Starburst has the code fix for the above scenario. Check out the release notes for more details - https://docs.starburstdata.com/latest/release/release-0.195-e.html

On Mon, Apr 2, 2018 at 8:11 AM, rabinnh notifications@github.com wrote:

I’ve been working on this issue for a week, and have posted and received some answers in both the Hive and Presto mailing-list/forums. Here’s what I have found out:

  • AFAIK, there is no way to force Hive with the Tez engine (and mr is deprecated) to create all the buckets. So no help from Hive.
  • There is no way to force Presto to continue when not all the buckets are present (bucket number does not match metastore). So no help from Presto.

So we would seem to be at an impasse. If any partition in the table doesn’t have a full complement of buckets, a Presto query will fail. I can insert with the mr engine, but it is much slower and there is no way to tell if it will continue to be supported, since it is deprecated. We run into this in our test environment, but we cannot confidently deploy to production because we can’t chance that any particular hour will result in fewer buckets in a partition. This makes Presto too fragile for our use when using buckets.

The only “solution” that I can think of is to do away with bucketing completely and to depend solely on ORC indexes, which seems like a pretty sub-standard solution.

My guess is that Facebook would almost never run into the issue, because there’s likely so much data that all buckets are always created, even with Tez.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/prestodb/presto/issues/8557#issuecomment-377937400, or mute the thread https://github.com/notifications/unsubscribe-auth/AQhEuKjUexVWM7OCbxkoVsyNJE4pLEGTks5tkjF7gaJpZM4OeOth .

– Sajumon Joseph 303-378-0593 sajjoseph@gmail.com

Fair enough. But as the crux of my issue is inserting into a bucketed table such that Presto will always be able to query it, doing away with the SORTED BY part of the DDL is a fair tradeoff.

Thanks for the clarification though; someone who finds this thread will now have accurate information.