Presto: Query errors out on Hive bucketed table
We have a table with Hive partitions and bucketing on the second column. Queries work fine in Hive, but when we try to access the table in Presto, it errors out saying the "Hive table is corrupt". Here is a sample of what we are trying to do:
```sql
CREATE TABLE TEST_HIVE_BUCKETS (
  COL1 string,
  COL2 string
)
PARTITIONED BY (eff_dt timestamp)
CLUSTERED BY (COL2) INTO 50 BUCKETS;
```
Load data for at least 2-3 days. When we run the query:
```
presto:default> select * from test_hive_buckets;
Query 20170720_145352_00039_m57j6 failed: Hive table is corrupt. It is declared as being bucketed, but the files do not match the bucketing declaration. The number of files in the directory (50) does not match the declared bucket count (5) for partition: eff_dt=2016-12-02 00%3A00%3A00
```
It looks like the number of files under the partition directory must exactly match the number of buckets in the table definition. Is this the case?
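Yes, that is what the error implies: Presto refuses to read a partition whose data-file count differs from the declared bucket count. The check can be sketched in Python as below; `check_bucketing` is a hypothetical helper for illustration, not Presto's actual code, and the file-naming scheme is only an example:

```python
import os
import tempfile

def check_bucketing(partition_dir, declared_buckets):
    """Mimic Presto's sanity check: a bucketed Hive partition is
    expected to contain exactly one data file per declared bucket."""
    # Ignore hidden files and markers such as _SUCCESS
    data_files = [f for f in os.listdir(partition_dir)
                  if not f.startswith(('.', '_'))]
    if len(data_files) != declared_buckets:
        raise ValueError(
            "Hive table is corrupt: %d files in directory, "
            "%d declared buckets" % (len(data_files), declared_buckets))
    return True

# Demo: a partition holding 5 bucket files passes a 5-bucket check
# but is rejected against a table declared with 50 buckets.
with tempfile.TemporaryDirectory() as part:
    for i in range(5):
        open(os.path.join(part, "00000%d_0" % i), "w").close()
    print(check_bucketing(part, 5))   # True
    try:
        check_bucketing(part, 50)
    except ValueError as e:
        print("rejected:", e)
```

In practice the mismatch usually comes from loading data with an engine (or Hive settings) that does not enforce bucketing on write, so the partition ends up with an arbitrary number of files instead of one per bucket.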
About this issue
- State: open
- Created 7 years ago
- Reactions: 6
- Comments: 22 (5 by maintainers)
In case it helps, the latest Starburst release includes a fix for the above scenario. See the release notes for details: https://docs.starburstdata.com/latest/release/release-0.195-e.html
Fair enough. But since the crux of my issue is inserting into a bucketed table such that Presto will always be able to query it, doing away with the SORTED BY part of the DDL is a fair tradeoff.
Thanks for the clarification though; someone who finds this thread will now have accurate information.