hudi: [SUPPORT] Querying partitioned tables is too slow when using the trino-hudi connector.
Describe the problem you faced
We are testing the embedded Hudi connector on a copy-on-write table using Trino 405 (the latest stable version), but we ran into a serious performance problem. Our tables will have a very large number of partitions, so we made a minimal test set to reproduce this.
To Reproduce
Steps to reproduce the behavior:
Test table data: hudi_reproduce.tar.gz. This test table has many partitions and is partitioned by day, type. There are 657 records in total.
- Import the data and run HiveQL to repair the partitions:
```sql
CREATE EXTERNAL TABLE `website.hudi_reproduce`(
`_hoodie_commit_time` string,
`_hoodie_commit_seqno` string,
`_hoodie_record_key` string,
`_hoodie_partition_path` string,
`_hoodie_file_name` string,
`uniquekey` string)
PARTITIONED BY (
`day` bigint,
`type` bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'hoodie.query.as.ro.table'='false',
'path'='hdfs://xxx/hudi/warehouse/hudi_reproduce')
STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://xxx/hudi/warehouse/hudi_reproduce'
TBLPROPERTIES (
'last_commit_time_sync'='20230111113655773',
'last_modified_time'='1673406649',
'spark.sql.sources.provider'='hudi',
'spark.sql.sources.schema.numPartCols'='2',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"uniqueKey","type":"string","nullable":true,"metadata":{}},{"name":"day","type":"long","nullable":true,"metadata":{}},{"name":"type","type":"long","nullable":true,"metadata":{}}]}',
'spark.sql.sources.schema.partCol.0'='day',
'spark.sql.sources.schema.partCol.1'='type',
'transient_lastDdlTime'='1673406649');

-- repair partitions.
msck repair table website.hudi_reproduce;
```
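For context on what `msck repair table` does here: it scans the table location on HDFS for Hive-style partition directories (`day=…/type=…`) and registers each one in the metastore. A minimal, hypothetical sketch of that path-to-partition-values parsing (illustrative only, not Hive's actual implementation):

```python
# Illustrative sketch: mapping a Hive-style partition directory path
# like "day=20230101/type=3" to its partition column values.
# Not real Hive code; MSCK REPAIR performs this discovery over HDFS.

def parse_partition_path(path: str) -> dict:
    """Parse 'day=20230101/type=3' into {'day': 20230101, 'type': 3}."""
    values = {}
    for segment in path.strip("/").split("/"):
        key, _, value = segment.partition("=")
        values[key] = int(value)  # both partition columns are bigint here
    return values

print(parse_partition_path("day=20230101/type=3"))
# {'day': 20230101, 'type': 3}
```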
- Run the Trino SQL query:

```sql
-- we want to query the data where type is between 1 and 9 and day is between 20230101 and 20230104
select count(1) from hudi.website.hudi_reproduce where day between 20230101 and 20230104 and type between 1 and 9;
```
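With both predicates on partition columns, a connector that prunes partitions should only touch a small fraction of the table. A hypothetical sketch of the expected pruning (the partition counts below are simulated, not taken from the test dataset):

```python
# Illustrative partition pruning for the query's predicates.
# Not Trino/Hudi code; the simulated partition grid is an assumption.

def prune_partitions(partitions, day_range, type_range):
    """Keep only partitions whose (day, type) fall inside both ranges."""
    lo_d, hi_d = day_range
    lo_t, hi_t = type_range
    return [p for p in partitions
            if lo_d <= p["day"] <= hi_d and lo_t <= p["type"] <= hi_t]

# Simulated grid: day = 20230101..20230131, type = 1..20 (620 partitions).
all_partitions = [{"day": 20230100 + d, "type": t}
                  for d in range(1, 32) for t in range(1, 21)]

pruned = prune_partitions(all_partitions, (20230101, 20230104), (1, 9))
print(len(all_partitions), len(pruned))
# 620 36  -> only 4 days x 9 types = 36 partitions need scanning
```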
- The query runs far too slowly:

Expected behavior
Queries should be as fast as on the equivalent Hive table.
Environment Description
- Hudi version : 0.12.2
- Hive version : 2.3.9
- Hadoop version : 2.8.5
- Trino version : 405
- Number of Trino workers : 8
- Storage (HDFS/S3/GCS…) : HDFS
- Running on Docker? (yes/no) : no
Additional context
Sharing our Trino server.log in case it helps: hudi_reproduce_trino_server_log.log
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 16 (16 by maintainers)
@codope Thank you for your work! I am also looking forward to a powerful Hudi connector. I have an additional question: do you think the "get all partitions" calls appearing in the logs are normal behavior?
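To make the "get all partitions" question concrete, here is a hypothetical model of the two metastore access patterns being contrasted in this thread. The class and method names are illustrative only, not the real Hive Metastore or Trino APIs:

```python
# Hypothetical model of metastore partition listing strategies.
# Names and counts are assumptions for illustration.

class FakeMetastore:
    def __init__(self, partitions):
        self.partitions = partitions
        self.listed = 0  # how many partition entries were fetched

    def get_all_partitions(self):
        # What "get all partitions" in the logs suggests: fetch every
        # partition regardless of the query's predicates.
        self.listed += len(self.partitions)
        return list(self.partitions)

    def get_partitions_by_filter(self, predicate):
        # Predicate pushdown: only matching partitions are fetched.
        matching = [p for p in self.partitions if predicate(p)]
        self.listed += len(matching)
        return matching

# Simulated (day, type) grid: 31 days x 20 types = 620 partitions.
parts = [(20230100 + d, t) for d in range(1, 32) for t in range(1, 21)]
pred = lambda p: 20230101 <= p[0] <= 20230104 and 1 <= p[1] <= 9

ms_all = FakeMetastore(parts)
ms_all.get_all_partitions()
print(ms_all.listed)      # 620 entries fetched up front

ms_filtered = FakeMetastore(parts)
ms_filtered.get_partitions_by_filter(pred)
print(ms_filtered.listed) # 36 entries fetched
```

The cost of the first pattern grows with the total partition count of the table, which is exactly the scenario described in this issue.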
@codope ok. I will try it this week.
@BruceKellan I have a working patch with significant performance gains. On your table, I could see a 50-60% latency reduction: https://github.com/codope/trino/pull/23 Can you try the above patch? Let me know if you have trouble building; in that case I can share the trino-server tarball with you. I need to make a few minor changes before I can raise a PR against the Trino repo, but early feedback from you would be helpful.
@BruceKellan thanks for the detailed context! This is very helpful.
cc @yihua
The regression should be a blocker for release 0.13.0. I have created a JIRA: https://issues.apache.org/jira/browse/HUDI-5552