hudi: [SUPPORT] Queries on partitioned tables are too slow when using the trino-hudi connector.

Describe the problem you faced

We are testing the embedded Hudi connector on a copy-on-write table using Trino 405 (the latest stable version), but we ran into a serious performance problem. Our tables will have a very large number of partitions, so we built a minimal test set to reproduce the issue.

To Reproduce

Steps to reproduce the behavior:

Test table data: hudi_reproduce.tar.gz. Description: this test table has many partitions and is partitioned by day, type. There are 657 records in total.

  1. Import the data and run the following HiveQL to create the table and repair the partitions:
CREATE EXTERNAL TABLE `website.hudi_reproduce`(
`_hoodie_commit_time` string,
`_hoodie_commit_seqno` string,
`_hoodie_record_key` string,
`_hoodie_partition_path` string,
`_hoodie_file_name` string,
`uniquekey` string)
PARTITIONED BY (
`day` bigint,
`type` bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'hoodie.query.as.ro.table'='false',
'path'='hdfs://xxx/hudi/warehouse/hudi_reproduce')
STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://xxx/hudi/warehouse/hudi_reproduce'
TBLPROPERTIES (
'last_commit_time_sync'='20230111113655773',
'last_modified_time'='1673406649',
'spark.sql.sources.provider'='hudi',
'spark.sql.sources.schema.numPartCols'='2',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"uniqueKey","type":"string","nullable":true,"metadata":{}},{"name":"day","type":"long","nullable":true,"metadata":{}},{"name":"type","type":"long","nullable":true,"metadata":{}}]}',
'spark.sql.sources.schema.partCol.0'='day',
'spark.sql.sources.schema.partCol.1'='type',
'transient_lastDdlTime'='1673406649');

-- repair partitions.
msck repair table website.hudi_reproduce;
  2. Run the Trino SQL query:
-- we want to query rows where type is between 1 and 9 and day is between 20230101 and 20230104
select count(1) from hudi.website.hudi_reproduce where day between 20230101 and 20230104 and type between 1 and 9;
  3. The query is too slow (see screenshot).
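One way to narrow down where the time goes (a diagnostic sketch, not part of the original report) is to check whether partition pruning is reflected in the plan, and whether the Hive metastore actually registered the expected partitions after the repair:

```sql
-- Trino: inspect the distributed plan for the slow query;
-- the estimated splits/rows hint at whether the day/type
-- predicates prune partitions before the scan.
EXPLAIN
SELECT count(1)
FROM hudi.website.hudi_reproduce
WHERE day BETWEEN 20230101 AND 20230104
  AND type BETWEEN 1 AND 9;

-- Hive: confirm the partitions created by MSCK REPAIR TABLE.
SHOW PARTITIONS website.hudi_reproduce;
```

If the plan still enumerates all partitions, the bottleneck is likely in the connector's partition listing rather than in the scan itself.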

Expected behavior

The query should be as fast as querying the equivalent Hive table.

Environment Description

  • Hudi version : 0.12.2

  • Hive version : 2.3.9

  • Hadoop version : 2.8.5

  • Trino version: 405

  • Number of Trino workers: 8

  • Storage (HDFS/S3/GCS…) : HDFS

  • Running on Docker? (yes/no) : no

Additional context

Sharing our Trino server.log in case it helps: hudi_reproduce_trino_server_log.log

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

@codope Thank you for your work! I am also looking forward to seeing a powerful Hudi connector. I also have an additional question: do you think the "get all partitions" call appearing in the logs is normal behavior?

@codope ok. I will try it this week.

@BruceKellan I have a working patch with significant performance gains. On your table, I could see a 50-60% latency reduction. https://github.com/codope/trino/pull/23 Can you try the above patch? Let me know if you have trouble building; in that case I can share the trino-server tarball with you. I need to make a few minor changes before I can raise a PR against the Trino repo, but early feedback from you would be helpful.

@BruceKellan thanks for the detailed context! This is very helpful.

cc @yihua

This regression should be a blocker for release 0.13.0; I have created a JIRA: https://issues.apache.org/jira/browse/HUDI-5552