hudi: Slow Write into Hudi Dataset(MOR)
Hi Team,
I am reading data from Kafka and ingesting it into a Hudi dataset (MOR) using the Hudi DataSource API through Spark Structured Streaming. The pipeline structure is as follows:
Kafka(Source) > Spark Structured Streaming(EMR) > MOR Hudi table(S3)
Spark: 2.4.5, Hudi: 0.5.2
I am seeing performance issues while writing data into the Hudi dataset. The following Hudi jobs are taking the most time:
- countByKey at HoodieBloomIndex.java
- countByKey at WorkloadProfile.java
- count at HoodieSparkSqlWriter.scala
The configuration used to write the Hudi dataset is as follows:
new_df.write.format("org.apache.hudi") \
    .option("hoodie.table.name", tableName) \
    .option("hoodie.datasource.write.operation", "upsert") \
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") \
    .option("hoodie.datasource.write.recordkey.field", "wbn") \
    .option("hoodie.datasource.write.partitionpath.field", "ad") \
    .option("hoodie.datasource.write.precombine.field", "action_date") \
    .option("hoodie.compact.inline", "true") \
    .option("hoodie.compact.inline.max.delta.commits", "300") \
    .option("hoodie.datasource.hive_sync.enable", "true") \
    .option("hoodie.upsert.shuffle.parallelism", "5") \
    .option("hoodie.insert.shuffle.parallelism", "5") \
    .option("hoodie.bulkinsert.shuffle.parallelism", "5") \
    .option("hoodie.datasource.hive_sync.table", tableName) \
    .option("hoodie.datasource.hive_sync.partition_fields", "ad") \
    .option("hoodie.index.type", "GLOBAL_BLOOM") \
    .option("hoodie.bloom.index.update.partition.path", "true") \
    .option("hoodie.datasource.hive_sync.assume_date_partitioning", "false") \
    .option("hoodie.datasource.hive_sync.partition_extractor_class",
            "org.apache.hudi.hive.MultiPartKeysValueExtractor") \
    .mode("append").save(tablePath)
Spark Submit command -
spark-submit --deploy-mode client --master yarn \
  --executor-memory 6g --executor-cores 1 \
  --driver-memory 4g \
  --conf spark.driver.maxResultSize=2g \
  --conf spark.executor.id=driver \
  --conf spark.executor.instances=300 \
  --conf spark.kryoserializer.buffer.max=512m \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf spark.task.cpus=1 \
  --conf spark.yarn.driver.memoryOverhead=1024 \
  --conf spark.yarn.executor.memoryOverhead=3072 \
  --conf spark.yarn.max.executor.failures=100 \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 \
  --py-files s3://spark-test/hudi_job.py
Attaching screenshots of the job details for the slow stages:
[screenshot] countByKey at HoodieBloomIndex.java
[screenshot] countByKey at WorkloadProfile.java
[screenshot] count at HoodieSparkSqlWriter.scala
Please suggest how I can tune this.
Thanks,
Raghvendra
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 24 (13 by maintainers)
@vinothchandar I ran the job with a 5-minute batch interval using MOR. Now I can see that commit duration is 5 minutes and compaction is also 5 minutes, and updated records are only 10% of the total records written, but the job is now running with a huge lag. Sample commits are as below -