hudi: Slow Write into Hudi Dataset(MOR)
Hi Team,
I am reading data from Kafka and ingesting it into a Hudi dataset (MOR) using the Hudi DataSource API through Spark Structured Streaming. The pipeline structure is as follows:
Kafka(Source) > Spark Structured Streaming(EMR) > MOR Hudi table(S3)
Spark: 2.4.5, Hudi: 0.5.2
I am seeing performance issues while writing data into the Hudi dataset. The following Hudi jobs are taking the most time:
- countByKey at HoodieBloomIndex.java
- countByKey at WorkloadProfile.java
- count at HoodieSparkSqlWriter.scala
The configuration used to write the Hudi dataset is as follows:
new_df.write.format("org.apache.hudi") \
    .option("hoodie.table.name", tableName) \
    .option("hoodie.datasource.write.operation", "upsert") \
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") \
    .option("hoodie.datasource.write.recordkey.field", "wbn") \
    .option("hoodie.datasource.write.partitionpath.field", "ad") \
    .option("hoodie.datasource.write.precombine.field", "action_date") \
    .option("hoodie.compact.inline", "true") \
    .option("hoodie.compact.inline.max.delta.commits", "300") \
    .option("hoodie.datasource.hive_sync.enable", "true") \
    .option("hoodie.upsert.shuffle.parallelism", "5") \
    .option("hoodie.insert.shuffle.parallelism", "5") \
    .option("hoodie.bulkinsert.shuffle.parallelism", "5") \
    .option("hoodie.datasource.hive_sync.table", tableName) \
    .option("hoodie.datasource.hive_sync.partition_fields", "ad") \
    .option("hoodie.index.type", "GLOBAL_BLOOM") \
    .option("hoodie.bloom.index.update.partition.path", "true") \
    .option("hoodie.datasource.hive_sync.assume_date_partitioning", "false") \
    .option("hoodie.datasource.hive_sync.partition_extractor_class",
            "org.apache.hudi.hive.MultiPartKeysValueExtractor") \
    .mode("append").save(tablePath)
Spark Submit command -
spark-submit --deploy-mode client --master yarn \
  --executor-memory 6g --executor-cores 1 \
  --driver-memory 4g \
  --conf spark.driver.maxResultSize=2g \
  --conf spark.executor.id=driver \
  --conf spark.executor.instances=300 \
  --conf spark.kryoserializer.buffer.max=512m \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf spark.task.cpus=1 \
  --conf spark.yarn.driver.memoryOverhead=1024 \
  --conf spark.yarn.executor.memoryOverhead=3072 \
  --conf spark.yarn.max.executor.failures=100 \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 \
  --py-files s3://spark-test/hudi_job.py
Attaching screenshots of the job details for the slow stages:
[screenshot] countByKey at HoodieBloomIndex.java
[screenshot] countByKey at WorkloadProfile.java
[screenshot] count at HoodieSparkSqlWriter.scala
Please suggest how I can tune this.
Thanks,
Raghvendra
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 24 (13 by maintainers)
@vinothchandar I ran the job with a 5-minute batch interval using MOR. Now I can see that commit duration is 5 minutes and compaction is also 5 minutes, and updated records are only 10% of the total records written, but the job is now running with a huge lag. Sample commits are as below -