hudi: Hudi upsert hangs

Describe the problem you faced

When we upsert data into Hudi, the job hangs in some cases. Specifically, we have an ETL pipeline where we re-ingest a lot of data (i.e. we upsert data that already exists in the Hudi table). When the proportion of data that is not new is very high, the Hudi Spark job appears to hang before writing out the updated table.

Note that this currently affects 2 of the 80 tables in our ETL pipeline and the rest run fine.

To Reproduce

See gist at: https://gist.github.com/bwu2/89f98e0926374f71c80e4b2fa5089f18

The code there creates a Hudi table with 4m rows. It then upserts another 4m rows, 3.5m of which are the same as the original 4m.

Note that the bulk insert parallelism of the initial load is deliberately set to 1 to avoid creating lots of small files.
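For reference, here is a minimal PySpark sketch of that setup. It is an approximation of the gist, not the gist itself: the table path, schema, record key, and partition column (`id`, `ts`, `part`, `s3://my-bucket/hudi/test_table`) are made up for illustration, while the Hudi option keys are the standard datasource configs.

```python
from pyspark.sql import functions as F

# Hypothetical path and column names; the real ones are in the gist linked above.
table_path = "s3://my-bucket/hudi/test_table"

hudi_options = {
    "hoodie.table.name": "test_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "part",
}

# Initial load: 4m rows, bulk insert parallelism deliberately set to 1
# so the table is written as a small number of large files.
df1 = (spark.range(0, 4000000)
       .withColumn("ts", F.lit(0))
       .withColumn("part", F.lit("2020-01-01")))
(df1.write.format("org.apache.hudi")
    .options(**hudi_options)
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .option("hoodie.bulkinsert.shuffle.parallelism", "1")
    .mode("overwrite")
    .save(table_path))

# Upsert: another 4m rows, 3.5m of which share keys with the original 4m.
df2 = (spark.range(500000, 4500000)
       .withColumn("ts", F.lit(1))
       .withColumn("part", F.lit("2020-01-01")))
(df2.write.format("org.apache.hudi")
    .options(**hudi_options)
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save(table_path))
```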

Running this code on an EMR cluster (either interactively in a PySpark shell or via spark-submit) causes the upsert job to never finish. It gets stuck in the Spark job with description (from the Spark history server) count at HoodieSparkSqlWriter.scala:255, after the stage mapToPair at HoodieWriteClient.java:492 and before/during the stage count at HoodieSparkSqlWriter.scala:255.

For a table this small, the number of cores/executors, memory, and instance type shouldn't matter, but we have varied these too with no success.

Expected behavior

Expected the upsert job to succeed and the total number of rows in the table to be 4.5m.

Environment Description

Running on EMR 5.29.0

  • Hudi version : tested on 0.5.0, 0.5.1 and latest build off master

  • Spark version : 2.4.4

  • Hive version : N/A

  • Hadoop version : 2.8.5 (Amazon)

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : NO


Most upvoted comments

Fix landed on master

Reposting my response here…

There seem to be a lot of common concerns here… https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide is a useful resource that hopefully can help here…

A few high-level thoughts:

  • It would be good to lay out whether most of the time is spent in the indexing stages (the ones tagged with HoodieBloomIndex) or in the actual writing…
  • Hudi does keep the input in memory to compute the stats it needs to size files. So if you don’t provide sufficient executor/RDD storage memory, it will spill and can cause slowdowns… (covered in the tuning guide; we have seen this happen with users often)
  • On the workload pattern itself, BloomIndex range pruning can be turned off (https://hudi.apache.org/docs/configurations.html#bloomIndexPruneByRanges) if the key ranges are random anyway (see the sketch after this list)… Generally speaking, until we have RFC-8 (record level indexing), cases of random writes/upserts touching the majority of the rows in a table may incur bloom index overhead, since the bloom filters/ranges are not useful at all in pruning out files. We have an interim solution coming out in the next release: falling back to a plain old join to implement the indexing.
  • In terms of MOR vs. COW, MOR will help only if you have lots of updates and the bottleneck is on the writing…
  • If listing is an issue, please turn on the following so the table is listed once and we re-use the filesystem metadata: hoodie.embed.timeline.server=true
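As an illustration of the knobs mentioned above, here is a hedged sketch applying them to the upsert from the earlier snippet. The values are placeholders, not recommendations; tune per the Tuning Guide linked above, and note that executor/RDD storage memory is set at spark-submit time (e.g. spark.executor.memory), not through these writer options.

```python
# Sketch only: the tuning options discussed above, applied to the upsert writer.
tuning_options = {
    # Skip bloom-filter range pruning when record keys are effectively random.
    "hoodie.bloom.index.prune.by.ranges": "false",
    # Reuse filesystem metadata via the embedded timeline server,
    # so the table is listed only once per write.
    "hoodie.embed.timeline.server": "true",
}

(df2.write.format("org.apache.hudi")
    .options(**hudi_options)       # base options from the earlier sketch
    .options(**tuning_options)
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save(table_path))
```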

I would appreciate a JIRA, so that I can break each of these into sub-tasks and tackle/resolve them independently…

I am personally focusing on performance now and want to make it a lot faster in the 0.6.0 release. So all this help would be deeply appreciated.