hudi: [SUPPORT] Deadlock on Hudi Java Client in OCC mode
Describe the problem you faced
I am trying to mock a distributed system with a test that runs the Hudi Java Client in OCC mode (link).
I am running into starvation while waiting for locks, using just 3 writers to mimic 3 distributed machines writing to the same table. The performance does not seem practical the way I am testing it, and I am trying to understand how to optimize it or what to avoid.
The starvation occurs with both the ZooKeeper and FS lock providers, but it is more prominent with ZooKeeper, since there are multiple requests for locks, which results in infinite starvation.
TL;DR: Run the test below. After a few writes, the client enters a starvation phase, stays idle doing no work, and eventually fails with the exception below:
org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object
To Reproduce
Run the test here and look at the logs and the occ/tmp/hudiTest dir for the test table.
Steps to reproduce the behavior:
- Just run the test to reproduce the starvation using the FS lock provider.
- To reproduce the ZooKeeper starvation scenario, comment out lines 151-156 and uncomment lines 160-168.
- Install Docker and run
  docker run -d --name zookeeper -p 2181:2181 jplock/zookeeper
- Delete the occ/tmp directory and re-run the test.
- The test will hang due to starvation after a few seconds of running. You can inspect the ZooKeeper locks that are held and never released, as shown in the next two steps.
- Download the ZooKeeper client and run
  sh /opt/zookeeper-3.7.1-bin/bin/zkCli.sh -server 127.0.0.1:2181
- After the client connects, run
  ls /test/test_table
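For readers without access to the linked test, the ZooKeeper variant of the setup can be approximated with the sketch below. It is an assumption based on the lock settings in Hudi's public concurrency-control documentation, not a copy of the test's actual configuration, which the issue does not show. The host, port, base path, and lock key are taken from the commands and log lines in this issue; the class name `OccZkLockProps` and its `build` helper are hypothetical.

```java
import java.util.Properties;

// Hypothetical helper: collects the writer properties that, per the Hudi
// concurrency-control docs, turn on OCC with a ZooKeeper-based lock provider.
class OccZkLockProps {

    static Properties build(String zkHost, String zkPort, String basePath, String lockKey) {
        Properties props = new Properties();
        // Enable optimistic concurrency control for multi-writer scenarios.
        props.setProperty("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
        // The docs pair OCC with lazy cleaning of failed writes.
        props.setProperty("hoodie.cleaner.policy.failed.writes", "LAZY");
        // Lock provider class named in the log lines of this issue.
        props.setProperty("hoodie.write.lock.provider",
                "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider");
        props.setProperty("hoodie.write.lock.zookeeper.url", zkHost);
        props.setProperty("hoodie.write.lock.zookeeper.port", zkPort);
        props.setProperty("hoodie.write.lock.zookeeper.base_path", basePath);
        props.setProperty("hoodie.write.lock.zookeeper.lock_key", lockKey);
        return props;
    }

    public static void main(String[] args) {
        // Values mirror this issue: ZooKeeper on 127.0.0.1:2181, locks under /test/test_table.
        Properties props = build("127.0.0.1", "2181", "/test", "test_table");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

With these values the lock nodes would live under /test/test_table, which is the path inspected with zkCli.sh in the last step.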
Expected behavior
The test completes with reasonable performance. It generates records with keys in the range 0-99, 10 times. Each partition should have 1 insert and 9 updates happening in parallel.
OCC mode should deliver reasonable performance with the Java Client to support high-throughput writes/updates.
Environment Description
- Hudi version : 0.12.2
- Spark version :
- Hive version :
- Hadoop version :
- Storage (HDFS/S3/GCS…) : Local FS
- Running on Docker? (yes/no) : No
Additional context
Stacktrace
The test runs for a while and then starves at the log point below:
2023-01-12 00:59:03,814 [INFO ] ConnectionStateManager - State change: CONNECTED
2023-01-12 00:59:09,199 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 00:59:09,739 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:00:04,821 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:00:10,215 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:00:10,756 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:01:05,839 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:01:11,235 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:01:11,771 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:02:06,856 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:02:12,255 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:02:12,789 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:03:07,875 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:03:13,272 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:03:13,802 [INFO ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
It eventually fails with the following error:
org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object
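The ACQUIRING lines in the log above arrive in groups of three, which suggests that each of the 3 writers retries on its own fixed cycle. The sketch below checks this by computing the gap between each ACQUIRING line and the one three lines earlier. The timestamps are copied from the log; the idea that the writers log in a fixed rotation is an inference from the timing, not something the log states, and the class name `LockRetryGaps` is hypothetical.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: measures how often each writer re-attempts the lock,
// assuming the three writers log their ACQUIRING lines in a fixed rotation.
class LockRetryGaps {

    // Time-of-day portion of the 14 ACQUIRING log lines, in log order.
    static final String[] ACQUIRING_TIMES = {
        "00:59:09,199", "00:59:09,739", "01:00:04,821",
        "01:00:10,215", "01:00:10,756", "01:01:05,839",
        "01:01:11,235", "01:01:11,771", "01:02:06,856",
        "01:02:12,255", "01:02:12,789", "01:03:07,875",
        "01:03:13,272", "01:03:13,802"
    };

    // Gap in milliseconds between entry i and entry i - stride, for every valid i.
    static List<Long> gapsMillis(String[] times, int stride) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("HH:mm:ss,SSS");
        List<Long> gaps = new ArrayList<>();
        for (int i = stride; i < times.length; i++) {
            LocalTime earlier = LocalTime.parse(times[i - stride], fmt);
            LocalTime later = LocalTime.parse(times[i], fmt);
            gaps.add(Duration.between(earlier, later).toMillis());
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Stride 3 corresponds to the 3 writers in the test.
        for (long gap : gapsMillis(ACQUIRING_TIMES, 3)) {
            System.out.println(gap + " ms");
        }
    }
}
```

Every gap comes out at roughly 61 seconds (between 61,013 and 61,020 ms). That is consistent with each writer waiting about 60 seconds for the lock, pausing about 1 second, and trying again. Hudi exposes `hoodie.write.lock.wait_time_ms`, `hoodie.write.lock.wait_time_ms_between_retry`, and `hoodie.write.lock.num_retries` for this behavior; whether the values in effect here account for the 61-second cycle is an assumption, since the issue does not show them.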
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 19 (13 by maintainers)
If you are interested in contributing, let us know. We can assist or guide you if need be.
I think we need, seems there are some issues with the fs view refresh.