hudi: [SUPPORT] Deadlock on Hudi Java Client in OCC mode

Describe the problem you faced

I am trying to mock a distributed system with a test that runs the Hudi Java Client in OCC mode (link).

I am running into starvation while waiting for locks, using just 3 writers to mimic 3 distributed machines writing to the same table. Performance does not seem practical the way I'm testing it. I am trying to understand how to optimize this, or what not to do.

The starvation occurs with both the ZooKeeper and FS lock providers, but it is more prominent with ZooKeeper, where multiple outstanding lock requests result in indefinite starvation.

TL;DR: Run the test below. After a few writes, the client enters a starvation phase, sits idle doing no work, and eventually fails with the following exception: org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object

To Reproduce

Run the test here and look at the logs and the occ/tmp/hudiTest dir for the test table.

Steps to reproduce the behavior:

  1. Just run the test to reproduce the starvation using the FS lock provider.
  2. To reproduce the ZooKeeper starvation scenario, comment out lines 151-156 and uncomment lines 160-168
  3. Install Docker and run docker run -d --name zookeeper -p 2181:2181 jplock/zookeeper
  4. Delete the occ/tmp directory and re-run the test
  5. The test will hang due to starvation after a few seconds of running. You can inspect the Zookeeper locks being held un-released as shown below.
  6. Download Zookeeper client and do sh /opt/zookeeper-3.7.1-bin/bin/zkCli.sh -server 127.0.0.1:2181
  7. After the client connects, do ls /test/test_table
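
For orientation, the lock-related settings that steps 2 to 7 switch between can be sketched as plain properties. This is a minimal sketch, not the configuration from the linked test: the class and method names (OccLockConfigSketch, zookeeperOccProps) are hypothetical, the host, port, base path and lock key are taken from the docker command and the log output in this issue, and the retry and wait values are illustrative assumptions rather than recommendations.

```java
import java.util.Properties;

// Hypothetical helper: collects the Hudi OCC + ZooKeeper lock settings as plain properties.
class OccLockConfigSketch {

    static Properties zookeeperOccProps() {
        Properties props = new Properties();
        // Multi-writer mode with optimistic concurrency control.
        props.setProperty("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
        // Failed writes are cleaned lazily when several writers share a table.
        props.setProperty("hoodie.cleaner.policy.failed.writes", "LAZY");
        // ZooKeeper-based lock provider.
        props.setProperty("hoodie.write.lock.provider",
            "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider");
        // Values below mirror this issue: ZK at 127.0.0.1:2181, base path /test, key test_table.
        props.setProperty("hoodie.write.lock.zookeeper.url", "127.0.0.1");
        props.setProperty("hoodie.write.lock.zookeeper.port", "2181");
        props.setProperty("hoodie.write.lock.zookeeper.base_path", "/test");
        props.setProperty("hoodie.write.lock.zookeeper.lock_key", "test_table");
        // Assumed, illustrative tuning values (not taken from the issue).
        props.setProperty("hoodie.write.lock.wait_time_ms", "60000");
        props.setProperty("hoodie.write.lock.num_retries", "15");
        return props;
    }

    public static void main(String[] args) {
        // Print the settings so they can be compared with what the test actually passes.
        zookeeperOccProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With these values the lock znode would live under /test/test_table, which is the path inspected with zkCli.sh in step 7.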

Expected behavior

The test completes with reasonable performance. It generates records with keys in the range 0-99, 10 times, so each partition should see 1 insert and 9 updates happening in parallel.

OCC mode having reasonable performance using the Java Client to support high throughput writes/updates.

Environment Description

  • Hudi version : 0.12.2

  • Spark version :

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS…) : Local FS

  • Running on Docker? (yes/no) : No

Additional context

Stacktrace

The test runs for a while and then starves at the log point below:

2023-01-12 00:59:03,814 [INFO  ] ConnectionStateManager - State change: CONNECTED
2023-01-12 00:59:09,199 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 00:59:09,739 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:00:04,821 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:00:10,215 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:00:10,756 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:01:05,839 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:01:11,235 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:01:11,771 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:02:06,856 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:02:12,255 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:02:12,789 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:03:07,875 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:03:13,272 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table
2023-01-12 01:03:13,802 [INFO  ] ZookeeperBasedLockProvider - ACQUIRING lock atZkBasePath = /test, lock key = test_table

It eventually fails with the error org.apache.hudi.exception.HoodieLockException: Unable to acquire lock, lock object
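
The ACQUIRING lines in the log appear to come in groups of three, which would match the 3 writers. A small sketch like the following measures the gap between each attempt and the attempt three lines later. It assumes the writers log in a fixed rotating order, which the issue does not confirm; the class and method names (LockRetryGaps, gapsMillis) are hypothetical, and the timestamps are copied from the first six ACQUIRING lines.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: measures the time between lock attempts that are `stride` lines apart.
class LockRetryGaps {

    // Timestamp layout used in the log excerpt, e.g. 2023-01-12 00:59:09,199
    static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    static List<Long> gapsMillis(List<String> stamps, int stride) {
        List<Long> gaps = new ArrayList<>();
        for (int i = 0; i + stride < stamps.size(); i++) {
            LocalDateTime earlier = LocalDateTime.parse(stamps.get(i), FMT);
            LocalDateTime later = LocalDateTime.parse(stamps.get(i + stride), FMT);
            gaps.add(Duration.between(earlier, later).toMillis());
        }
        return gaps;
    }

    public static void main(String[] args) {
        List<String> stamps = Arrays.asList(
            "2023-01-12 00:59:09,199",
            "2023-01-12 00:59:09,739",
            "2023-01-12 01:00:04,821",
            "2023-01-12 01:00:10,215",
            "2023-01-12 01:00:10,756",
            "2023-01-12 01:01:05,839");
        // Stride 3 assumes one attempt per writer per round (an assumption).
        System.out.println(gapsMillis(stamps, 3)); // prints [61016, 61017, 61018]
    }
}
```

Under that assumption each writer retries roughly every 61 seconds, which would be consistent with a lock wait of about a minute followed by a short pause before the next attempt. This is an inference from the log, not something confirmed in the issue.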

About this issue

  • State: closed
  • Created a year ago
  • Comments: 19 (13 by maintainers)

Most upvoted comments

If you are interested in contributing, let us know. We can assist or guide you if need be.

I think we need, seems there are some issues with the fs view refresh.