dolphinscheduler: [Bug] [MasterFailoverService] syncMasterNodes does not make current slot correctly after zookeeper reconnect
Search before asking
- I had searched in the issues and found no similar issues.
What happened
We use Dolphinscheduler on K8S and deploy 3 masters. Version is 3.0.5
After zookeeper disconnect, registry could reconnect successfully. but current slot is miscalculated. below is our logs(edited):
master0:
19:50:00.961 MasterSchedulerBootstrap:[233] - Master schedule bootstrap loop command success, command size: 6, current slot: 2, total slot size: 3
19:50:21.305 ServerNodeManager:[302] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 down.
19:50:21.316 ServerNodeManager:[364] - update master nodes, master size: 3, slot: 0, addr: dolphinscheduler-master-0.dolphinscheduler-master-headless:5678
19:50:21.323 ServerNodeManager:[298] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 added.
19:50:21.330 ServerNodeManager:[364] - update master nodes, master size: 3, slot: 0, addr: dolphinscheduler-master-0.dolphinscheduler-master-headless:5678
19:50:22.019 MasterSchedulerBootstrap:[233] - Master schedule bootstrap loop command success, command size: 1, current slot: 0, total slot size: 3
master1:
19:50:21.304 ServerNodeManager:[302] - [WorkflowInstance-0][TaskInstance-0] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 down.
19:50:21.312 ServerNodeManager:[364] - [WorkflowInstance-0][TaskInstance-0] - update master nodes, master size: 3, slot: 1, addr: dolphinscheduler-master-1.dolphinscheduler-master-headless:5678
19:50:21.319 ServerNodeManager:[298] - [WorkflowInstance-0][TaskInstance-0] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 added.
19:50:21.327 ServerNodeManager:[364] - [WorkflowInstance-0][TaskInstance-0] - update master nodes, master size: 3, slot: 1, addr: dolphinscheduler-master-1.dolphinscheduler-master-headless:5678
master2:
19:50:01.097 MasterSchedulerBootstrap:[233] - Master schedule bootstrap loop command success, command size: 7, current slot: 0, total slot size: 3
19:50:21.305 ServerNodeManager:[302] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 down.
19:50:21.324 ServerNodeManager:[364] - update master nodes, master size: 0, slot: 0, addr: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
19:50:21.428 ServerNodeManager:[298] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 added.
19:50:21.434 ServerNodeManager:[364] - update master nodes, master size: 0, slot: 0, addr: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
master2 is briefly disconnected and reconnected. after disconnect master0 changed slot from 2 to 0. then reconnect, master2 is still slot 0, not 2 causing our 1/3 requests are not handled by any master.
master2 detailed logs are here:
[INFO] 2023-04-09 19:50:20.701 +0900 org.apache.zookeeper.ClientCnxn:[1158] - [WorkflowInstance-0][TaskInstance-0] - Unable to read additional data from server sessionid 0x3032c34cb87185b, likely server has closed socket, closing socket connection and attempting reconnect
[WARN] 2023-04-09 19:50:20.719 +0900 org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThreadPool:[109] - [WorkflowInstance-577314][TaskInstance-0] - The workflow has been executed by another thread
[INFO] 2023-04-09 19:50:20.804 +0900 org.apache.curator.framework.state.ConnectionStateManager:[251] - [WorkflowInstance-0][TaskInstance-0] - State change: SUSPENDED
[WARN] 2023-04-09 19:50:20.808 +0900 org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener:[52] - [WorkflowInstance-0][TaskInstance-0] - Registry suspended
[WARN] 2023-04-09 19:50:20.811 +0900 org.apache.dolphinscheduler.server.master.registry.MasterConnectionStateListener:[48] - [WorkflowInstance-0][TaskInstance-0] - registry connection state is SUSPENDED, ready to retry connection
[INFO] 2023-04-09 19:50:21.292 +0900 org.apache.zookeeper.ClientCnxn:[1025] - [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server zookeeper-1.zookeeper-headless/10.233.125.25:2181. Will not attempt to authenticate using SASL (unknown error)
[INFO] 2023-04-09 19:50:21.295 +0900 org.apache.zookeeper.ClientCnxn:[879] - [WorkflowInstance-0][TaskInstance-0] - Socket connection established to zookeeper-1.zookeeper-headless/10.233.125.25:2181, initiating session
[INFO] 2023-04-09 19:50:21.299 +0900 org.apache.zookeeper.ClientCnxn:[1299] - [WorkflowInstance-0][TaskInstance-0] - Session establishment complete on server zookeeper-1.zookeeper-headless/10.233.125.25:2181, sessionid = 0x3032c34cb87185b, negotiated timeout = 30000
[INFO] 2023-04-09 19:50:21.300 +0900 org.apache.curator.framework.state.ConnectionStateManager:[251] - [WorkflowInstance-0][TaskInstance-0] - State change: RECONNECTED
[INFO] 2023-04-09 19:50:21.305 +0900 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[302] - [WorkflowInstance-0][TaskInstance-0] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 down.
[INFO] 2023-04-09 19:50:21.304 +0900 org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[124] - [WorkflowInstance-0][TaskInstance-0] - MASTER node deleted : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
[INFO] 2023-04-09 19:50:21.308 +0900 org.apache.dolphinscheduler.server.master.service.FailoverService:[53] - [WorkflowInstance-0][TaskInstance-0] - Master failover starting, masterServer: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
[WARN] 2023-04-09 19:50:21.323 +0900 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[362] - [WorkflowInstance-0][TaskInstance-0] - current addr:dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 is not in active master list
[INFO] 2023-04-09 19:50:21.324 +0900 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[364] - [WorkflowInstance-0][TaskInstance-0] - update master nodes, master size: 0, slot: 0, addr: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
[INFO] 2023-04-09 19:50:21.428 +0900 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[298] - [WorkflowInstance-0][TaskInstance-0] - master node : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 added.
[WARN] 2023-04-09 19:50:21.434 +0900 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[362] - [WorkflowInstance-0][TaskInstance-0] - current addr:dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 is not in active master list
[INFO] 2023-04-09 19:50:21.434 +0900 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[364] - [WorkflowInstance-0][TaskInstance-0] - update master nodes, master size: 0, slot: 0, addr: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
[WARN] 2023-04-09 19:50:22.186 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:23.187 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:24.188 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:25.189 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:26.190 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:27.191 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:28.192 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:29.192 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[INFO] 2023-04-09 19:50:29.353 +0900 org.apache.dolphinscheduler.server.master.service.FailoverService:[55] - [WorkflowInstance-0][TaskInstance-0] - Master failover finished, masterServer: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
[INFO] 2023-04-09 19:50:29.353 +0900 org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener:[63] - [WorkflowInstance-0][TaskInstance-0] - master node added : /nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
[WARN] 2023-04-09 19:50:30.193 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:31.194 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:32.195 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:33.196 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:34.197 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:35.197 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:36.198 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:37.199 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:38.199 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:39.200 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:40.201 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
[WARN] 2023-04-09 19:50:41.201 +0900 org.apache.dolphinscheduler.server.master.runner.MasterSchedulerBootstrap:[226] - [WorkflowInstance-0][TaskInstance-0] - Master count: 0 is invalid, the current slot: 0
What you expected to happen
The reconnected master2 should calculate current slot 2 not 0.
How to reproduce
I’m not sure how to reproduce this problem. But in our envs, this happens once every 3 - 4 days.
Anything else
No response
Version
3.0.x
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 19