dapr: Dapr placement - Raft and Health are not started.

In what area(s)?

/area placement

Ask your question here

Hi,

After upgrading our Kubernetes version, the placement servers are in CrashLoopBackOff.

The logs contain only the following:

time="2022-07-11T13:04:15.082415966Z" level=info msg="starting Dapr Placement Service -- version 1.8.0 -- commit dc7f86840c85a1eff2e1223456994f554ea31d11" instance=dapr-placement-server-0 scope=dapr.placement type=log ver=1.8.0
time="2022-07-11T13:04:15.082798161Z" level=info msg="log level set to: debug" instance=dapr-placement-server-0 scope=dapr.placement type=log ver=1.8.0
time="2022-07-11T13:04:15.082878312Z" level=info msg="metrics server started on :9090/" instance=dapr-placement-server-0 scope=dapr.metrics type=log ver=1.8.0

On my minikube instance, the same configuration works with the same versions of Dapr and Kubernetes.

There, the logs are the following:

time="2022-07-11T13:11:30.2769734Z" level=info msg="starting Dapr Placement Service -- version 1.8.0 -- commit dc7f86840c85a1eff2e1223456994f554ea31d11" instance=dapr-placement-server-0 scope=dapr.placement type=log ver=1.8.0
time="2022-07-11T13:11:30.2771322Z" level=info msg="log level set to: debug" instance=dapr-placement-server-0 scope=dapr.placement type=log ver=1.8.0
time="2022-07-11T13:11:30.2772933Z" level=info msg="metrics server started on :9090/" instance=dapr-placement-server-0 scope=dapr.metrics type=log ver=1.8.0
time="2022-07-11T13:11:30.2789367Z" level=debug msg="initial configuration%!(EXTRA []interface {}=[index 1 servers [%+v [{Voter dapr-placement-server-0 dapr-placement-server-0.dapr-placement-server.dapr-system.svc.cluster.local:8201} {Voter dapr-placement-server-1 dapr-placement-server-1.dapr-placement-server.dapr-system.svc.cluster.local:8201} {Voter dapr-placement-server-2 dapr-placement-server-2.dapr-placement-server.dapr-system.svc.cluster.local:8201}]]])" instance=dapr-placement-server-0 scope=dapr.placement.raft type=log ver=1.8.0
time="2022-07-11T13:11:30.2790101Z" level=info msg="Raft server is starting on dapr-placement-server-0.dapr-placement-server.dapr-system.svc.cluster.local:8201..." instance=dapr-placement-server-0 scope=dapr.placement.raft type=log ver=1.8.0
time="2022-07-11T13:11:30.2790369Z" level=info msg="mTLS enabled, getting tls certificates" instance=dapr-placement-server-0 scope=dapr.placement type=log ver=1.8.0
time="2022-07-11T13:11:30.2790981Z" level=info msg="starting watch for certs on filesystem: /var/run/dapr/credentials" instance=dapr-placement-server-0 scope=dapr.placement type=log ver=1.8.0
time="2022-07-11T13:11:30.2793584Z" level=debug msg="entering follower state%!(EXTRA []interface {}=[follower Node at 172.17.0.12:8201 [Follower] leader ])" instance=dapr-placement-server-0 scope=dapr.placement.raft type=log ver=1.8.0
time="2022-07-11T13:11:30.2793662Z" level=info msg="tls certificates loaded successfully" instance=dapr-placement-server-0 scope=dapr.placement type=log ver=1.8.0
time="2022-07-11T13:11:30.280095Z" level=info msg="placement service started on port 50005" instance=dapr-placement-server-0 scope=dapr.placement type=log ver=1.8.0
time="2022-07-11T13:11:30.2802004Z" level=info msg="Healthz server is listening on :8080" instance=dapr-placement-server-0 scope=dapr.placement type=log ver=1.8.0
time="2022-07-11T13:11:30.7936267Z" level=debug msg="accepted connection%!(EXTRA []interface {}=[local-address 172.17.0.12:8201 remote-address 172.17.0.29:43182])" instance=dapr-placement-server-0 scope=dapr.placement.raft type=log ver=1.8.0
time="2022-07-11T13:11:30.7938483Z" level=debug msg="failed to get previous log%!(EXTRA []interface {}=[previous-index 10 last-index 1 error log not found])" instance=dapr-placement-server-0 scope=dapr.placement.raft type=log ver=1.8.0
time="2022-07-11T13:11:30.8737392Z" level=debug msg="accepted connection%!(EXTRA []interface {}=[local-address 172.17.0.12:8201 remote-address 172.17.0.29:43184])" instance=dapr-placement-server-0 scope=dapr.placement.raft type=log ver=1.8.0

Do you know why this can happen? The Raft server and the health server of the placement service are not started. What could be the cause?
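To make the difference between the two logs explicit, here is a minimal Python sketch that checks which startup messages appear in a placement log. The milestone strings are copied from the working minikube log above; the function names and the choice of milestones are illustrative assumptions, not part of Dapr. It only shows where startup stops, not why.

```python
import re

# Startup messages seen in the working minikube log (assumed to be the
# normal startup sequence for placement 1.8.0; not an official list).
MILESTONES = [
    "starting Dapr Placement Service",
    "metrics server started",
    "Raft server is starting",
    "tls certificates loaded successfully",
    "placement service started",
    "Healthz server is listening",
]

# Matches the msg="..." field of a logfmt-style line.
MSG_RE = re.compile(r'msg="((?:[^"\\]|\\.)*)"')


def messages(log_text):
    """Return the msg field of every log line that has one."""
    return [m.group(1) for m in MSG_RE.finditer(log_text)]


def missing_milestones(log_text):
    """List the milestones that never appear in the given log."""
    msgs = messages(log_text)
    return [m for m in MILESTONES if not any(m in msg for msg in msgs)]


# The three lines logged by the crashing pod, copied from above.
failing_log = '''
time="2022-07-11T13:04:15.082415966Z" level=info msg="starting Dapr Placement Service -- version 1.8.0 -- commit dc7f86840c85a1eff2e1223456994f554ea31d11" instance=dapr-placement-server-0 scope=dapr.placement type=log ver=1.8.0
time="2022-07-11T13:04:15.082798161Z" level=info msg="log level set to: debug" instance=dapr-placement-server-0 scope=dapr.placement type=log ver=1.8.0
time="2022-07-11T13:04:15.082878312Z" level=info msg="metrics server started on :9090/" instance=dapr-placement-server-0 scope=dapr.metrics type=log ver=1.8.0
'''

if __name__ == "__main__":
    for milestone in missing_milestones(failing_log):
        print("never logged:", milestone)
```

On the crashing pod's log, everything from "Raft server is starting" onward is reported as missing, so the process appears to stop before the Raft server is created.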

Thanks

Manu Di Nicola

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 24 (12 by maintainers)

Most upvoted comments

This problem only happens for us (@ManuDinicola and me) on a Kubernetes cluster upgrade performed according to this procedure: https://kubernetes.io/docs/tasks/administer-cluster/cluster-upgrade/

The problem does not occur when performing a rolling reboot of all Kubernetes cluster nodes (with drain/uncordon and a 2-minute wait between node reboots), even if two out of three replicas of dapr-placement-server are temporarily unavailable.
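As background for the "two out of three" observation: Raft needs a majority of voters to elect a leader, so a three-voter cluster (as in the initial configuration logged above) cannot make progress while two voters are down, but it would be expected to recover once a majority is reachable again. A minimal sketch of that arithmetic, with illustrative function names that are not part of Dapr:

```python
def quorum(voters):
    """Smallest number of voters that forms a majority."""
    return voters // 2 + 1


def has_quorum(voters, unavailable):
    """True if the remaining voters can still elect a leader."""
    return voters - unavailable >= quorum(voters)


if __name__ == "__main__":
    # Three placement replicas, as in the logged Raft configuration.
    for down in range(4):
        print(down, "down ->", "quorum" if has_quorum(3, down) else "no quorum")
```

With three replicas, losing one still leaves a quorum of two; losing two does not, until one of them returns.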

Could it be that a Kubernetes cluster upgrade causes a network split-brain between nodes on the old Kubernetes version and nodes on the new one, such that the state of the existing dapr-placement-server Raft cluster is lost? If so, can we reset the dapr-placement-server Raft cluster state as if it were a new Raft cluster?

P.S. Redeploying Dapr from scratch (i.e. removing the dapr-system namespace) does not fix the issue, so it seems we have no way of resetting Dapr's state…

@shubham1172 please can you investigate?

Does it happen on upgrades only?