scylla-operator: `add node` operation fails with the `already exists, cancelling join` error if a node with its name was drained, recreated and then decommissioned

Describe the bug

The following nemesis scenario was run in the setup (a rough standalone sketch of these steps follows the list):

  • Drain a K8S Scylla node with the kubectl drain %node-name% --ignore-daemonsets --delete-local-data command.
  • Uncordon that node
  • Decrease the replica number by 1 in the scyllacluster object configuration
  • Wait for the pod deletion
  • Increase the replica number by 1 in the scyllacluster object configuration
  • Wait for the pod readiness
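
For reference, here is a rough, hypothetical sketch of these steps driving kubectl from Python; it is not the SCT test code. The cluster, namespace and pod names are taken from the logs below; the K8S node name, the timeouts and the single-rack members field path are assumptions.

# Rough, hypothetical sketch of the nemesis steps; not the SCT test code.
# Assumes a single-rack ScyllaCluster "sct-cluster" in the "scylla" namespace.
import json
import subprocess

def kubectl(*args: str) -> str:
    return subprocess.run(["kubectl", *args], check=True,
                          capture_output=True, text=True).stdout

def set_members(count: int) -> None:
    # Field path assumed for a single-rack ScyllaCluster.
    patch = [{"op": "replace", "path": "/spec/datacenter/racks/0/members", "value": count}]
    kubectl("-n", "scylla", "patch", "scyllacluster", "sct-cluster",
            "--type=json", "-p", json.dumps(patch))

k8s_node = "ip-10-0-0-82.ec2.internal"     # hypothetical K8S node hosting the Scylla pod
pod = "sct-cluster-us-east1-b-us-east1-2"  # pod name as seen in the logs

kubectl("drain", k8s_node, "--ignore-daemonsets", "--delete-local-data")
kubectl("uncordon", k8s_node)
set_members(2)                             # decrease the replica number by 1
kubectl("-n", "scylla", "wait", "--for=delete", f"pod/{pod}", "--timeout=15m")
set_members(3)                             # increase the replica number by 1
kubectl("-n", "scylla", "wait", "--for=condition=Ready", f"pod/{pod}", "--timeout=15m")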

We see the following in the logs:

functional_tests/scylla_operator/test_functional.py::test_drain_kubernetes_node_then_decommission_and_add_scylla_node < t:2023-05-07 11:19:55,884 f:cluster.py      l:3697 c:sdcm.cluster         p:INFO  > EksScyllaPodCluster functional-master-db-cluster-9dceedfb | Namespace: scylla: (1/3) nodes ready, node Node sct-cluster-us-east1-b-us-east1-0 [10.0.1.230 | 10.0.2.151] (seed: False). Time elapsed: 1 s
...
14:20:13  < t:2023-05-07 11:20:13,177 f:test_functional.py l:240  c:root                 p:INFO  > Drain K8S node that hosts 'Node sct-cluster-us-east1-b-us-east1-2 [10.0.0.82 | 10.0.0.84] (seed: False)' scylla node not waiting for pod absence
...
14:20:29  < t:2023-05-07 11:20:28,438 f:__init__.py     l:815  c:sdcm.utils.k8s       p:INFO  > 'scylla/sct-cluster-us-east1-b-us-east1-2' node has changed it's pod IP address from '10.0.0.84' to '10.0.2.232'. All old IPs: 10.0.1.181, 10.0.0.149, 10.0.3.201, 10.0.0.84
...
INFO  2023-05-07 11:19:34,721 [shard  0] token_metadata - Removed node 172.20.245.195 as pending replacing endpoint which replaces existing node 172.20.25.219
...
INFO  2023-05-07 11:20:13,781 [shard  0] storage_service - Node 172.20.186.151 state jump to normal
...
2023-05-07 11:20:20,515 INFO stopped: rsyslog (exit status 0)
I0507 11:20:20.551734       1 operator/sidecar.go:438] "Scylla process finished"
...
INFO  2023-05-07 11:21:24,737 [shard  9] api - decommission
INFO  2023-05-07 11:21:24,742 [shard  0] storage_service - DECOMMISSIONING: starts
INFO  2023-05-07 11:21:24,743 [shard  0] storage_service - decommission[9eda3ecf-a8b8-4e0e-b241-f65674b86fd3]: Started decommission operation, removing node=172.20.186.151, sync_nodes=[172.20.186.151, 172.20.245.195, 172.20.179.129], ignore_nodes=[]
...
INFO  2023-05-07 11:21:24,879 [shard  0] gossip - Node 172.20.186.151 will be removed from gossip at [2023-05-10 11:21:24]: (expire = 1683717684878095453, now = 1683458484879194748, diff = 259199 seconds)
...
INFO  2023-05-07 11:21:56,645 [shard  0] storage_service - DECOMMISSIONING: done
...
2023-05-07 11:22:03,703 INFO stopped: rsyslog (exit status 0)
I0507 11:22:21.312534       1 operator/sidecar.go:147] sidecar version "v1.9.0-alpha.2-32-g7b1f723"
...
INFO  2023-05-07 11:24:28,906 [shard  0] gossip - InetAddress 172.20.186.151 is now DOWN, status = LEFT
INFO  2023-05-07 11:24:28,908 [shard  0] gossip - InetAddress 172.20.179.129 is now UP, status = NORMAL
INFO  2023-05-07 11:24:28,909 [shard  0] gossip - Node 172.20.186.151 will be removed from gossip at [2023-05-10 11:21:24]: (expire = 1683717684878095453, now = 1683458668909095890, diff = 259015 seconds)
INFO  2023-05-07 11:24:28,910 [shard  0] gossip - InetAddress 172.20.245.195 is now UP, status = NORMAL
...
INFO  2023-05-07 11:25:23,064 [shard  0] storage_service - Checking remote features with gossip
INFO  2023-05-07 11:25:23,064 [shard  0] gossip - Gossip shadow round started with nodes={172.20.148.72, 172.20.245.195, 172.20.179.129}
...
WARN  2023-05-07 11:25:28,064 [shard  0] gossip - The get_endpoint_states verb to node 172.20.148.72 was timeout
INFO  2023-05-07 11:25:28,064 [shard  0] gossip - Gossip shadow round finished with nodes_talked={172.20.245.195, 172.20.179.129}
...
ERROR 2023-05-07 11:25:28,182 [shard  0] init - Startup failed: std::runtime_error (A node with address 172.20.212.218 already exists, cancelling join. Use replace_address if you want to replace this node.)
2023-05-07 11:25:28,317 INFO exited: scylla (exit status 1; not expected)

Services:

NAMESPACE   NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                                                                              AGE    SELECTOR
scylla      sct-cluster-client                  ClusterIP   172.20.150.72    <none>        7000/TCP,7001/TCP,7199/TCP,10001/TCP,9180/TCP,5090/TCP,9100/TCP,9042/TCP,9142/TCP,19042/TCP,19142/TCP,9160/TCP   179m   app.kubernetes.io/managed-by=scylla-operator,app.kubernetes.io/name=scylla,app=scylla,scylla/cluster=sct-cluster
scylla      sct-cluster-us-east1-b-us-east1-0   ClusterIP   172.20.179.129   <none>        7000/TCP,7001/TCP,7199/TCP,10001/TCP,9180/TCP,5090/TCP,9100/TCP,9042/TCP,9142/TCP,19042/TCP,19142/TCP,9160/TCP   179m   statefulset.kubernetes.io/pod-name=sct-cluster-us-east1-b-us-east1-0
scylla      sct-cluster-us-east1-b-us-east1-1   ClusterIP   172.20.245.195   <none>        7000/TCP,7001/TCP,7199/TCP,10001/TCP,9180/TCP,5090/TCP,9100/TCP,9042/TCP,9142/TCP,19042/TCP,19142/TCP,9160/TCP   151m   statefulset.kubernetes.io/pod-name=sct-cluster-us-east1-b-us-east1-1
scylla      sct-cluster-us-east1-b-us-east1-2   ClusterIP   172.20.212.218   <none>        7000/TCP,7001/TCP,7199/TCP,10001/TCP,9180/TCP,5090/TCP,9100/TCP,9042/TCP,9142/TCP,19042/TCP,19142/TCP,9160/TCP   144m   statefulset.kubernetes.io/pod-name=sct-cluster-us-east1-b-us-east1-2

To Reproduce

Steps to reproduce the behavior:

  1. Deploy operator
  2. Deploy scylla with 3+ nodes
  3. Perform the above-mentioned steps
  4. See error

Expected behavior

Adding a new node should succeed.

Logs

  • ci-job-link: https://jenkins.scylladb.com/job/scylla-operator/job/operator-master/job/functional/job/functional-eks/72
  • db-cluster-logs: https://cloudius-jenkins-test.s3.amazonaws.com/9dceedfb-66ee-4312-8480-3f654b13fd9a/20230507_134805/db-cluster-9dceedfb.tar.gz
  • sct-runner-logs: https://cloudius-jenkins-test.s3.amazonaws.com/9dceedfb-66ee-4312-8480-3f654b13fd9a/20230507_134805/sct-9dceedfb.log.tar.gz
  • k8s-cluster-logs: https://cloudius-jenkins-test.s3.amazonaws.com/9dceedfb-66ee-4312-8480-3f654b13fd9a/20230507_134805/kubernetes-9dceedfb.tar.gz

Environment:

  • Platform: EKS
  • Kubernetes version: v1.24.13-eks-0a21954
  • Scylla version: 2023.1.0-rc5
  • Scylla-operator version: v1.9.0-alpha.2-32-g7b1f723

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 20 (12 by maintainers)

Most upvoted comments

Opened issue in scylladb repo describing the root cause of the issue and providing a reproducer. https://github.com/scylladb/scylladb/issues/14991

Here’s what’s happening as I understood it.

The node that we decommissioned had earlier gone through a replace operation:

INFO  2023-05-07 11:11:40,576 [shard  0] storage_service - Host 4d43a66a-414a-46de-b9ef-0e9349fb2e1d/172.20.186.151 is replacing ddeee333-e9cb-4ba2-a9ff-c0ce2f832ffe/172.20.148.72

The replaced node 172.20.148.72 is quarantined in the gossiper. But for some reason that I don't understand, every two minutes:

  • the old IP gets unquarantined
  • handle_state_normal gets called for the old IP again
  • we notice a token conflict and again quarantine the old IP:
INFO  2023-05-07 11:14:40,663 [shard  0] gossip - 60000 ms elapsed, 172.20.148.72 gossip quarantine over
INFO  2023-05-07 11:14:40,817 [shard  0] storage_service - Set host_id=ddeee333-e9cb-4ba2-a9ff-c0ce2f832ffe to be owned by node=172.20.148.72
INFO  2023-05-07 11:14:40,817 [shard  0] storage_service - handle_state_normal: Nodes 172.20.148.72 and 172.20.186.151 have the same token -9008303828611985688. Ignoring 172.20.148.72

This is happening every 2 minutes over and over again. For some reason the old IP just doesn’t want to go away.

INFO  2023-05-07 11:16:42,677 [shard  0] gossip - 60000 ms elapsed, 172.20.148.72 gossip quarantine over
INFO  2023-05-07 11:16:42,825 [shard  0] storage_service - Set host_id=ddeee333-e9cb-4ba2-a9ff-c0ce2f832ffe to be owned by node=172.20.148.72
INFO  2023-05-07 11:16:42,826 [shard  0] storage_service - handle_state_normal: Nodes 172.20.148.72 and 172.20.186.151 have the same token -9008303828611985688. Ignoring 172.20.148.72

INFO  2023-05-07 11:18:44,699 [shard  0] gossip - 60000 ms elapsed, 172.20.148.72 gossip quarantine over
INFO  2023-05-07 11:18:44,844 [shard  0] storage_service - Set host_id=ddeee333-e9cb-4ba2-a9ff-c0ce2f832ffe to be owned by node=172.20.148.72
INFO  2023-05-07 11:18:44,844 [shard  0] storage_service - handle_state_normal: Nodes 172.20.148.72 and 172.20.186.151 have the same token -9008303828611985688. Ignoring 172.20.148.72

and so on.

cc @asias @bhalevy

Eventually, the new node 172.20.186.151 decommissions.

INFO  2023-05-07 11:21:24,742 [shard  0] storage_service - DECOMMISSIONING: starts
...
INFO  2023-05-07 11:21:56,645 [shard  0] storage_service - DECOMMISSIONING: done

And now the fun part. The old IP 172.20.148.72 gets resurrected again, as it has been every 2 minutes.

But this time there is no new IP 172.20.186.151 holding the same tokens, so the other nodes cannot observe the conflict and quarantine the old IP again, because 172.20.186.151 has just decommissioned.

So the old IP 172.20.148.72 gets resurrected as a NORMAL node 😃 (actually shutdown, but it’s pretty much the same thing; what matters is that it is now considered a peer, part of the token ring etc.)

INFO  2023-05-07 11:21:48,888 [shard  0] gossip - 60000 ms elapsed, 172.20.148.72 gossip quarantine over
INFO  2023-05-07 11:21:49,439 [shard  0] storage_service - Set host_id=ddeee333-e9cb-4ba2-a9ff-c0ce2f832ffe to be owned by node=172.20.148.72
INFO  2023-05-07 11:21:49,454 [shard  0] gossip - InetAddress 172.20.148.72 is now DOWN, status = shutdown

And we have 3 NORMAL members of the cluster, where there should be 2.
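
To make the sequence concrete, here is a toy model (plain Python, not Scylla code) of the state transitions described above. It bakes in the simplifying assumptions that the old endpoint's stale gossip state is re-applied on every quarantine expiry, and that a token conflict with a live node is the only thing that re-quarantines it:

# Toy model of the sequence described above; not Scylla code.
QUARANTINE_S = 60

class ToyRing:
    def __init__(self):
        self.token_owner = {}   # token -> endpoint currently considered NORMAL
        self.quarantine = {}    # endpoint -> seconds of quarantine left

    def replace(self, old_ep, new_ep, token):
        self.token_owner[token] = new_ep
        self.quarantine[old_ep] = QUARANTINE_S

    def decommission(self, ep):
        self.token_owner = {t: o for t, o in self.token_owner.items() if o != ep}

    def handle_state_normal(self, ep, token):
        owner = self.token_owner.get(token)
        if owner is not None and owner != ep:
            # Conflict with the replacing node: ignore the old IP and re-quarantine it.
            self.quarantine[ep] = QUARANTINE_S
            print(f"conflict on {token}: ignoring {ep}, re-quarantining")
        else:
            # No conflict left (the replacing node decommissioned): old IP comes back.
            self.token_owner[token] = ep
            print(f"no conflict: {ep} resurrected as NORMAL")

    def tick(self, seconds, stale_gossip):
        # stale_gossip: endpoint -> token still advertised for the old node
        for ep in list(self.quarantine):
            self.quarantine[ep] -= seconds
            if self.quarantine[ep] <= 0:
                del self.quarantine[ep]
                self.handle_state_normal(ep, stale_gossip[ep])

ring = ToyRing()
token = -9008303828611985688
ring.replace("172.20.148.72", "172.20.186.151", token)
stale = {"172.20.148.72": token}
ring.tick(60, stale)                    # quarantine expires -> conflict -> re-quarantined
ring.decommission("172.20.186.151")     # the new node leaves the ring
ring.tick(60, stale)                    # quarantine expires -> no conflict -> resurrected
print(ring.token_owner)                 # {-9008303828611985688: '172.20.148.72'}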

This looks like a serious bug in Scylla.

The question is: why can’t the node which got replaced GO AWAY? Why is it getting “resurrected” every 2 minutes? @bhalevy @asias any ideas?

As for the original issue: looking at the logs from https://cloudius-jenkins-test.s3.amazonaws.com/9dceedfb-66ee-4312-8480-3f654b13fd9a/20230507_134805/db-cluster-9dceedfb.tar.gz it's clear that everything went fine until the number of replicas was increased again. The node sct-cluster-us-east1-b-us-east1-2 (172.20.186.151) decommissioned correctly and was created again with a new broadcast address, 172.20.212.218. The bootstrap process looks fine until

INFO  2023-05-07 11:25:19,387 [shard  0] cdc - Inserting new generation data at UUID c3e0a98b-6e65-439d-a1dd-da2808c552b1
INFO  2023-05-07 11:25:19,395 [shard  0] init - Shutting down group 0 service
INFO  2023-05-07 11:25:19,395 [shard  0] init - Shutting down group 0 service was successful
INFO  2023-05-07 11:25:19,395 [shard  0] init - Shutting down storage service notifications
INFO  2023-05-07 11:25:19,395 [shard  0] init - Shutting down storage service notifications was successful
INFO  2023-05-07 11:25:19,395 [shard  0] init - Shutting down system distributed keyspace
INFO  2023-05-07 11:25:19,396 [shard  0] init - Shutting down system distributed keyspace was successful
INFO  2023-05-07 11:25:19,396 [shard  0] init - Shutting down migration manager notifications
...

when it suddenly starts shutting down and then fails with the following error:

ERROR 2023-05-07 11:25:19,794 [shard  0] init - Startup failed: exceptions::unavailable_exception (Cannot achieve consistency level for cl ALL. Requires 3, alive 2)

The process is then restarted by supervisord and obviously fails with:

ERROR 2023-05-07 11:25:28,182 [shard  0] init - Startup failed: std::runtime_error (A node with address 172.20.212.218 already exists, cancelling join. Use replace_address if you want to replace this node.)

The root cause of the issue seems to be on Scylla's side, then; there is no way we can prevent the above exception. And I believe a failed bootstrap shouldn't force the node to start again with replace_address.

As for the cancelling join error: we could work around it by fixing https://github.com/scylladb/scylla-operator/issues/835 and always starting the nodes with replace_address_first_boot (we even had an attempt at it, https://github.com/scylladb/scylla-operator/pull/721, but it is unfinished and probably outdated by now). https://github.com/scylladb/scylla-operator/issues/835 should be fixed once we have the container split and start the Scylla process ourselves. We'll be moving away from IPs as identity soon, though, so starting with replace_address_first_boot seems like wasted effort now, and it won't be possible with UUIDs unless we generate them ourselves. @tnozicka should we take any steps to mitigate this now?
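
For illustration only (this is not what the operator does today, and the approach in https://github.com/scylladb/scylla-operator/pull/721 may differ): a sidecar-style wrapper could inject the documented replace_address_first_boot option before starting Scylla on a re-bootstrap. A hypothetical sketch, where the config path, the env var and the decision logic are all assumptions:

# Hypothetical sketch only; not current operator behavior.
# replace_address_first_boot is a documented Scylla config option; everything
# else here (path, env var, decision logic) is an assumption for illustration.
import os
from typing import Optional

def maybe_set_replace_address(previous_ip: Optional[str],
                              conf: str = "/etc/scylla/scylla.yaml") -> None:
    # Only append the option when the pod is known to have had a previous,
    # conflicting broadcast address; do nothing otherwise.
    if previous_ip:
        with open(conf, "a") as f:
            f.write(f"\nreplace_address_first_boot: {previous_ip}\n")

maybe_set_replace_address(os.environ.get("PREVIOUS_BROADCAST_ADDRESS"))  # hypothetical env var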

On a side note, I wasn’t able to reproduce it in EKS myself.

fyi @vponomaryov