scylla-operator: `add node` operation fails with the `already exists, cancelling join` error if a node with the same name was drained, recreated and then decommissioned
Describe the bug
The following nemesis scenario was run in the setup (a rough kubectl sketch follows the list):
- Drain a K8S Scylla node with the `kubectl drain %node-name% --ignore-daemonsets --delete-local-data` command.
- Uncordon that node.
- Decrease the replica number by 1 in the `scyllacluster` object configuration.
- Wait for the pod deletion.
- Increase the replica number by 1 in the `scyllacluster` object configuration.
- Wait for the pod readiness.
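For reference, here is a minimal kubectl sketch of that scenario. The node and object names are illustrative, it assumes the ScyllaCluster has a single rack at `/spec/datacenter/racks/0`, and the actual test drives all of this through SCT's Python code rather than raw kubectl:

```bash
NODE="<k8s-node-hosting-the-scylla-pod>"   # illustrative; taken from the pod's spec.nodeName
NS="scylla"                                # namespace used by the test
SC="sct-cluster"                           # ScyllaCluster object name
POD="sct-cluster-us-east1-b-us-east1-2"    # the member pod affected in this run

# 1. Drain the K8S node that hosts the Scylla pod, then uncordon it
kubectl drain "${NODE}" --ignore-daemonsets --delete-local-data
kubectl uncordon "${NODE}"

# 2. Scale the rack down by one member and wait for the last pod to be
#    decommissioned and deleted
kubectl -n "${NS}" patch scyllacluster "${SC}" --type=json \
  -p '[{"op": "replace", "path": "/spec/datacenter/racks/0/members", "value": 2}]'
kubectl -n "${NS}" wait --for=delete "pod/${POD}" --timeout=20m

# 3. Scale back up and wait for the new pod to become ready -- this is the step
#    that ends with the "already exists, cancelling join" failure
kubectl -n "${NS}" patch scyllacluster "${SC}" --type=json \
  -p '[{"op": "replace", "path": "/spec/datacenter/racks/0/members", "value": 3}]'
kubectl -n "${NS}" wait --for=condition=Ready "pod/${POD}" --timeout=20m
```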
We then see the following in the logs:
functional_tests/scylla_operator/test_functional.py::test_drain_kubernetes_node_then_decommission_and_add_scylla_node < t:2023-05-07 11:19:55,884 f:cluster.py l:3697 c:sdcm.cluster p:INFO > EksScyllaPodCluster functional-master-db-cluster-9dceedfb | Namespace: scylla: (1/3) nodes ready, node Node sct-cluster-us-east1-b-us-east1-0 [10.0.1.230 | 10.0.2.151] (seed: False). Time elapsed: 1 s
...
14:20:13 < t:2023-05-07 11:20:13,177 f:test_functional.py l:240 c:root p:INFO > Drain K8S node that hosts 'Node sct-cluster-us-east1-b-us-east1-2 [10.0.0.82 | 10.0.0.84] (seed: False)' scylla node not waiting for pod absence
...
14:20:29 < t:2023-05-07 11:20:28,438 f:__init__.py l:815 c:sdcm.utils.k8s p:INFO > 'scylla/sct-cluster-us-east1-b-us-east1-2' node has changed it's pod IP address from '10.0.0.84' to '10.0.2.232'. All old IPs: 10.0.1.181, 10.0.0.149, 10.0.3.201, 10.0.0.84
...
INFO 2023-05-07 11:19:34,721 [shard 0] token_metadata - Removed node 172.20.245.195 as pending replacing endpoint which replaces existing node 172.20.25.219
...
INFO 2023-05-07 11:20:13,781 [shard 0] storage_service - Node 172.20.186.151 state jump to normal
...
2023-05-07 11:20:20,515 INFO stopped: rsyslog (exit status 0)
I0507 11:20:20.551734 1 operator/sidecar.go:438] "Scylla process finished"
...
INFO 2023-05-07 11:21:24,737 [shard 9] api - decommission
INFO 2023-05-07 11:21:24,742 [shard 0] storage_service - DECOMMISSIONING: starts
INFO 2023-05-07 11:21:24,743 [shard 0] storage_service - decommission[9eda3ecf-a8b8-4e0e-b241-f65674b86fd3]: Started decommission operation, removing node=172.20.186.151, sync_nodes=[172.20.186.151, 172.20.245.195, 172.20.179.129], ignore_nodes=[]
...
INFO 2023-05-07 11:21:24,879 [shard 0] gossip - Node 172.20.186.151 will be removed from gossip at [2023-05-10 11:21:24]: (expire = 1683717684878095453, now = 1683458484879194748, diff = 259199 seconds)
...
INFO 2023-05-07 11:21:56,645 [shard 0] storage_service - DECOMMISSIONING: done
...
2023-05-07 11:22:03,703 INFO stopped: rsyslog (exit status 0)
I0507 11:22:21.312534 1 operator/sidecar.go:147] sidecar version "v1.9.0-alpha.2-32-g7b1f723"
...
INFO 2023-05-07 11:24:28,906 [shard 0] gossip - InetAddress 172.20.186.151 is now DOWN, status = LEFT
INFO 2023-05-07 11:24:28,908 [shard 0] gossip - InetAddress 172.20.179.129 is now UP, status = NORMAL
INFO 2023-05-07 11:24:28,909 [shard 0] gossip - Node 172.20.186.151 will be removed from gossip at [2023-05-10 11:21:24]: (expire = 1683717684878095453, now = 1683458668909095890, diff = 259015 seconds)
INFO 2023-05-07 11:24:28,910 [shard 0] gossip - InetAddress 172.20.245.195 is now UP, status = NORMAL
...
INFO 2023-05-07 11:25:23,064 [shard 0] storage_service - Checking remote features with gossip
INFO 2023-05-07 11:25:23,064 [shard 0] gossip - Gossip shadow round started with nodes={172.20.148.72, 172.20.245.195, 172.20.179.129}
...
WARN 2023-05-07 11:25:28,064 [shard 0] gossip - The get_endpoint_states verb to node 172.20.148.72 was timeout
INFO 2023-05-07 11:25:28,064 [shard 0] gossip - Gossip shadow round finished with nodes_talked={172.20.245.195, 172.20.179.129}
...
ERROR 2023-05-07 11:25:28,182 [shard 0] init - Startup failed: std::runtime_error (A node with address 172.20.212.218 already exists, cancelling join. Use replace_address if you want to replace this node.)
2023-05-07 11:25:28,317 INFO exited: scylla (exit status 1; not expected)
Services (all in the `scylla` namespace, type ClusterIP, no external IP, exposing 7000/TCP,7001/TCP,7199/TCP,10001/TCP,9180/TCP,5090/TCP,9100/TCP,9042/TCP,9142/TCP,19042/TCP,19142/TCP,9160/TCP):
- sct-cluster-client: 172.20.150.72, age 179m, selector app.kubernetes.io/managed-by=scylla-operator,app.kubernetes.io/name=scylla,app=scylla,scylla/cluster=sct-cluster
- sct-cluster-us-east1-b-us-east1-0: 172.20.179.129, age 179m, selector statefulset.kubernetes.io/pod-name=sct-cluster-us-east1-b-us-east1-0
- sct-cluster-us-east1-b-us-east1-1: 172.20.245.195, age 151m, selector statefulset.kubernetes.io/pod-name=sct-cluster-us-east1-b-us-east1-1
- sct-cluster-us-east1-b-us-east1-2: 172.20.212.218, age 144m, selector statefulset.kubernetes.io/pod-name=sct-cluster-us-east1-b-us-east1-2
To Reproduce
Steps to reproduce the behavior:
- Deploy the operator
- Deploy Scylla with 3+ nodes
- Perform the steps described above
- See the error
Expected behavior
The addition of a new node should succeed.
Logs
- ci-job-link: https://jenkins.scylladb.com/job/scylla-operator/job/operator-master/job/functional/job/functional-eks/72
- db-cluster-logs: https://cloudius-jenkins-test.s3.amazonaws.com/9dceedfb-66ee-4312-8480-3f654b13fd9a/20230507_134805/db-cluster-9dceedfb.tar.gz
- sct-runner-logs: https://cloudius-jenkins-test.s3.amazonaws.com/9dceedfb-66ee-4312-8480-3f654b13fd9a/20230507_134805/sct-9dceedfb.log.tar.gz
- k8s-cluster-logs: https://cloudius-jenkins-test.s3.amazonaws.com/9dceedfb-66ee-4312-8480-3f654b13fd9a/20230507_134805/kubernetes-9dceedfb.tar.gz
Environment:
- Platform: EKS
- Kubernetes version: v1.24.13-eks-0a21954
- Scylla version: 2023.1.0-rc5
- Scylla-operator version: v1.9.0-alpha.2-32-g7b1f723
Opened an issue in the scylladb repo describing the root cause and providing a reproducer: https://github.com/scylladb/scylladb/issues/14991
Here's what's happening as I understood it.
The node that we decommissioned had earlier gone through a replace operation. The replaced node 172.20.148.72 is quarantined in gossiper. But for some reason that I don't understand, `handle_state_normal` gets called for the old IP again. This is happening every 2 minutes, over and over again; the old IP just doesn't want to go away.
…
and so on.
cc @asias @bhalevy
Eventually, the new node 172.20.186.151 decommissions. And now the fun part: the old IP 172.20.148.72 gets resurrected again, obviously, as it was continuously getting resurrected every 2 minutes. But this time there is no new IP 172.20.186.151 with the same tokens, so other nodes cannot observe the conflict and quarantine the old IP again, because 172.20.186.151 has just decommissioned. So the old IP 172.20.148.72 gets resurrected as a NORMAL node 😃 (actually `shutdown`, but it's pretty much the same thing; what matters is that it is now considered a peer, part of the token ring etc.). And we have 3 NORMAL members of the cluster, where there should be 2.
This looks like a serious bug in Scylla.
The question is: why can’t the node which got replaced GO AWAY? Why is it getting “resurrected” every 2 minutes? @bhalevy @asias any ideas?
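If one wanted to confirm that extra NORMAL member from the outside, a rough check could look like the following (assuming the Scylla container in the pod is named `scylla`; the interpretation of the output is mine, not taken from the attached logs):

```bash
# Ring membership as seen by a live node -- the replaced IP 172.20.148.72
# should be gone, but shows up as a member again:
kubectl -n scylla exec sct-cluster-us-east1-b-us-east1-0 -c scylla -- nodetool status

# Gossip application states (STATUS, TOKENS, generation) for every endpoint
# this node knows about:
kubectl -n scylla exec sct-cluster-us-east1-b-us-east1-0 -c scylla -- nodetool gossipinfo
```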
As for the original issue: looking at the logs from
https://cloudius-jenkins-test.s3.amazonaws.com/9dceedfb-66ee-4312-8480-3f654b13fd9a/20230507_134805/db-cluster-9dceedfb.tar.gz
it's clear that everything went fine until the number of replicas was increased again. The node sct-cluster-us-east1-b-us-east1-2 (172.20.186.151) decommissioned correctly and was created again with a new broadcast address, 172.20.212.218. The bootstrap process looks fine until it suddenly starts shutting down, only to then fail with the `already exists, cancelling join` error quoted above.
The process is then restarted by supervisord and then, obviously, fails with the same error.
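That restart loop can be seen in the failing member's logs; a rough check (again assuming the container is named `scylla`; not part of the original report) would be:

```bash
# Count how many bootstrap attempts have failed with the same error so far:
kubectl -n scylla logs sct-cluster-us-east1-b-us-east1-2 -c scylla \
  | grep -c "already exists, cancelling join"
```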
The root cause of the issue seems to be on Scylla's side then; there is no way we can prevent the above exception. And I believe it shouldn't result in having to start again with `replace_address` if the bootstrap fails.
As for the `cancelling join` error: we could work around it by fixing https://github.com/scylladb/scylla-operator/issues/835 and always starting the nodes with `replace_address_first_boot` (we even had an attempt at it: https://github.com/scylladb/scylla-operator/pull/721; unfinished and probably outdated by now). https://github.com/scylladb/scylla-operator/issues/835 should be fixed when we have the container split and start the Scylla process ourselves. We'll be moving away from IPs as identity soon though, so starting with `replace_address_first_boot` seems like a wasted effort now, and it won't be possible with UUIDs unless we generate them ourselves. @tnozicka should we take any steps to mitigate this now?
On a side note, I wasn't able to reproduce it in EKS myself.
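For context, the workaround discussed above amounts to passing the member's previous address to Scylla on first boot of the re-created pod. A sketch only: the flag mirrors the `replace_address_first_boot` config option, the IP value is illustrative, and recording that old address before the pod is re-created is exactly what the linked issue/PR were about.

```bash
# Previous address of this member -- illustrative value; in practice it would
# have to be persisted by the operator before the member is re-created.
OLD_IP="172.20.212.218"

# Passed on the Scylla command line (only honored on the node's first boot),
# in addition to whatever flags the operator already passes:
scylla --replace-address-first-boot "${OLD_IP}"

# Or equivalently in scylla.yaml:
#   replace_address_first_boot: 172.20.212.218
```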
fyi @vponomaryov
@rzetelskik filed the bug against the core: https://github.com/scylladb/scylladb/issues/14184