scylla-operator: Replaced node is not available via CQL and Prometheus ports right after the gossip quarantine for the old IP address ends (happens only without SNI proxy)

Describe the bug

Scenario:

  • Create a 3-node Scylla cluster using scylla-operator
  • Delete the VM that hosts one of the Scylla K8S nodes (the last one in the test run, with index 2)
  • Mark the affected Scylla pod for replacement (a sketch of this step follows the list)
  • Wait for it to join the Scylla cluster
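
How the replacement is triggered is not visible in the logs; a minimal sketch with the kubernetes Python client, assuming the operator's scylla/replace label convention on the member Service (namespace and Service name taken from the logs below), could look like this:

# Sketch only: assumes that labeling the member Service with scylla/replace=""
# is what triggers the operator's replace-node flow for this member.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

namespace = "scylla"
member_service = "sct-cluster-us-east1-b-us-east1-2"  # member visible in the logs below

# Add the replace label; the operator is expected to pick it up and start
# replacing the member backed by the deleted VM.
core.patch_namespaced_service(
    name=member_service,
    namespace=namespace,
    body={"metadata": {"labels": {"scylla/replace": ""}}},
)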

Affected pod logs:

INFO  2022-12-19 21:59:04,114 [shard  0] schema_tables - Creating keyspace keyspace1
INFO  2022-12-19 21:59:04,124 [shard  0] schema_tables - Schema version changed to 5a16863b-efb8-358f-ace8-0ede2e0904d9
INFO  2022-12-19 21:59:04,957 [shard  0] schema_tables - Creating keyspace1.standard1 id=5ab8bcb0-7fe8-11ed-9985-69dbc39b570f version=6e912d9b-41c2-3582-8903-36d96e0d4e8d
INFO  2022-12-19 21:59:04,964 [shard  0] schema_tables - Schema version changed to 19dbec6e-447c-3eb5-9c82-2802326db918

I1219 22:01:02.772084       1 orphanedpv/controller.go:283] "Observed deletion of Node" Node="ip-10-0-8-200.eu-north-1.compute.internal"
Error from server: Get "https://10.0.8.200:10250/containerLogs/scylla/sct-cluster-us-east1-b-us-east1-2/scylla?follow=true&sinceSeconds=30": dial tcp 10.0.8.200:10250: i/o timeout
Error from server (NotFound): pods "ip-10-0-8-200.eu-north-1.compute.internal" not found
Error from server (BadRequest): container "scylla" in pod "sct-cluster-us-east1-b-us-east1-2" is waiting to start: PodInitializing

I1219 22:04:30.230505       1 operator/sidecar.go:147] sidecar version "v1.8.0-alpha.0-162-g7be1034"
...
INFO  2022-12-19 22:06:41,324 [shard  0] storage_service - Node 172.20.136.49 state jump to normal
INFO  2022-12-19 22:06:41,324 [shard  0] storage_service - Set host_id=968276ba-5ee0-45b7-8c3d-2046925f72c1 to be owned by node=172.20.136.49
...
WARN  2022-12-19 22:06:41,391 [shard  0] migration_manager - Can't send migration request: node 172.20.136.49 is down.
...
INFO  2022-12-19 22:06:53,780 [shard  0] storage_service - entering JOINING mode
...
INFO  2022-12-19 22:06:53,800 [shard  0] storage_service - Checking bootstrapping/leaving nodes: ok
INFO  2022-12-19 22:06:53,801 [shard  0] storage_service - Sleeping before replacing 172.20.136.49...
...
WARN  2022-12-19 22:06:56,318 [shard  0] gossip - Fail to send EchoMessage to 172.20.136.49: seastar::rpc::timeout_error (rpc call timed out)
...
I1219 22:07:18.362942       1 sidecar/sync.go:135] "Successfully updated HostID annotation" Service="scylla/sct-cluster-us-east1-b-us-east1-2" HostID="968276ba-5ee0-45b7-8c3d-2046925f72c1"
...
INFO  2022-12-19 22:08:30,828 [shard  0] storage_service - Replace: removing 172.20.136.49 from group 0...
INFO  2022-12-19 22:08:30,828 [shard  0] raft_group_registry - remove_from_group0(172.20.136.49): local RAFT feature disabled, skipping.
INFO  2022-12-19 22:08:30,828 [shard  0] storage_service - Replace: 172.20.136.49 removed from group 0.
INFO  2022-12-19 22:08:30,828 [shard  0] storage_service - Starting to bootstrap...
INFO  2022-12-19 22:08:30,828 [shard  0] storage_service - replace[07e3565a-faa0-46c3-bba7-e20ed47bb7ad]: Started replace operation, replace_nodes={{ 172.20.136.49 -> 172.20.161.146}}, sync_nodes=[172.20.161.146, 172.20.76.129, 172.20.159.21], ignore_nodes=[]
INFO  2022-12-19 22:08:30,829 [shard  0] storage_service - replace[07e3565a-faa0-46c3-bba7-e20ed47bb7ad]: Added replacing_node=172.20.161.146 to replace existing_node=172.20.136.49, coordinator=172.20.161.146
INFO  2022-12-19 22:08:30,829 [shard  0] token_metadata - Added node 172.20.161.146 as pending replacing endpoint which replaces existing node 172.20.136.49
...
INFO  2022-12-19 22:08:45,186 [shard  8] compaction - [Reshape keyspace1.standard1 b48e5e10-7fe9-11ed-ba22-1f228330a7e6] Reshaping  ...
...
INFO  2022-12-19 22:08:46,356 [shard  3] repair - repair[3d15cd69-b580-45e5-9cf6-1177fde99d64]: Started to repair 1 out of 3 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=replace
...
INFO  2022-12-19 22:08:52,867 [shard  0] storage_service - Node 172.20.161.146 state jump to normal
WARN  2022-12-19 22:08:52,867 [shard  0] storage_service - Host ID collision for 968276ba-5ee0-45b7-8c3d-2046925f72c1 between 172.20.136.49 and 172.20.161.146; 172.20.161.146 is the new owner
INFO  2022-12-19 22:08:52,867 [shard  0] token_metadata - Removed node 172.20.161.146 as pending replacing endpoint which replaces existing node 172.20.136.49
INFO  2022-12-19 22:08:52,868 [shard  0] storage_service - Set host_id=968276ba-5ee0-45b7-8c3d-2046925f72c1 to be owned by node=172.20.161.146, existing=172.20.136.49
...
INFO  2022-12-19 22:08:52,895 [shard  0] storage_service - entering NORMAL mode
...
INFO  2022-12-19 22:09:04,929 [shard  0] init - serving
INFO  2022-12-19 22:09:04,929 [shard  0] init - Scylla version 5.1.1-0.20221208.1cfedc5b59f4 initialization completed.
INFO  2022-12-19 22:09:52,911 [shard  0] gossip - 60000 ms elapsed, 172.20.136.49 gossip quarantine over
INFO  2022-12-19 22:09:53,911 [shard  0] gossip - FatClient 172.20.136.49 has been silent for 30000ms, removing from gossip
INFO  2022-12-19 22:10:53,928 [shard  0] gossip - 60000 ms elapsed, 172.20.136.49 gossip quarantine over

Then, SCT/loader logs:

2022-12-19 22:10:13,409 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2022-12-19 22:10:13.408: (PrometheusAlertManagerEvent Severity.NORMAL) period_type=begin event_id=10bb991f-8104-402a-8b46-f53e63810d3d: alert_name=InstanceDown node=10.0.8.142 start=2022-12-19T22:10:04.791Z end=2022-12-19T22:14:04.791Z description=10.0.8.142 has been down for more than 30 seconds. updated=2022-12-19T22:10:09.713Z state=active fingerprint=5ef727f953f696fb labels={'alertname': 'InstanceDown', 'instance': '10.0.8.142', 'job': 'scylla', 'monitor': 'scylla-monitor', 'severity': '2'}
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING > [control connection] Error connecting to 10.0.8.142:9042: < t:2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING > [control connection] Error connecting to 10.0.8.142:9042:
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING > Traceback (most recent call last):
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING >   File "cassandra/cluster.py", line 3606, in cassandra.cluster.ControlConnection._reconnect_internal
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING >   File "cassandra/cluster.py", line 3628, in cassandra.cluster.ControlConnection._try_connect
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING >   File "cassandra/cluster.py", line 1686, in cassandra.cluster.Cluster.connection_factory
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING >   File "cassandra/connection.py", line 875, in cassandra.connection.Connection.factory
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING >   File "/usr/local/lib/python3.10/site-packages/cassandra/io/libevreactor.py", line 267, in __init__
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING >     self._connect_socket()
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING >   File "cassandra/connection.py", line 954, in cassandra.connection.Connection._connect_socket

So we can see that the loader lost connectivity right after the old IP address was removed from gossip at 2022-12-19 22:09:53,911. This problem persisted until the end of the test run.
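
For reference, the connectivity loss can be checked independently of the loader with a small Python-driver probe like the sketch below (10.0.8.142 is the node's service address from the logs; the helper name is just illustrative):

# Hedged sketch: try a CQL connection to the address the loader uses and
# surface the NoHostAvailable error seen after the gossip quarantine.
from cassandra.cluster import Cluster, NoHostAvailable

def probe_cql(ip: str, port: int = 9042, timeout: float = 5.0) -> bool:
    cluster = Cluster([ip], port=port, connect_timeout=timeout)
    try:
        session = cluster.connect()
        session.execute("SELECT release_version FROM system.local")
        return True
    except NoHostAvailable as exc:
        print(f"CQL unreachable on {ip}:{port}: {exc}")
        return False
    finally:
        cluster.shutdown()

probe_cql("10.0.8.142")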

During the test run we had periodic health checks; the last successful one was at 2022-12-19 22:09:18:

2022-12-19 22:09:16,160 > Running command "/usr/bin/nodetool  status "...                           
2022-12-19 22:09:17,034 > total,        599174,    1000,    1000,    1000,     0.6,     0.5,     0.7,     5.6,    12.6,    16.5,  600.0,  0.02990      ,      0,      0,       0,       0,       0,       0    
2022-12-19 22:09:18,537 > Datacenter: us-east1-b                                                    
2022-12-19 22:09:18,537 > ======================                                                    
2022-12-19 22:09:18,537 > Status=Up/Down                                                            
2022-12-19 22:09:18,538 > |/ State=Normal/Leaving/Joining/Moving                                       
2022-12-19 22:09:18,547 > --  Address         Load       Tokens       Owns    Host ID                               Rack
2022-12-19 22:09:18,550 > UN  172.20.159.21   ?          256          ?       6367acd4-8c96-4842-b8a9-8305cbe1053a  us-east1  
2022-12-19 22:09:18,551 > UN  172.20.76.129   ?          256          ?       c647e4a6-d2b4-4f49-919c-922b282c74c0  us-east1  
2022-12-19 22:09:18,553 > UN  172.20.161.146  134.32 MB  256          ?       968276ba-5ee0-45b7-8c3d-2046925f72c1  us-east1  
2022-12-19 22:09:18,553 >                                                                             
2022-12-19 22:09:18,553 > Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
2022-12-19 22:09:18,599 > Command "/usr/bin/nodetool  status " finished with status 0               
2022-12-19 22:09:18,599 > Node sct-cluster-us-east1-b-us-east1-2 [10.0.8.200 | 10.0.11.210] (seed: False): Command '/usr/bin/nodetool  status ' duration -> 2.4390276030003406 s
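
For context, a stripped-down version of such a health check could look like the sketch below (the real SCT health checker does much more; the pod and namespace names match this run):

# Illustrative sketch: run `nodetool status` inside the Scylla container and
# require every listed member to be in UN (Up/Normal) state.
import subprocess

def cluster_is_healthy(namespace: str = "scylla",
                       pod: str = "sct-cluster-us-east1-b-us-east1-2") -> bool:
    out = subprocess.run(
        ["kubectl", "-n", namespace, "exec", pod, "-c", "scylla",
         "--", "nodetool", "status"],
        capture_output=True, text=True, check=True,
    ).stdout
    states = [line.split()[0] for line in out.splitlines()
              if line[:2] in ("UN", "UJ", "UL", "DN", "DJ", "DL")]
    return bool(states) and all(state == "UN" for state in states)

print("healthy" if cluster_is_healthy() else "unhealthy")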

Then, as part of that health check, we constantly got the following error while retrying to get the peers info:

2022-12-19 22:24:40,165 f:decorators.py   l:69   c:sdcm.utils.decorators p:DEBUG > 'get_peers_info': failed with 'NoHostAvailable('Unable to connect to any servers', {'10.0.8.142:9042': OSError(None, "Tried connecting to [('10.0.8.142', 9042)]. Last error: timed out")})', retrying [#0]
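
For reference, the peers table itself can still be inspected from a node that remains reachable, to see whether a stale entry for the replaced IP lingers; a hedged sketch (addresses taken from this run):

# Sketch only: look for a leftover system.peers row for the old node.
from cassandra.cluster import Cluster

OLD_IP = "172.20.136.49"          # the replaced node's old address
REACHABLE_NODE = "172.20.159.21"  # a node that still answers, per nodetool

cluster = Cluster([REACHABLE_NODE])
session = cluster.connect()
try:
    rows = session.execute("SELECT peer, host_id, rpc_address FROM system.peers")
    stale = [row for row in rows if str(row.peer) == OLD_IP]
    print(f"stale peers entries for {OLD_IP}: {stale or 'none'}")
finally:
    cluster.shutdown()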

The Python driver connection attempts also kept failing:

2022-12-19 22:14:49,604 f:decorators.py   l:69   c:sdcm.utils.decorators p:DEBUG > 'cql_connection_patient_exclusive': failed with 'NoHostAvailable('Unable to connect to any servers', {'10.0.8.142:9042': OSError(None, "Tried connecting to [('10.0.8.142', 9042)]. Last error: timed out")})', retrying [#2]

Then, at 2022-12-19 23:31:32, we got a successful health check right after the next round of the scenario, in which we delete the same K8S node again:

2022-12-19 23:26:30,338 f:health_checker.py l:41   c:sdcm.utils.health_checker p:DEBUG > Status for target node sct-cluster-us-east1-b-us-east1-2
2022-12-19 23:26:30,338 f:health_checker.py l:140  c:sdcm.utils.health_checker p:WARNING > SYSTEM.PEERS info is not availble. Search for the warning above. Verify schema version can't be performed 
2022-12-19 23:26:30,338 > Node `sct-cluster-us-east1-b-us-east1-2' is healthy                       
2022-12-19 23:31:30,451 > Running command "/usr/bin/nodetool  status "...                           
2022-12-19 23:31:30.450: (ClusterHealthValidatorEvent Severity.NORMAL) period_type=begin event_id=c035fefe-292c-4c87-8f9e-5f69eb19d155
2022-12-19 23:31:31,008 > total,       5504774,    1000,    1000,    1000,     0.5,     0.5,     0.7,     0.7,     2.6,    12.9, 5505.0,  0.00520,      0,      0,       0,       0,       0,       0 
2022-12-19 23:31:32,008 > total,       5534174,    1000,    1000,    1000,     0.6,     0.5,     0.7,     1.2,    12.0,    16.0, 5535.0,  0.00310,      0,      0,       0,       0,       0,       0
2022-12-19 23:31:32,355 > Datacenter: us-east1-b                                                    
2022-12-19 23:31:32,356 > ======================                                                    
2022-12-19 23:31:32,356 > Status=Up/Down                                                            
2022-12-19 23:31:32,356 > |/ State=Normal/Leaving/Joining/Moving                                    
2022-12-19 23:31:32,360 > --  Address         Load       Tokens       Owns    Host ID                               Rack
2022-12-19 23:31:32,366 > UN  172.20.159.21   1.08 MB    256          ?       6367acd4-8c96-4842-b8a9-8305cbe1053a  us-east1
2022-12-19 23:31:32,368 > UN  172.20.76.129   39.92 MB   256          ?       c647e4a6-d2b4-4f49-919c-922b282c74c0  us-east1  
2022-12-19 23:31:32,369 > UN  172.20.161.146  134.32 MB  256          ?       968276ba-5ee0-45b7-8c3d-2046925f72c1  us-east1  
2022-12-19 23:31:32,369 >                                                                             
2022-12-19 23:31:32,369 > Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
2022-12-19 23:31:32,392 > Command "/usr/bin/nodetool  status " finished with status 0

After that, everything starts misbehaving again in the same way.

Note: if we decommission the affected node and then add a new one, instead of using the replace-node operation, we do not have this problem.
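
If that workaround has to be automated, scaling the rack down by one member and back up through the ScyllaCluster CR is one way to do it; a hedged sketch (the CR name, namespace, and rack index are assumptions based on this test setup):

# Sketch only: read-modify-write the ScyllaCluster CR to change the rack size.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

GROUP, VERSION, PLURAL = "scylla.scylladb.com", "v1", "scyllaclusters"
NAMESPACE, NAME = "scylla", "sct-cluster"  # assumed CR name for this run

def set_rack_members(members: int, rack_index: int = 0) -> None:
    cr = api.get_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL, NAME)
    cr["spec"]["datacenter"]["racks"][rack_index]["members"] = members
    api.replace_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL, NAME, cr)

set_rack_members(2)  # decommission one node
# ... wait for the decommission to finish ...
set_rack_members(3)  # add a fresh node back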

To Reproduce

Steps to reproduce the behavior:

  1. Deploy a 3-node Scylla cluster
  2. Delete the VM that hosts a Scylla K8S pod
  3. Mark the affected pod for replacement
  4. See the error

Expected behavior

Connectivity must work for the new node that replaced the old one.
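
A minimal acceptance check for this would be that both the CQL port (9042) and the Prometheus metrics port (9180 by default in Scylla) accept TCP connections on the replacing node, for example:

# Sketch only: plain TCP-level check of the two ports on the new node.
import socket

def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

new_node_ip = "172.20.161.146"  # the replacing node from the logs above
for port in (9042, 9180):
    print(port, "open" if port_open(new_node_ip, port) else "CLOSED")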

Logs

  • CI job: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-longevity-scylla-operator-3h-eks-terminate-replace/2
  • K8S logs: https://cloudius-jenkins-test.s3.amazonaws.com/aa03c3cd-4751-41a9-aa90-4dcacd09379e/20221220_014120/kubernetes-aa03c3cd.tar.gz
  • Loader logs: https://cloudius-jenkins-test.s3.amazonaws.com/aa03c3cd-4751-41a9-aa90-4dcacd09379e/20221220_014120/loader-set-aa03c3cd.tar.gz
  • DB logs: https://cloudius-jenkins-test.s3.amazonaws.com/aa03c3cd-4751-41a9-aa90-4dcacd09379e/20221220_014120/db-cluster-aa03c3cd.tar.gz
  • SCT (test runner) logs: https://cloudius-jenkins-test.s3.amazonaws.com/aa03c3cd-4751-41a9-aa90-4dcacd09379e/20221220_014120/sct-runner-aa03c3cd.tar.gz

Environment:

  • Platform: EKS
  • Kubernetes version: v1.22.15
  • Scylla version: 5.1.1
  • Scylla-operator version: v1.8.0-alpha.0-162-g7be1034

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 18 (11 by maintainers)

Most upvoted comments

@vponomaryov

Take https://github.com/scylladb/scylla-cluster-tests/pull/5627 and see if it fixes the issue we are seeing here.