scylla-operator: Replaced node is not available via CQL and Prometheus ports right after the gossip quarantine for the old IP address ends (happens only without SNI proxy)

Describe the bug

Scenario:

  • Create a 3-node Scylla cluster using scylla-operator
  • Delete the VM that hosts one of the Scylla K8S nodes (the last one in the test run, with index 2)
  • Mark the affected Scylla pod for replacement (a sketch of this step follows the list)
  • Wait for it to join the Scylla cluster
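
How the replacement is triggered is not visible in the logs; a minimal sketch with the kubernetes Python client, assuming the operator's scylla/replace label convention on the member Service (namespace and Service name taken from the logs below), could look like this:

# Sketch only: assumes that labeling the member Service with scylla/replace=""
# is what triggers the operator's replace-node flow for this member.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

namespace = "scylla"
member_service = "sct-cluster-us-east1-b-us-east1-2"  # member visible in the logs below

# Add the replace label; the operator is expected to pick it up and start
# replacing the member backed by the deleted VM.
core.patch_namespaced_service(
    name=member_service,
    namespace=namespace,
    body={"metadata": {"labels": {"scylla/replace": ""}}},
)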

Affected pod logs:

INFO  2022-12-19 21:59:04,114 [shard  0] schema_tables - Creating keyspace keyspace1
INFO  2022-12-19 21:59:04,124 [shard  0] schema_tables - Schema version changed to 5a16863b-efb8-358f-ace8-0ede2e0904d9
INFO  2022-12-19 21:59:04,957 [shard  0] schema_tables - Creating keyspace1.standard1 id=5ab8bcb0-7fe8-11ed-9985-69dbc39b570f version=6e912d9b-41c2-3582-8903-36d96e0d4e8d
INFO  2022-12-19 21:59:04,964 [shard  0] schema_tables - Schema version changed to 19dbec6e-447c-3eb5-9c82-2802326db918

I1219 22:01:02.772084       1 orphanedpv/controller.go:283] "Observed deletion of Node" Node="ip-10-0-8-200.eu-north-1.compute.internal"
Error from server: Get "https://10.0.8.200:10250/containerLogs/scylla/sct-cluster-us-east1-b-us-east1-2/scylla?follow=true&sinceSeconds=30": dial tcp 10.0.8.200:10250: i/o timeout
Error from server (NotFound): pods "ip-10-0-8-200.eu-north-1.compute.internal" not found
Error from server (BadRequest): container "scylla" in pod "sct-cluster-us-east1-b-us-east1-2" is waiting to start: PodInitializing

I1219 22:04:30.230505       1 operator/sidecar.go:147] sidecar version "v1.8.0-alpha.0-162-g7be1034"
...
INFO  2022-12-19 22:06:41,324 [shard  0] storage_service - Node 172.20.136.49 state jump to normal
INFO  2022-12-19 22:06:41,324 [shard  0] storage_service - Set host_id=968276ba-5ee0-45b7-8c3d-2046925f72c1 to be owned by node=172.20.136.49
...
WARN  2022-12-19 22:06:41,391 [shard  0] migration_manager - Can't send migration request: node 172.20.136.49 is down.
...
INFO  2022-12-19 22:06:53,780 [shard  0] storage_service - entering JOINING mode
...
INFO  2022-12-19 22:06:53,800 [shard  0] storage_service - Checking bootstrapping/leaving nodes: ok
INFO  2022-12-19 22:06:53,801 [shard  0] storage_service - Sleeping before replacing 172.20.136.49...
...
WARN  2022-12-19 22:06:56,318 [shard  0] gossip - Fail to send EchoMessage to 172.20.136.49: seastar::rpc::timeout_error (rpc call timed out)
...
I1219 22:07:18.362942       1 sidecar/sync.go:135] "Successfully updated HostID annotation" Service="scylla/sct-cluster-us-east1-b-us-east1-2" HostID="968276ba-5ee0-45b7-8c3d-2046925f72c1"
...
INFO  2022-12-19 22:08:30,828 [shard  0] storage_service - Replace: removing 172.20.136.49 from group 0...
INFO  2022-12-19 22:08:30,828 [shard  0] raft_group_registry - remove_from_group0(172.20.136.49): local RAFT feature disabled, skipping.
INFO  2022-12-19 22:08:30,828 [shard  0] storage_service - Replace: 172.20.136.49 removed from group 0.
INFO  2022-12-19 22:08:30,828 [shard  0] storage_service - Starting to bootstrap...
INFO  2022-12-19 22:08:30,828 [shard  0] storage_service - replace[07e3565a-faa0-46c3-bba7-e20ed47bb7ad]: Started replace operation, replace_nodes={{ 172.20.136.49 -> 172.20.161.146}}, sync_nodes=[172.20.161.146, 172.20.76.129, 172.20.159.21], ignore_nodes=[]
INFO  2022-12-19 22:08:30,829 [shard  0] storage_service - replace[07e3565a-faa0-46c3-bba7-e20ed47bb7ad]: Added replacing_node=172.20.161.146 to replace existing_node=172.20.136.49, coordinator=172.20.161.146
INFO  2022-12-19 22:08:30,829 [shard  0] token_metadata - Added node 172.20.161.146 as pending replacing endpoint which replaces existing node 172.20.136.49
...
INFO  2022-12-19 22:08:45,186 [shard  8] compaction - [Reshape keyspace1.standard1 b48e5e10-7fe9-11ed-ba22-1f228330a7e6] Reshaping  ...
...
INFO  2022-12-19 22:08:46,356 [shard  3] repair - repair[3d15cd69-b580-45e5-9cf6-1177fde99d64]: Started to repair 1 out of 3 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=replace
...
INFO  2022-12-19 22:08:52,867 [shard  0] storage_service - Node 172.20.161.146 state jump to normal
WARN  2022-12-19 22:08:52,867 [shard  0] storage_service - Host ID collision for 968276ba-5ee0-45b7-8c3d-2046925f72c1 between 172.20.136.49 and 172.20.161.146; 172.20.161.146 is the new owner
INFO  2022-12-19 22:08:52,867 [shard  0] token_metadata - Removed node 172.20.161.146 as pending replacing endpoint which replaces existing node 172.20.136.49
INFO  2022-12-19 22:08:52,868 [shard  0] storage_service - Set host_id=968276ba-5ee0-45b7-8c3d-2046925f72c1 to be owned by node=172.20.161.146, existing=172.20.136.49
...
INFO  2022-12-19 22:08:52,895 [shard  0] storage_service - entering NORMAL mode
...
INFO  2022-12-19 22:09:04,929 [shard  0] init - serving
INFO  2022-12-19 22:09:04,929 [shard  0] init - Scylla version 5.1.1-0.20221208.1cfedc5b59f4 initialization completed.
INFO  2022-12-19 22:09:52,911 [shard  0] gossip - 60000 ms elapsed, 172.20.136.49 gossip quarantine over
INFO  2022-12-19 22:09:53,911 [shard  0] gossip - FatClient 172.20.136.49 has been silent for 30000ms, removing from gossip
INFO  2022-12-19 22:10:53,928 [shard  0] gossip - 60000 ms elapsed, 172.20.136.49 gossip quarantine over

Then, SCT/loader logs:

2022-12-19 22:10:13,409 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2022-12-19 22:10:13.408: (PrometheusAlertManagerEvent Severity.NORMAL) period_type=begin event_id=10bb991f-8104-402a-8b46-f53e63810d3d: alert_name=InstanceDown node=10.0.8.142 start=2022-12-19T22:10:04.791Z end=2022-12-19T22:14:04.791Z description=10.0.8.142 has been down for more than 30 seconds. updated=2022-12-19T22:10:09.713Z state=active fingerprint=5ef727f953f696fb labels={'alertname': 'InstanceDown', 'instance': '10.0.8.142', 'job': 'scylla', 'monitor': 'scylla-monitor', 'severity': '2'}
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING > [control connection] Error connecting to 10.0.8.142:9042: < t:2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING > [control connection] Error connecting to 10.0.8.142:9042:
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING > Traceback (most recent call last):
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING >   File "cassandra/cluster.py", line 3606, in cassandra.cluster.ControlConnection._reconnect_internal
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING >   File "cassandra/cluster.py", line 3628, in cassandra.cluster.ControlConnection._try_connect
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING >   File "cassandra/cluster.py", line 1686, in cassandra.cluster.Cluster.connection_factory
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING >   File "cassandra/connection.py", line 875, in cassandra.connection.Connection.factory
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING >   File "/usr/local/lib/python3.10/site-packages/cassandra/io/libevreactor.py", line 267, in __init__
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING >     self._connect_socket()
2022-12-19 22:10:59,415 f:cluster.py      l:3334 c:cassandra.cluster    p:WARNING >   File "cassandra/connection.py", line 954, in cassandra.connection.Connection._connect_socket

So we can see that the loader lost connectivity right after the old IP address was removed from gossip at 2022-12-19 22:09:53,911. This problem persisted until the end of the test run.
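
For reference, the connectivity loss can be checked independently of the loader with a small Python-driver probe like the sketch below (10.0.8.142 is the node's service address from the logs; the helper name is just illustrative):

# Hedged sketch: try a CQL connection to the address the loader uses and
# surface the NoHostAvailable error seen after the gossip quarantine.
from cassandra.cluster import Cluster, NoHostAvailable

def probe_cql(ip: str, port: int = 9042, timeout: float = 5.0) -> bool:
    cluster = Cluster([ip], port=port, connect_timeout=timeout)
    try:
        session = cluster.connect()
        session.execute("SELECT release_version FROM system.local")
        return True
    except NoHostAvailable as exc:
        print(f"CQL unreachable on {ip}:{port}: {exc}")
        return False
    finally:
        cluster.shutdown()

probe_cql("10.0.8.142")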

During the test run we had periodic health checks; the last successful one was at 2022-12-19 22:09:18:

2022-12-19 22:09:16,160 > Running command "/usr/bin/nodetool  status "...                           
2022-12-19 22:09:17,034 > total,        599174,    1000,    1000,    1000,     0.6,     0.5,     0.7,     5.6,    12.6,    16.5,  600.0,  0.02990      ,      0,      0,       0,       0,       0,       0    
2022-12-19 22:09:18,537 > Datacenter: us-east1-b                                                    
2022-12-19 22:09:18,537 > ======================                                                    
2022-12-19 22:09:18,537 > Status=Up/Down                                                            
2022-12-19 22:09:18,538 > |/ State=Normal/Leaving/Joining/Moving                                       
2022-12-19 22:09:18,547 > --  Address         Load       Tokens       Owns    Host ID                               Rack
2022-12-19 22:09:18,550 > UN  172.20.159.21   ?          256          ?       6367acd4-8c96-4842-b8a9-8305cbe1053a  us-east1  
2022-12-19 22:09:18,551 > UN  172.20.76.129   ?          256          ?       c647e4a6-d2b4-4f49-919c-922b282c74c0  us-east1  
2022-12-19 22:09:18,553 > UN  172.20.161.146  134.32 MB  256          ?       968276ba-5ee0-45b7-8c3d-2046925f72c1  us-east1  
2022-12-19 22:09:18,553 >                                                                             
2022-12-19 22:09:18,553 > Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
2022-12-19 22:09:18,599 > Command "/usr/bin/nodetool  status " finished with status 0               
2022-12-19 22:09:18,599 > Node sct-cluster-us-east1-b-us-east1-2 [10.0.8.200 | 10.0.11.210] (seed: False): Command '/usr/bin/nodetool  status ' duration -> 2.4390276030003406 s
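
For context, a stripped-down version of such a health check could look like the sketch below (the real SCT health checker does much more; the pod and namespace names match this run):

# Illustrative sketch: run `nodetool status` inside the Scylla container and
# require every listed member to be in UN (Up/Normal) state.
import subprocess

def cluster_is_healthy(namespace: str = "scylla",
                       pod: str = "sct-cluster-us-east1-b-us-east1-2") -> bool:
    out = subprocess.run(
        ["kubectl", "-n", namespace, "exec", pod, "-c", "scylla",
         "--", "nodetool", "status"],
        capture_output=True, text=True, check=True,
    ).stdout
    states = [line.split()[0] for line in out.splitlines()
              if line[:2] in ("UN", "UJ", "UL", "DN", "DJ", "DL")]
    return bool(states) and all(state == "UN" for state in states)

print("healthy" if cluster_is_healthy() else "unhealthy")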

Then, as part of that health check, we constantly got the following error while retrying to get the peers info:

2022-12-19 22:24:40,165 f:decorators.py   l:69   c:sdcm.utils.decorators p:DEBUG > 'get_peers_info': failed with 'NoHostAvailable('Unable to connect to any servers', {'10.0.8.142:9042': OSError(None, "Tried connecting to [('10.0.8.142', 9042)]. Last error: timed out")})', retrying [#0]
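
For reference, the peers table itself can still be inspected from a node that remains reachable, to see whether a stale entry for the replaced IP lingers; a hedged sketch (addresses taken from this run):

# Sketch only: look for a leftover system.peers row for the old node.
from cassandra.cluster import Cluster

OLD_IP = "172.20.136.49"          # the replaced node's old address
REACHABLE_NODE = "172.20.159.21"  # a node that still answers, per nodetool

cluster = Cluster([REACHABLE_NODE])
session = cluster.connect()
try:
    rows = session.execute("SELECT peer, host_id, rpc_address FROM system.peers")
    stale = [row for row in rows if str(row.peer) == OLD_IP]
    print(f"stale peers entries for {OLD_IP}: {stale or 'none'}")
finally:
    cluster.shutdown()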

The Python driver connection attempts also kept failing:

2022-12-19 22:14:49,604 f:decorators.py   l:69   c:sdcm.utils.decorators p:DEBUG > 'cql_connection_patient_exclusive': failed with 'NoHostAvailable('Unable to connect to any servers', {'10.0.8.142:9042': OSError(None, "Tried connecting to [('10.0.8.142', 9042)]. Last error: timed out")})', retrying [#2]

Then, at 2022-12-19 23:31:32, we got a successful health check right after the next round of the scenario, in which we delete the same K8S node again:

2022-12-19 23:26:30,338 f:health_checker.py l:41   c:sdcm.utils.health_checker p:DEBUG > Status for target node sct-cluster-us-east1-b-us-east1-2
2022-12-19 23:26:30,338 f:health_checker.py l:140  c:sdcm.utils.health_checker p:WARNING > SYSTEM.PEERS info is not availble. Search for the warning above. Verify schema version can't be performed 
2022-12-19 23:26:30,338 > Node `sct-cluster-us-east1-b-us-east1-2' is healthy                       
2022-12-19 23:31:30,451 > Running command "/usr/bin/nodetool  status "...                           
2022-12-19 23:31:30.450: (ClusterHealthValidatorEvent Severity.NORMAL) period_type=begin event_id=c035fefe-292c-4c87-8f9e-5f69eb19d155
2022-12-19 23:31:31,008 > total,       5504774,    1000,    1000,    1000,     0.5,     0.5,     0.7,     0.7,     2.6,    12.9, 5505.0,  0.00520,      0,      0,       0,       0,       0,       0 
2022-12-19 23:31:32,008 > total,       5534174,    1000,    1000,    1000,     0.6,     0.5,     0.7,     1.2,    12.0,    16.0, 5535.0,  0.00310,      0,      0,       0,       0,       0,       0
2022-12-19 23:31:32,355 > Datacenter: us-east1-b                                                    
2022-12-19 23:31:32,356 > ======================                                                    
2022-12-19 23:31:32,356 > Status=Up/Down                                                            
2022-12-19 23:31:32,356 > |/ State=Normal/Leaving/Joining/Moving                                    
2022-12-19 23:31:32,360 > --  Address         Load       Tokens       Owns    Host ID                               Rack
2022-12-19 23:31:32,366 > UN  172.20.159.21   1.08 MB    256          ?       6367acd4-8c96-4842-b8a9-8305cbe1053a  us-east1
2022-12-19 23:31:32,368 > UN  172.20.76.129   39.92 MB   256          ?       c647e4a6-d2b4-4f49-919c-922b282c74c0  us-east1  
2022-12-19 23:31:32,369 > UN  172.20.161.146  134.32 MB  256          ?       968276ba-5ee0-45b7-8c3d-2046925f72c1  us-east1  
2022-12-19 23:31:32,369 >                                                                             
2022-12-19 23:31:32,369 > Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
2022-12-19 23:31:32,392 > Command "/usr/bin/nodetool  status " finished with status 0

After that, everything starts misbehaving again in the same way.

Note: if we decommission the affected node and then add a new one, instead of using the replace-node operation, we do not have this problem.
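
If that workaround has to be automated, scaling the rack down by one member and back up through the ScyllaCluster CR is one way to do it; a hedged sketch (the CR name, namespace, and rack index are assumptions based on this test setup):

# Sketch only: read-modify-write the ScyllaCluster CR to change the rack size.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

GROUP, VERSION, PLURAL = "scylla.scylladb.com", "v1", "scyllaclusters"
NAMESPACE, NAME = "scylla", "sct-cluster"  # assumed CR name for this run

def set_rack_members(members: int, rack_index: int = 0) -> None:
    cr = api.get_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL, NAME)
    cr["spec"]["datacenter"]["racks"][rack_index]["members"] = members
    api.replace_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL, NAME, cr)

set_rack_members(2)  # decommission one node
# ... wait for the decommission to finish ...
set_rack_members(3)  # add a fresh node back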

To Reproduce

Steps to reproduce the behavior:

  1. Deploy a 3-node Scylla cluster
  2. Delete the VM that hosts a Scylla K8S pod
  3. Mark the affected pod for replacement
  4. See the error

Expected behavior

Connectivity must work for the new node that replaced the old one.
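
A minimal acceptance check for this would be that both the CQL port (9042) and the Prometheus metrics port (9180 by default in Scylla) accept TCP connections on the replacing node, for example:

# Sketch only: plain TCP-level check of the two ports on the new node.
import socket

def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

new_node_ip = "172.20.161.146"  # the replacing node from the logs above
for port in (9042, 9180):
    print(port, "open" if port_open(new_node_ip, port) else "CLOSED")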

Logs

  • CI job: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-longevity-scylla-operator-3h-eks-terminate-replace/2
  • K8S logs: https://cloudius-jenkins-test.s3.amazonaws.com/aa03c3cd-4751-41a9-aa90-4dcacd09379e/20221220_014120/kubernetes-aa03c3cd.tar.gz
  • Loader logs: https://cloudius-jenkins-test.s3.amazonaws.com/aa03c3cd-4751-41a9-aa90-4dcacd09379e/20221220_014120/loader-set-aa03c3cd.tar.gz
  • DB logs: https://cloudius-jenkins-test.s3.amazonaws.com/aa03c3cd-4751-41a9-aa90-4dcacd09379e/20221220_014120/db-cluster-aa03c3cd.tar.gz
  • SCT (test runner) logs: https://cloudius-jenkins-test.s3.amazonaws.com/aa03c3cd-4751-41a9-aa90-4dcacd09379e/20221220_014120/sct-runner-aa03c3cd.tar.gz

Environment:

  • Platform: EKS
  • Kubernetes version: v1.22.15
  • Scylla version: 5.1.1
  • Scylla-operator version: v1.8.0-alpha.0-162-g7be1034

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 18 (11 by maintainers)

Most upvoted comments

@vponomaryov

Take https://github.com/scylladb/scylla-cluster-tests/pull/5627 and see if it fixes the issue we are seeing here.