scylla-operator: Replaced node is not available via CQL and Prometheus ports right after the gossip quarantine is over for the old IP address (happens only without SNI proxy)
Describe the bug
Scenario:
- Create a 3-node Scylla cluster using scylla-operator
- Delete the VM that hosts one of the Scylla K8S nodes (the last one in this test run, with index 2)
- Mark the affected Scylla pod for replacement (see the sketch right after this list)
- Wait for it to join the Scylla cluster
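For reference, a minimal sketch of the "mark for replacement" step done programmatically with the Kubernetes Python client, assuming the label-based replace procedure of scylla-operator v1.x (adding the `scylla/replace` label to the member Service); the namespace and Service name below match the logs but are otherwise illustrative:

```python
# A minimal sketch, assuming the scylla/replace label on the member Service is
# what triggers the operator's replace flow (scylla-operator v1.x behavior).
# Namespace and Service name are illustrative, matching the logs below.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

namespace = "scylla"
member_service = "sct-cluster-us-east1-b-us-east1-2"  # member Service of the lost pod

# Add the replace label; the operator then recreates the member so that the
# new pod (new IP) replaces the old node.
core.patch_namespaced_service(
    name=member_service,
    namespace=namespace,
    body={"metadata": {"labels": {"scylla/replace": ""}}},
)
```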
Affected pod logs:
INFO 2022-12-19 21:59:04,114 [shard 0] schema_tables - Creating keyspace keyspace1
INFO 2022-12-19 21:59:04,124 [shard 0] schema_tables - Schema version changed to 5a16863b-efb8-358f-ace8-0ede2e0904d9
INFO 2022-12-19 21:59:04,957 [shard 0] schema_tables - Creating keyspace1.standard1 id=5ab8bcb0-7fe8-11ed-9985-69dbc39b570f version=6e912d9b-41c2-3582-8903-36d96e0d4e8d
INFO 2022-12-19 21:59:04,964 [shard 0] schema_tables - Schema version changed to 19dbec6e-447c-3eb5-9c82-2802326db918
I1219 22:01:02.772084 1 orphanedpv/controller.go:283] "Observed deletion of Node" Node="ip-10-0-8-200.eu-north-1.compute.internal"
Error from server: Get "https://10.0.8.200:10250/containerLogs/scylla/sct-cluster-us-east1-b-us-east1-2/scylla?follow=true&sinceSeconds=30": dial tcp 10.0.8.200:10250: i/o timeout
Error from server (NotFound): pods "ip-10-0-8-200.eu-north-1.compute.internal" not found
Error from server (BadRequest): container "scylla" in pod "sct-cluster-us-east1-b-us-east1-2" is waiting to start: PodInitializing
I1219 22:04:30.230505 1 operator/sidecar.go:147] sidecar version "v1.8.0-alpha.0-162-g7be1034"
...
INFO 2022-12-19 22:06:41,324 [shard 0] storage_service - Node 172.20.136.49 state jump to normal
INFO 2022-12-19 22:06:41,324 [shard 0] storage_service - Set host_id=968276ba-5ee0-45b7-8c3d-2046925f72c1 to be owned by node=172.20.136.49
...
WARN 2022-12-19 22:06:41,391 [shard 0] migration_manager - Can't send migration request: node 172.20.136.49 is down.
...
INFO 2022-12-19 22:06:53,780 [shard 0] storage_service - entering JOINING mode
...
INFO 2022-12-19 22:06:53,800 [shard 0] storage_service - Checking bootstrapping/leaving nodes: ok
INFO 2022-12-19 22:06:53,801 [shard 0] storage_service - Sleeping before replacing 172.20.136.49...
...
WARN 2022-12-19 22:06:56,318 [shard 0] gossip - Fail to send EchoMessage to 172.20.136.49: seastar::rpc::timeout_error (rpc call timed out)
...
I1219 22:07:18.362942 1 sidecar/sync.go:135] "Successfully updated HostID annotation" Service="scylla/sct-cluster-us-east1-b-us-east1-2" HostID="968276ba-5ee0-45b7-8c3d-2046925f72c1"
...
INFO 2022-12-19 22:08:30,828 [shard 0] storage_service - Replace: removing 172.20.136.49 from group 0...
INFO 2022-12-19 22:08:30,828 [shard 0] raft_group_registry - remove_from_group0(172.20.136.49): local RAFT feature disabled, skipping.
INFO 2022-12-19 22:08:30,828 [shard 0] storage_service - Replace: 172.20.136.49 removed from group 0.
INFO 2022-12-19 22:08:30,828 [shard 0] storage_service - Starting to bootstrap...
INFO 2022-12-19 22:08:30,828 [shard 0] storage_service - replace[07e3565a-faa0-46c3-bba7-e20ed47bb7ad]: Started replace operation, replace_nodes={{ 172.20.136.49 -> 172.20.161.146}}, sync_nodes=[172.20.161.146, 172.20.76.129, 172.20.159.21], ignore_nodes=[]
INFO 2022-12-19 22:08:30,829 [shard 0] storage_service - replace[07e3565a-faa0-46c3-bba7-e20ed47bb7ad]: Added replacing_node=172.20.161.146 to replace existing_node=172.20.136.49, coordinator=172.20.161.146
INFO 2022-12-19 22:08:30,829 [shard 0] token_metadata - Added node 172.20.161.146 as pending replacing endpoint which replaces existing node 172.20.136.49
...
INFO 2022-12-19 22:08:45,186 [shard 8] compaction - [Reshape keyspace1.standard1 b48e5e10-7fe9-11ed-ba22-1f228330a7e6] Reshaping ...
...
INFO 2022-12-19 22:08:46,356 [shard 3] repair - repair[3d15cd69-b580-45e5-9cf6-1177fde99d64]: Started to repair 1 out of 3 tables in keyspace=system_auth, table=roles, table_id=5bc52802-de25-35ed-aeab-188eecebb090, repair_reason=replace
...
INFO 2022-12-19 22:08:52,867 [shard 0] storage_service - Node 172.20.161.146 state jump to normal
WARN 2022-12-19 22:08:52,867 [shard 0] storage_service - Host ID collision for 968276ba-5ee0-45b7-8c3d-2046925f72c1 between 172.20.136.49 and 172.20.161.146; 172.20.161.146 is the new owner
INFO 2022-12-19 22:08:52,867 [shard 0] token_metadata - Removed node 172.20.161.146 as pending replacing endpoint which replaces existing node 172.20.136.49
INFO 2022-12-19 22:08:52,868 [shard 0] storage_service - Set host_id=968276ba-5ee0-45b7-8c3d-2046925f72c1 to be owned by node=172.20.161.146, existing=172.20.136.49
...
INFO 2022-12-19 22:08:52,895 [shard 0] storage_service - entering NORMAL mode
...
INFO 2022-12-19 22:09:04,929 [shard 0] init - serving
INFO 2022-12-19 22:09:04,929 [shard 0] init - Scylla version 5.1.1-0.20221208.1cfedc5b59f4 initialization completed.
INFO 2022-12-19 22:09:52,911 [shard 0] gossip - 60000 ms elapsed, 172.20.136.49 gossip quarantine over
INFO 2022-12-19 22:09:53,911 [shard 0] gossip - FatClient 172.20.136.49 has been silent for 30000ms, removing from gossip
INFO 2022-12-19 22:10:53,928 [shard 0] gossip - 60000 ms elapsed, 172.20.136.49 gossip quarantine over
Then, SCT/loader logs:
2022-12-19 22:10:13,409 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2022-12-19 22:10:13.408: (PrometheusAlertManagerEvent Severity.NORMAL) period_type=begin event_id=10bb991f-8104-402a-8b46-f53e63810d3d: alert_name=InstanceDown node=10.0.8.142 start=2022-12-19T22:10:04.791Z end=2022-12-19T22:14:04.791Z description=10.0.8.142 has been down for more than 30 seconds. updated=2022-12-19T22:10:09.713Z state=active fingerprint=5ef727f953f696fb labels={'alertname': 'InstanceDown', 'instance': '10.0.8.142', 'job': 'scylla', 'monitor': 'scylla-monitor', 'severity': '2'}
2022-12-19 22:10:59,415 f:cluster.py l:3334 c:cassandra.cluster p:WARNING > [control connection] Error connecting to 10.0.8.142:9042: < t:2022-12-19 22:10:59,415 f:cluster.py l:3334 c:cassandra.cluster p:WARNING > [control connection] Error connecting to 10.0.8.142:9042:
2022-12-19 22:10:59,415 f:cluster.py l:3334 c:cassandra.cluster p:WARNING > Traceback (most recent call last):
2022-12-19 22:10:59,415 f:cluster.py l:3334 c:cassandra.cluster p:WARNING > File "cassandra/cluster.py", line 3606, in cassandra.cluster.ControlConnection._reconnect_internal
2022-12-19 22:10:59,415 f:cluster.py l:3334 c:cassandra.cluster p:WARNING > File "cassandra/cluster.py", line 3628, in cassandra.cluster.ControlConnection._try_connect
2022-12-19 22:10:59,415 f:cluster.py l:3334 c:cassandra.cluster p:WARNING > File "cassandra/cluster.py", line 1686, in cassandra.cluster.Cluster.connection_factory
2022-12-19 22:10:59,415 f:cluster.py l:3334 c:cassandra.cluster p:WARNING > File "cassandra/connection.py", line 875, in cassandra.connection.Connection.factory
2022-12-19 22:10:59,415 f:cluster.py l:3334 c:cassandra.cluster p:WARNING > File "/usr/local/lib/python3.10/site-packages/cassandra/io/libevreactor.py", line 267, in __init__
2022-12-19 22:10:59,415 f:cluster.py l:3334 c:cassandra.cluster p:WARNING > self._connect_socket()
2022-12-19 22:10:59,415 f:cluster.py l:3334 c:cassandra.cluster p:WARNING > File "cassandra/connection.py", line 954, in cassandra.connection.Connection._connect_socket
So, we can see that the loader lost connectivity right after the old IP address was removed from gossip at 2022-12-19 22:09:53,911.
This problem persisted until the end of the test run.
During the test run we had periodic health checks, and the last successful one was at 2022-12-19 22:09:18:
2022-12-19 22:09:16,160 > Running command "/usr/bin/nodetool status "...
2022-12-19 22:09:17,034 > total, 599174, 1000, 1000, 1000, 0.6, 0.5, 0.7, 5.6, 12.6, 16.5, 600.0, 0.02990 , 0, 0, 0, 0, 0, 0
2022-12-19 22:09:18,537 > Datacenter: us-east1-b
2022-12-19 22:09:18,537 > ======================
2022-12-19 22:09:18,537 > Status=Up/Down
2022-12-19 22:09:18,538 > |/ State=Normal/Leaving/Joining/Moving
2022-12-19 22:09:18,547 > -- Address Load Tokens Owns Host ID Rack
2022-12-19 22:09:18,550 > UN 172.20.159.21 ? 256 ? 6367acd4-8c96-4842-b8a9-8305cbe1053a us-east1
2022-12-19 22:09:18,551 > UN 172.20.76.129 ? 256 ? c647e4a6-d2b4-4f49-919c-922b282c74c0 us-east1
2022-12-19 22:09:18,553 > UN 172.20.161.146 134.32 MB 256 ? 968276ba-5ee0-45b7-8c3d-2046925f72c1 us-east1
2022-12-19 22:09:18,553 >
2022-12-19 22:09:18,553 > Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
2022-12-19 22:09:18,599 > Command "/usr/bin/nodetool status " finished with status 0
2022-12-19 22:09:18,599 > Node sct-cluster-us-east1-b-us-east1-2 [10.0.8.200 | 10.0.11.210] (seed: False): Command '/usr/bin/nodetool status ' duration -> 2.4390276030003406 s
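For context, the periodic health check shown above boils down to running `nodetool status` against the target pod and requiring every node to be in the UN (Up/Normal) state; a rough, hypothetical sketch of such a probe:

```python
# A rough, hypothetical sketch of the periodic health check shown above:
# run "nodetool status" inside the Scylla container and require that every
# listed node is in the UN (Up/Normal) state. Pod, namespace, and container
# names are assumptions matching the logs.
import subprocess

def nodetool_status_ok(pod: str, namespace: str = "scylla") -> bool:
    out = subprocess.run(
        ["kubectl", "-n", namespace, "exec", pod, "-c", "scylla",
         "--", "nodetool", "status"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Node lines start with a two-letter code: U/D (up/down) + N/L/J/M (state).
    states = [line.split()[0] for line in out.splitlines()
              if len(line) >= 2 and line[0] in "UD" and line[1] in "NLJM"]
    return bool(states) and all(state == "UN" for state in states)

print(nodetool_status_ok("sct-cluster-us-east1-b-us-east1-2"))
```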
After that, the health check kept retrying to get the peers info and constantly failed with the following error:
2022-12-19 22:24:40,165 f:decorators.py l:69 c:sdcm.utils.decorators p:DEBUG > 'get_peers_info': failed with 'NoHostAvailable('Unable to connect to any servers', {'10.0.8.142:9042': OSError(None, "Tried connecting to [('10.0.8.142', 9042)]. Last error: timed out")})', retrying [#0]
The Python driver connection attempts failed the same way:
2022-12-19 22:14:49,604 f:decorators.py l:69 c:sdcm.utils.decorators p:DEBUG > 'cql_connection_patient_exclusive': failed with 'NoHostAvailable('Unable to connect to any servers', {'10.0.8.142:9042': OSError(None, "Tried connecting to [('10.0.8.142', 9042)]. Last error: timed out")})', retrying [#2]
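The failing get_peers_info and driver attempts above are plain CQL connections; an equivalent probe can be sketched with the Python cassandra-driver (the contact point is the address from the logs; the port, timeout, and lack of authentication are assumptions):

```python
# A minimal sketch of the CQL check that keeps failing with NoHostAvailable
# above: open a connection to the node and read system.peers. The contact
# point is the address from the loader logs; everything else is illustrative.
from cassandra.cluster import Cluster, NoHostAvailable

contact_point = "10.0.8.142"

try:
    cluster = Cluster([contact_point], port=9042, connect_timeout=10)
    session = cluster.connect()
    for row in session.execute("SELECT peer, host_id, rpc_address FROM system.peers"):
        print(row.peer, row.host_id, row.rpc_address)
    cluster.shutdown()
except NoHostAvailable as exc:
    # This is the state the loader/health check is stuck in after the
    # old IP's gossip quarantine is over.
    print("CQL port unreachable:", exc)
```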
Then, at 2022-12-19 23:31:32, we got a successful health check right after the next round of the scenario, where we delete the same K8S node again:
2022-12-19 23:26:30,338 f:health_checker.py l:41 c:sdcm.utils.health_checker p:DEBUG > Status for target node sct-cluster-us-east1-b-us-east1-2
2022-12-19 23:26:30,338 f:health_checker.py l:140 c:sdcm.utils.health_checker p:WARNING > SYSTEM.PEERS info is not availble. Search for the warning above. Verify schema version can't be performed
2022-12-19 23:26:30,338 > Node `sct-cluster-us-east1-b-us-east1-2' is healthy
2022-12-19 23:31:30,451 > Running command "/usr/bin/nodetool status "...
2022-12-19 23:31:30.450: (ClusterHealthValidatorEvent Severity.NORMAL) period_type=begin event_id=c035fefe-292c-4c87-8f9e-5f69eb19d155
2022-12-19 23:31:31,008 > total, 5504774, 1000, 1000, 1000, 0.5, 0.5, 0.7, 0.7, 2.6, 12.9, 5505.0, 0.00520, 0, 0, 0, 0, 0, 0
2022-12-19 23:31:32,008 > total, 5534174, 1000, 1000, 1000, 0.6, 0.5, 0.7, 1.2, 12.0, 16.0, 5535.0, 0.00310, 0, 0, 0, 0, 0, 0
2022-12-19 23:31:32,355 > Datacenter: us-east1-b
2022-12-19 23:31:32,356 > ======================
2022-12-19 23:31:32,356 > Status=Up/Down
2022-12-19 23:31:32,356 > |/ State=Normal/Leaving/Joining/Moving
2022-12-19 23:31:32,360 > -- Address Load Tokens Owns Host ID Rack
2022-12-19 23:31:32,366 > UN 172.20.159.21 1.08 MB 256 ? 6367acd4-8c96-4842-b8a9-8305cbe1053a us-east1
2022-12-19 23:31:32,368 > UN 172.20.76.129 39.92 MB 256 ? c647e4a6-d2b4-4f49-919c-922b282c74c0 us-east1
2022-12-19 23:31:32,369 > UN 172.20.161.146 134.32 MB 256 ? 968276ba-5ee0-45b7-8c3d-2046925f72c1 us-east1
2022-12-19 23:31:32,369 >
2022-12-19 23:31:32,369 > Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
2022-12-19 23:31:32,392 > Command "/usr/bin/nodetool status " finished with status 0
After that, everything misbehaves again in the same way.
Note: if we decommission the affected node and then add a new one, instead of using the replace node operation, we do not hit this problem (a sketch of that alternative flow is below).
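A hypothetical sketch of that decommission-then-add alternative, assuming the operator decommissions the highest-ordinal member on scale-down (which is the affected node in this run) and adds a fresh one on scale-up; the ScyllaCluster name, namespace, and API group/version are assumptions:

```python
# A hypothetical sketch of the decommission-then-add flow mentioned above:
# scale the rack down by one member and then back up, instead of using the
# replace operation. Assumes scale-down decommissions the last (affected)
# member; cluster name, namespace, and API group/version are illustrative.
from kubernetes import client, config

config.load_kube_config()
crd = client.CustomObjectsApi()
ARGS = dict(group="scylla.scylladb.com", version="v1",
            namespace="scylla", plural="scyllaclusters", name="sct-cluster")

def set_rack_members(members: int) -> None:
    obj = crd.get_namespaced_custom_object(**ARGS)
    obj["spec"]["datacenter"]["racks"][0]["members"] = members
    crd.replace_namespaced_custom_object(body=obj, **ARGS)

set_rack_members(2)  # decommission the affected (last) member
# ...wait for the decommission to complete, then:
set_rack_members(3)  # add a brand-new member with a fresh identity
```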
To Reproduce
Steps to reproduce the behavior:
- Deploy a 3-node Scylla cluster
- Delete the VM that hosts a Scylla K8S pod
- Mark the affected pod for replacement
- See the error
Expected behavior
Connectivity must work for the new node that replaced the old one.
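That expectation can be verified with a simple TCP probe of the two ports from the title; the default ports (9042 for CQL, 9180 for Scylla's Prometheus metrics) and the replacing node's address from the logs are assumed:

```python
# A small sketch of the check implied by the expected behavior: after the
# replace finishes, both the CQL and Prometheus metrics ports of the new node
# should accept connections. Default ports and the replacing node's address
# from the logs are assumed.
import socket

def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

new_node = "172.20.161.146"
for port in (9042, 9180):
    print(port, "open" if port_open(new_node, port) else "CLOSED")
```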
Logs
- CI job: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-longevity-scylla-operator-3h-eks-terminate-replace/2
- K8S logs: https://cloudius-jenkins-test.s3.amazonaws.com/aa03c3cd-4751-41a9-aa90-4dcacd09379e/20221220_014120/kubernetes-aa03c3cd.tar.gz
- Loader logs: https://cloudius-jenkins-test.s3.amazonaws.com/aa03c3cd-4751-41a9-aa90-4dcacd09379e/20221220_014120/loader-set-aa03c3cd.tar.gz
- DB logs: https://cloudius-jenkins-test.s3.amazonaws.com/aa03c3cd-4751-41a9-aa90-4dcacd09379e/20221220_014120/db-cluster-aa03c3cd.tar.gz
- SCT (Test runner) logs: https://cloudius-jenkins-test.s3.amazonaws.com/aa03c3cd-4751-41a9-aa90-4dcacd09379e/20221220_014120/sct-runner-aa03c3cd.tar.gz
Environment:
- Platform: EKS
- Kubernetes version: v1.22.15
- Scylla version: 5.1.1
- Scylla-operator version: v1.8.0-alpha.0-162-g7be1034
Commits related to this issue
- ci(debug): https://github.com/scylladb/scylla-operator/issues/1124 — committed to vponomaryov/scylla-cluster-tests by vponomaryov a year ago
@vponomaryov take https://github.com/scylladb/scylla-cluster-tests/pull/5627, and see if it fixes the issue we are seeing here.