scylla-operator: During rollout restart operator doesn't wait until previously restarted pod becomes part of a Scylla cluster
Describe the bug
If we change the number of cores for Scylla (which triggers a rolling pod restart), we start hitting the following race condition: the first pod gets restarted, and if it rejoins the cluster with some unexpected delay, the operator doesn't wait for it and restarts the second pod. In a 3-node cluster this leaves the cluster inoperational, without quorum, for several minutes. It happens on GKE and doesn't happen on EKS, because in the latter case a restarted pod rejoins the cluster much faster.
Below are the logs from a run on a GKE cluster.

pod-2 (the first restarted) logs:
2022-10-27 18:40:55,710 INFO waiting for scylla to stop
INFO 2022-10-27 18:40:55,710 [shard 0] init - Signal received; shutting down
...
INFO 2022-10-27 18:40:57,897 [shard 0] gossip - Gossip is now stopped
...
INFO 2022-10-27 18:40:59,866 [shard 0] init - Scylla version 5.1.0~rc3-0.20221009.9deeeb4db1cd shutdown complete.
2022-10-27 18:40:59,940 INFO stopped: scylla (exit status 0)
...
I1027 18:41:03.305900 1 operator/sidecar.go:438] "Scylla process finished"
rpc error: code = NotFound desc = an error occurred when try to find container "ef77bed95aa5838282bc5d55420a2718d6b111912fc2954dc645c35c7ce87d3f": not found
I1027 18:41:11.371885 1 operator/sidecar.go:147] sidecar version "v1.8.0-alpha.0-133-g97c831e"
...
I1027 18:41:12.363766 1 operator/sidecar.go:360] "Starting scylla"
...
INFO 2022-10-27 18:41:15,764 [shard 0] database - Resharded 7kB for system.compaction_history in 0.54 seconds, 13kB/s
...
INFO 2022-10-27 18:41:18,583 [shard 0] init - loading system_schema sstables
...
INFO 2022-10-27 18:41:19,311 [shard 0] init - setting up system keyspace
...
INFO 2022-10-27 18:41:21,558 [shard 0] database - Resharded 235MB for keyspace1.standard1 in 2.02 seconds, 116MB/s
...
INFO 2022-10-27 18:41:21,621 [shard 0] storage_service - entering STARTING mode
...
INFO 2022-10-27 18:41:21,700 [shard 0] storage_service - Starting up server gossip
...
E1027 18:41:24.931854 1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
...
E1027 18:41:34.932431 1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
...
INFO 2022-10-27 18:41:44,256 [shard 0] storage_service - entering JOINING mode
...
INFO 2022-10-27 18:41:44,828 [shard 0] storage_service - Node 10.80.11.2 state jump to normal
E1027 18:41:44.931320 1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
INFO 2022-10-27 18:41:45,035 [shard 0] storage_service - entering NORMAL mode
...
INFO 2022-10-27 18:41:57,051 [shard 0] init - Scylla version 5.1.0~rc3-0.20221009.9deeeb4db1cd initialization completed.
WARN 2022-10-27 18:42:45,043 [shard 0] cdc - Could not update CDC description table with generation (2022/10/27 18:24:22, 69265668-5186-4a22-a2b1-6b6bad5a0f55): exceptions::unavailable_exception (Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1). Will try again.
WARN 2022-10-27 18:43:14,796 [shard 0] gossip - Skip marking node 10.80.4.107 with status = shutdown as UP
INFO 2022-10-27 18:43:14,796 [shard 0] gossip - InetAddress 10.80.14.183 is now UP, status = NORMAL
INFO 2022-10-27 18:43:14,798 [shard 0] storage_service - Node 10.80.14.183 state jump to normal
INFO 2022-10-27 18:43:14,801 [shard 0] storage_service - Node 10.80.4.107 state jump to normal
INFO 2022-10-27 18:43:14,950 [shard 0] gossip - InetAddress 10.80.12.113 is now DOWN, status = LEFT
INFO 2022-10-27 18:43:14,952 [shard 0] gossip - Node 10.80.12.113 will be removed from gossip at [2022-10-30 18:31:53]: (expire = 1667154713831435293, now = 1666896194952348069, diff = 258518 seconds)
...
WARN 2022-10-27 18:43:36,752 [shard 0] gossip - Fail to send EchoMessage to 10.80.4.107: seastar::rpc::closed_error (connection is closed)
INFO 2022-10-27 18:43:36,892 [shard 0] gossip - InetAddress 10.80.4.107 is now UP, status = NORMAL
pod-1 (the second restarted) logs:
...
INFO 2022-10-27 18:32:55,041 [shard 0] gossip - 60000 ms elapsed, 10.80.12.113 gossip quarantine over
INFO 2022-10-27 18:40:55,897 [shard 0] gossip - Got shutdown message from 10.80.11.2, received_generation=1666893961, local_generation=1666893961
INFO 2022-10-27 18:40:55,898 [shard 0] gossip - InetAddress 10.80.11.2 is now DOWN, status = shutdown
2022-10-27 18:42:05,276 INFO waiting for scylla to stop
INFO 2022-10-27 18:42:05,291 [shard 0] compaction_manager - Asked to stop
...
So, from the logs above we can see that at the time pod-1 (the second one restarted) got the shutdown message, the first one was not ready yet. The quorum error from CDC confirms it.
To Reproduce
Steps to reproduce the behavior:
- Deploy the operator
- Deploy a 3-pod Scylla cluster on GKE
- Run some constant load
- Change the number of CPUs in the scyllacluster spec (see the patch sketch after this list)
- Wait for the restart of the second pod
- See quorum errors on the loader side
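For reference, the CPU change can be applied with a patch like the one below. This is a minimal sketch: the cluster name `sct-cluster`, the `scylla` namespace, and the single-rack layout are assumptions based on the Service names in the logs, and the JSON-patch `replace` op assumes `resources.limits.cpu` / `resources.requests.cpu` are already set in the spec.

```bash
# Bump the CPU count of the (assumed) first rack; the operator reacts by
# rolling the pods one by one, which is where the race shows up.
kubectl -n scylla patch scyllacluster sct-cluster --type=json -p='[
  {"op": "replace", "path": "/spec/datacenter/racks/0/resources/limits/cpu", "value": "4"},
  {"op": "replace", "path": "/spec/datacenter/racks/0/resources/requests/cpu", "value": "4"}
]'

# Watch the rollout; the bug is visible when the second pod goes down
# before the first one has rejoined the cluster.
kubectl -n scylla get pods -w
```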
Expected behavior
The operator should wait for the previous pod to rejoin the Scylla cluster before restarting the next one.
Logs
K8S logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/kubernetes-0512c157.tar.gz
DB logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/db-cluster-0512c157.tar.gz
Environment:
- Platform: GKE
- Kubernetes version: 1.22
- Scylla version: 5.0.5, 5.1.0-rc3
- Scylla-operator version: v1.8.0-alpha.0-133-g97c831e, v1.7.4
Commits related to this issue
- test(k8s): skip flush and reshard test due to operator's bug The 'test_nodetool_flush_and_reshard' test fails due to the following bug: https://github.com/scylladb/scylla-operator/issues/1077 So, ski... — committed to vponomaryov/scylla-cluster-tests by vponomaryov 2 years ago
- fix(k8s-functional): skip flush-and-reshard test for non-eks backends The 'test_nodetool_flush_and_reshard' test is affected by the following bug: https://github.com/scylladb/scylla-operator/iss... — committed to vponomaryov/scylla-cluster-tests by vponomaryov 2 years ago
- fix(k8s-functional): skip flush-and-reshard test for non-eks backends The 'test_nodetool_flush_and_reshard' test is affected by the following bug: https://github.com/scylladb/scylla-operator/iss... — committed to scylladb/scylla-cluster-tests by vponomaryov 2 years ago
- fix(k8s-nemesis): make flush_and_reshard run only having 14+ cores Operator's racy bug [1] gets reproduced on slow backends and nodes with small number of cores. So, skip it for the nodes that are no... — committed to vponomaryov/scylla-cluster-tests by vponomaryov 2 years ago
- fix(k8s-nemesis): make flush_and_reshard run only having 14+ cores Operator's racy bug [1] gets reproduced on slow backends and nodes with small number of cores. So, skip it for the nodes that are no... — committed to scylladb/scylla-cluster-tests by vponomaryov 2 years ago
After my last experiment, I aggregated the same types of errors.
Most of the errors were unavailability errors, and I found out that in setups where the total number of errors is low, almost all of them show up while a node is being torn down.
I looked up the Scylla documentation about how nodes should be torn down, and we have a mismatch. Changing the PreStopHook to nodetool drain + supervisorctl stop scylla caused Scylla to start printing lots of messages about operations being aborted on other nodes. I found a related issue (https://github.com/scylladb/scylladb/issues/10447); unfortunately the fix is not easy to backport, so it wasn't backported to recent versions. Apparently master was fixed, but when I tried it I wasn't able to restart any node because Scylla got stuck. I tried older versions and found out that 5.0.13 doesn't print these abort messages, and it also fixed the failures happening during node teardown. On the PodIP setup I was left with only 2-3 EOF failures, which are either a Scylla bug (not gracefully shutting down the connection) or the gocql driver misbehaving. I left these unresolved to proceed further; we can tackle them later.

This unfortunately didn't solve the traffic disruption on ClusterIP setups, where errors showed more than one node being down. This meant there was a split brain in the gossip state. Looking at nodetool status from all nodes confirmed it.
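For context, the sequence tested in the PreStopHook boils down to the two commands below (a sketch of the commands only; the actual hook is wired through the container lifecycle spec, and supervisord is the process manager visible in the sidecar logs above):

```bash
# Flush memtables and stop accepting traffic, then stop the Scylla
# process managed by supervisord.
nodetool drain
supervisorctl stop scylla
```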
I initially thought that maybe kube-proxy lags and iptables are not updated fast enough, but I ruled that out: experiments showed the Service ClusterIP mappings in iptables are updated right after the Pod is recreated with a new PodIP.
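The check itself is straightforward: compare the member Service's ClusterIP with the DNAT rules kube-proxy programs on the node. A rough sketch, assuming an iptables-mode kube-proxy and the per-member Service naming from the logs:

```bash
# ClusterIP of the per-member Service
kubectl -n scylla get svc sct-cluster-us-east1-b-us-east1-2 \
  -o jsonpath='{.spec.clusterIP}{"\n"}'

# On the node: the NAT rules should already point that ClusterIP at the
# new PodIP right after the pod is recreated.
sudo iptables-save -t nat | grep <cluster-ip>
```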
Scylla keeps 4 connections between each shard and node on port 7000, used for inter-node communication. One of them is used for gossip. These connections are lazily initialized, so there might be fewer than 4 if one was dropped and nothing needed it since.
If we look at a stable 2-node ClusterIP cluster where each node has 1 shard, we can see the existing connections and their mapping using conntrack. Brief info about the cluster state:

There are 3 connections from pod-0 to svc-1 with the correct NATed address of pod-1, and 4 connections from pod-1 to svc-0 with the correct NATed address of pod-0.
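For reference, the sessions can be listed on the node with conntrack, filtered on the inter-node port 7000 (a sketch; run it on the Kubernetes node hosting the pod, with root privileges):

```bash
# List tracked TCP sessions towards the inter-node (RPC/gossip) port 7000.
# For ClusterIP setups the output shows both the original destination
# (the Service IP) and the NATed reply address (the PodIP behind it).
sudo conntrack -L -p tcp --dport 7000

# Narrow it down to one peer, e.g. the svc-1 ClusterIP:
sudo conntrack -L -p tcp --dport 7000 --dst <svc-1-cluster-ip>
```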
When pod-1 is being deleted, conntrack shows multiple attempts where pod-0 tries to reconnect to svc-1 but fails, which is expected. When pod-1 is recreated with the new IP 10.85.0.32, we can see that there are 3 ongoing attempts to connect to svc-1, but with the old pod-1 IP address, in SYN_SENT state, meaning they are awaiting an ACK:

Between when pod-1 was deleted and the new one was recreated, pod-0 tried to reconnect but the traffic was blackholed, meaning this SYN is lost and the session needs to time out. At the same time, we can see that the old sessions entered the TIME_WAIT state, which is normal, and that pod-1 managed to connect to svc-0:

After the SYN_SENT sessions expired, there were no sessions from pod-0 to svc-1, only from pod-1 to svc-0:

Eventually pod-0 tried to reconnect and succeeded with the correct pod-1 IP:
Looks like the SYN_SENT sessions are the root cause of our gossip split brain, as nodes cannot establish a session to the node being restarted. They report UN because they can talk to the others, while the others are stuck on the connection. Immediately after these stuck sessions expire, new sessions are established and the gossip state synchronizes.
On GKE, SYN packets are retransmitted at most 6 times, meaning that from the first SYN to failure it can take 127 s; then Scylla needs to reconnect, which can take another couple of seconds, and then exchange gossip.
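The 127 s figure is just the kernel's exponential SYN backoff added up (a sketch of the arithmetic, assuming the default net.ipv4.tcp_syn_retries = 6 on the GKE nodes):

```bash
# Retransmissions are sent roughly 1, 2, 4, 8, 16 and 32 seconds after the
# initial SYN, and the last one is given another 64 seconds before the
# connect() attempt fails:
#   1 + 2 + 4 + 8 + 16 + 32 + 64 = 127 seconds
sysctl net.ipv4.tcp_syn_retries
```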
To solve it we have several options:
I ran experiments on GKE with minReadySeconds on StatefulSets (in short, it's a sleep between updating one node and the next) on different setups.
Each setup consisted of 3 unbound loaders continuously sending read requests using the gocql driver with its default config. After the cluster was fully up and running, the test triggered a rolling restart. Traffic was stopped 5 s after the restart completed.
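For the experiments, minReadySeconds can be set directly on the StatefulSet the operator creates (a sketch; the StatefulSet name is assumed from the pod names in the logs, the field requires a Kubernetes version where StatefulSet minReadySeconds is enabled, and the operator may reconcile the field back at some point):

```bash
# Add a 60s pause after each pod becomes Ready before the rolling update
# proceeds to the next pod.
kubectl -n scylla patch statefulset sct-cluster-us-east1-b-us-east1 \
  --type=merge -p '{"spec": {"minReadySeconds": 60}}'
```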
The discrepancy between the ClusterIP and PodIP results might suggest that kube-proxy, which provides the ClusterIP overlay, causes most of the failures. To verify whether that's the case, I repeated the tests on GKE with Dataplane V2 (Cilium managed by Google), where kube-proxy is not present.
The results show the operator is not able to provide a 100% success rate in any setup, even when minReadySeconds is high, although setting it to 60 s would help a lot on the default ClusterIP configuration while not influencing bootstrap time that much.
Similar results were present when the traffic was non-TLS.
When working with one DC, LOCAL_QUORUM = QUORUM, and all our tests write/read with QUORUM.
Issues in Scylla regarding loss of availability during node shutdown on newer versions and the rpc_client connection timeout:
https://github.com/scylladb/scylladb/issues/15899
https://github.com/scylladb/scylladb/issues/15901
I added setting the conntrack TCP timeout for SYN_SENT entries to our node-setup DaemonSet; it solved the availability issues on both ClusterIP and PodIP setups without minReadySeconds.
We no longer see any availability errors, meaning Scylla rolls out without losing quorum. The EOFs might be a Scylla or gocql bug, not related to the rollout. Timeouts may happen, as the setups run with low, non-guaranteed resources.
Since we found a configuration where we no longer observe any availability issues, I verified how different Scylla versions behave.
Versions >= 5.1 cause request failures while a node is shutting down.
Changing net.netfilter.nf_conntrack_tcp_timeout_syn_sent to 60s breaks ClusterIP scenarios.

Looks like setting net.netfilter.nf_conntrack_tcp_timeout_syn_sent to 20s would fix ClusterIP setups, as it would enforce a shorter timeout on rpc_client connection attempts. Adding a configuration option to Scylla that allows controlling this timeout would let us get rid of this workaround. We also need to fix Scylla, as supported versions have a bug causing availability issues when a node is shutting down.

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
Our node-setup pods are already running as root, so no extra permissions are required.
Removing only SYN_SENT entries towards ClusterIPs and Scylla ports after a configurable and reasonable timeout should only cause more reconnection attempts.
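That alternative, flushing only the stuck entries instead of lowering the global timeout, would look roughly like this (a sketch, run periodically on the node; restricting it to the Scylla inter-node port keeps the blast radius small):

```bash
# Delete only half-open (SYN_SENT) entries towards the Scylla inter-node
# port; the client side just retries and gets a fresh DNAT to the new PodIP.
sudo conntrack -D -p tcp --dport 7000 --state SYN_SENT
```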
That's what I plan to look into next; maybe there's something else causing disruption.
Nope
I haven't tried different node teardown logic yet. It's something I want to try out later as well.