scylla-operator: During rollout restart operator doesn't wait until previously restarted pod becomes part of a Scylla cluster
Describe the bug
If we change the number of cores for Scylla (which triggers a rolling pod restart), we start hitting the following race condition: the first pod gets restarted, and if it rejoins the cluster with some unexpected delay, the operator doesn't wait for it and restarts the second pod. In a 3-node cluster this leaves the cluster inoperational, without quorum, for several minutes. It happens on GKE and doesn't happen on EKS, because in the latter case a restarted pod rejoins the cluster much faster.
Below are the logs from a run on a GKE cluster.

pod-2 (the first restarted) logs:
2022-10-27 18:40:55,710 INFO waiting for scylla to stop
INFO 2022-10-27 18:40:55,710 [shard 0] init - Signal received; shutting down
...
INFO 2022-10-27 18:40:57,897 [shard 0] gossip - Gossip is now stopped
...
INFO 2022-10-27 18:40:59,866 [shard 0] init - Scylla version 5.1.0~rc3-0.20221009.9deeeb4db1cd shutdown complete.
2022-10-27 18:40:59,940 INFO stopped: scylla (exit status 0)
...
I1027 18:41:03.305900 1 operator/sidecar.go:438] "Scylla process finished"
rpc error: code = NotFound desc = an error occurred when try to find container "ef77bed95aa5838282bc5d55420a2718d6b111912fc2954dc645c35c7ce87d3f": not found
I1027 18:41:11.371885 1 operator/sidecar.go:147] sidecar version "v1.8.0-alpha.0-133-g97c831e"
...
I1027 18:41:12.363766 1 operator/sidecar.go:360] "Starting scylla"
...
INFO 2022-10-27 18:41:15,764 [shard 0] database - Resharded 7kB for system.compaction_history in 0.54 seconds, 13kB/s
...
INFO 2022-10-27 18:41:18,583 [shard 0] init - loading system_schema sstables
...
INFO 2022-10-27 18:41:19,311 [shard 0] init - setting up system keyspace
...
INFO 2022-10-27 18:41:21,558 [shard 0] database - Resharded 235MB for keyspace1.standard1 in 2.02 seconds, 116MB/s
...
INFO 2022-10-27 18:41:21,621 [shard 0] storage_service - entering STARTING mode
...
INFO 2022-10-27 18:41:21,700 [shard 0] storage_service - Starting up server gossip
...
E1027 18:41:24.931854 1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
...
E1027 18:41:34.932431 1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
...
INFO 2022-10-27 18:41:44,256 [shard 0] storage_service - entering JOINING mode
...
INFO 2022-10-27 18:41:44,828 [shard 0] storage_service - Node 10.80.11.2 state jump to normal
E1027 18:41:44.931320 1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
INFO 2022-10-27 18:41:45,035 [shard 0] storage_service - entering NORMAL mode
...
INFO 2022-10-27 18:41:57,051 [shard 0] init - Scylla version 5.1.0~rc3-0.20221009.9deeeb4db1cd initialization completed.
WARN 2022-10-27 18:42:45,043 [shard 0] cdc - Could not update CDC description table with generation (2022/10/27 18:24:22, 69265668-5186-4a22-a2b1-6b6bad5a0f55): exceptions::unavailable_exception (Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1). Will try again.
WARN 2022-10-27 18:43:14,796 [shard 0] gossip - Skip marking node 10.80.4.107 with status = shutdown as UP
INFO 2022-10-27 18:43:14,796 [shard 0] gossip - InetAddress 10.80.14.183 is now UP, status = NORMAL
INFO 2022-10-27 18:43:14,798 [shard 0] storage_service - Node 10.80.14.183 state jump to normal
INFO 2022-10-27 18:43:14,801 [shard 0] storage_service - Node 10.80.4.107 state jump to normal
INFO 2022-10-27 18:43:14,950 [shard 0] gossip - InetAddress 10.80.12.113 is now DOWN, status = LEFT
INFO 2022-10-27 18:43:14,952 [shard 0] gossip - Node 10.80.12.113 will be removed from gossip at [2022-10-30 18:31:53]: (expire = 1667154713831435293, now = 1666896194952348069, diff = 258518 seconds)
...
WARN 2022-10-27 18:43:36,752 [shard 0] gossip - Fail to send EchoMessage to 10.80.4.107: seastar::rpc::closed_error (connection is closed)
INFO 2022-10-27 18:43:36,892 [shard 0] gossip - InetAddress 10.80.4.107 is now UP, status = NORMAL
pod-1 (the second restarted) logs:
...
INFO 2022-10-27 18:32:55,041 [shard 0] gossip - 60000 ms elapsed, 10.80.12.113 gossip quarantine over
INFO 2022-10-27 18:40:55,897 [shard 0] gossip - Got shutdown message from 10.80.11.2, received_generation=1666893961, local_generation=1666893961
INFO 2022-10-27 18:40:55,898 [shard 0] gossip - InetAddress 10.80.11.2 is now DOWN, status = shutdown
2022-10-27 18:42:05,276 INFO waiting for scylla to stop
INFO 2022-10-27 18:42:05,291 [shard 0] compaction_manager - Asked to stop
...
So, from the logs above we can see that at the time pod-1 (the second one restarted) got the shutdown message, the first one was not ready yet. The quorum error from CDC confirms it.
To Reproduce
Steps to reproduce the behavior:
- Deploy the operator
- Deploy a 3-pod Scylla cluster on GKE
- Run some constant load
- Change the number of CPUs in the scyllacluster spec (see the patch sketch after this list)
- Wait for the restart of the second pod
- See quorum errors on the loader side
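For reference, the CPU change can be applied with a patch like the one below. This is a minimal sketch: the cluster name `sct-cluster`, the `scylla` namespace, and the single-rack layout are assumptions based on the Service names in the logs, and the JSON-patch `replace` op assumes `resources.limits.cpu` / `resources.requests.cpu` are already set in the spec.

```bash
# Bump the CPU count of the (assumed) first rack; the operator reacts by
# rolling the pods one by one, which is where the race shows up.
kubectl -n scylla patch scyllacluster sct-cluster --type=json -p='[
  {"op": "replace", "path": "/spec/datacenter/racks/0/resources/limits/cpu", "value": "4"},
  {"op": "replace", "path": "/spec/datacenter/racks/0/resources/requests/cpu", "value": "4"}
]'

# Watch the rollout; the bug is visible when the second pod goes down
# before the first one has rejoined the cluster.
kubectl -n scylla get pods -w
```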
Expected behavior
The operator should wait for the previous pod to rejoin the Scylla cluster before restarting the next one.
Logs
K8S logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/kubernetes-0512c157.tar.gz
DB logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/db-cluster-0512c157.tar.gz
Environment:
- Platform: GKE
- Kubernetes version: 1.22
- Scylla version: 5.0.5, 5.1.0-rc3
- Scylla-operator version: v1.8.0-alpha.0-133-g97c831e, v1.7.4
Commits related to this issue
- test(k8s): skip flush and reshard test due to operator's bug The 'test_nodetool_flush_and_reshard' test fails due to the following bug: https://github.com/scylladb/scylla-operator/issues/1077 So, ski... — committed to vponomaryov/scylla-cluster-tests by vponomaryov 2 years ago
- fix(k8s-functional): skip flush-and-reshard test for non-eks backends The 'test_nodetool_flush_and_reshard' test is affected by the following bug: https://github.com/scylladb/scylla-operator/iss... — committed to vponomaryov/scylla-cluster-tests by vponomaryov 2 years ago
- fix(k8s-functional): skip flush-and-reshard test for non-eks backends The 'test_nodetool_flush_and_reshard' test is affected by the following bug: https://github.com/scylladb/scylla-operator/iss... — committed to scylladb/scylla-cluster-tests by vponomaryov 2 years ago
- fix(k8s-nemesis): make flush_and_reshard run only having 14+ cores Operator's racy bug [1] gets reproduced on slow backends and nodes with small number of cores. So, skip it for the nodes that are no... — committed to vponomaryov/scylla-cluster-tests by vponomaryov 2 years ago
- fix(k8s-nemesis): make flush_and_reshard run only having 14+ cores Operator's racy bug [1] gets reproduced on slow backends and nodes with small number of cores. So, skip it for the nodes that are no... — committed to scylladb/scylla-cluster-tests by vponomaryov 2 years ago
After my last experiment, I aggregated the same types of errors.
Most of the errors were unavailability errors, and I found out that in setups where the total number of errors is low, almost all of them show up while a node is being torn down.
I looked up the Scylla documentation about how nodes should be torn down, and we have a mismatch. Changing the PreStopHook to nodetool drain + supervisorctl stop scylla caused Scylla to start printing lots of messages about operations being aborted on other nodes. I found a related issue (https://github.com/scylladb/scylladb/issues/10447); unfortunately the fix is not easy to backport, so it wasn't backported to recent versions. Apparently master was fixed, but when I tried it I wasn't able to restart any node because Scylla got stuck. I tried older versions and found out that 5.0.13 doesn't print these abort messages, and it also fixed the failures happening during node teardown. On the PodIP setup I was left with only 2-3 EOF failures, which are either a Scylla bug (not gracefully shutting down the connection) or the gocql driver misbehaving. I left these unresolved to proceed further; we can tackle them later.

This unfortunately didn't solve the traffic disruption on ClusterIP setups, where errors showed more than one node being down. This meant there was a split brain in the gossip state. Looking at nodetool status from all nodes confirmed it.
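For context, the sequence tested in the PreStopHook boils down to the two commands below (a sketch of the commands only; the actual hook is wired through the container lifecycle spec, and supervisord is the process manager visible in the sidecar logs above):

```bash
# Flush memtables and stop accepting traffic, then stop the Scylla
# process managed by supervisord.
nodetool drain
supervisorctl stop scylla
```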
I initially thought that maybe kube-proxy lags and iptables are not updated fast enough, but I ruled that out: experiments showed the Service ClusterIP mappings in iptables are updated right after the Pod is recreated with a new PodIP.
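The check itself is straightforward: compare the member Service's ClusterIP with the DNAT rules kube-proxy programs on the node. A rough sketch, assuming an iptables-mode kube-proxy and the per-member Service naming from the logs:

```bash
# ClusterIP of the per-member Service
kubectl -n scylla get svc sct-cluster-us-east1-b-us-east1-2 \
  -o jsonpath='{.spec.clusterIP}{"\n"}'

# On the node: the NAT rules should already point that ClusterIP at the
# new PodIP right after the pod is recreated.
sudo iptables-save -t nat | grep <cluster-ip>
```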
Scylla keeps 4 connections between each shard and node on port 7000, used for inter-node communication. One of them is used for gossip. These connections are lazily initialized, so there might be fewer than 4 if one was dropped and nothing needed it since.
If we look at a stable 2-node ClusterIP cluster where each node has 1 shard, we can see the existing connections and their mapping using conntrack. Brief info about the cluster state:

There are 3 connections from pod-0 to svc-1 with the correct NATed address of pod-1, and 4 connections from pod-1 to svc-0 with the correct NATed address of pod-0.
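For reference, the sessions can be listed on the node with conntrack, filtered on the inter-node port 7000 (a sketch; run it on the Kubernetes node hosting the pod, with root privileges):

```bash
# List tracked TCP sessions towards the inter-node (RPC/gossip) port 7000.
# For ClusterIP setups the output shows both the original destination
# (the Service IP) and the NATed reply address (the PodIP behind it).
sudo conntrack -L -p tcp --dport 7000

# Narrow it down to one peer, e.g. the svc-1 ClusterIP:
sudo conntrack -L -p tcp --dport 7000 --dst <svc-1-cluster-ip>
```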
When pod-1 is being deleted, conntrack shows multiple attempts where pod-0 tries to reconnect to svc-1 but fails, which is expected. When pod-1 is recreated with the new IP 10.85.0.32, we can see that there are 3 ongoing attempts to connect to svc-1, but with the old pod-1 IP address, in SYN_SENT state, meaning they are awaiting an ACK:

Between when pod-1 was deleted and the new one was recreated, pod-0 tried to reconnect but the traffic was blackholed, meaning this SYN is lost and the session needs to time out. At the same time, we can see that the old sessions entered the TIME_WAIT state, which is normal, and that pod-1 managed to connect to svc-0:

After the SYN_SENT sessions expired, there were no sessions from pod-0 to svc-1, only from pod-1 to svc-0:

Eventually pod-0 tried to reconnect and succeeded with the correct pod-1 IP:
Looks like the SYN_SENT sessions are the root cause of our gossip split brain, as nodes cannot establish a session to the node being restarted. They report UN because they can talk to the others, while the others are stuck on the connection. Immediately after these stuck sessions expire, new sessions are established and the gossip state synchronizes.
On GKE, SYN packets are retransmitted at most 6 times, meaning that from the first SYN to failure it can take 127 s; then Scylla needs to reconnect, which can take another couple of seconds, and then exchange gossip.
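The 127 s figure is just the kernel's exponential SYN backoff added up (a sketch of the arithmetic, assuming the default net.ipv4.tcp_syn_retries = 6 on the GKE nodes):

```bash
# Retransmissions are sent roughly 1, 2, 4, 8, 16 and 32 seconds after the
# initial SYN, and the last one is given another 64 seconds before the
# connect() attempt fails:
#   1 + 2 + 4 + 8 + 16 + 32 + 64 = 127 seconds
sysctl net.ipv4.tcp_syn_retries
```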
To solve it we have several options:
I ran experiments on GKE with minReadySeconds on StatefulSets (in short, it's a sleep between updating one node and the next) on different setups.
Each setup consisted of 3 unbound loaders continuously sending read requests using the gocql driver with its default config. After the cluster was fully up and running, the test triggered a rolling restart. Traffic was stopped 5 s after the restart completed.
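For the experiments, minReadySeconds can be set directly on the StatefulSet the operator creates (a sketch; the StatefulSet name is assumed from the pod names in the logs, the field requires a Kubernetes version where StatefulSet minReadySeconds is enabled, and the operator may reconcile the field back at some point):

```bash
# Add a 60s pause after each pod becomes Ready before the rolling update
# proceeds to the next pod.
kubectl -n scylla patch statefulset sct-cluster-us-east1-b-us-east1 \
  --type=merge -p '{"spec": {"minReadySeconds": 60}}'
```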
The discrepancy between the ClusterIP and PodIP results might suggest that kube-proxy, which provides the ClusterIP overlay, causes most of the failures. To verify whether that's the case, I repeated the tests on GKE with Dataplane V2 (Cilium managed by Google), where kube-proxy is not present.
The results show the operator is not able to provide a 100% success rate in any setup, even when minReadySeconds is high, although setting it to 60 s would help a lot on the default ClusterIP configuration while not influencing bootstrap time that much.
Similar results were present when the traffic was non-TLS.
When working with one DC, LOCAL_QUORUM = QUORUM, and all our tests write/read with QUORUM.
Issues in Scylla regarding loss of availability during node shutdown on newer versions and the rpc_client connection timeout:
https://github.com/scylladb/scylladb/issues/15899
https://github.com/scylladb/scylladb/issues/15901
I added setting the conntrack TCP timeout for SYN_SENT entries to our node-setup DaemonSet; it solved the availability issues on both ClusterIP and PodIP setups without minReadySeconds.
We no longer see any availability errors, meaning Scylla rolls out without losing quorum. The EOFs might be a Scylla or gocql bug, not related to the rollout. Timeouts may happen, as the setups run with low, non-guaranteed resources.
Since we found a configuration where we no longer observe any availability issues, I verified how different Scylla versions behave.
Versions >= 5.1 cause request failures while a node is shutting down.
Changing net.netfilter.nf_conntrack_tcp_timeout_syn_sent to 60s breaks ClusterIP scenarios.

Looks like setting net.netfilter.nf_conntrack_tcp_timeout_syn_sent to 20s would fix ClusterIP setups, as it would enforce a shorter timeout on rpc_client connection attempts. Adding a configuration option to Scylla that allows controlling this timeout would let us get rid of this workaround. We also need to fix Scylla, as supported versions have a bug causing availability issues when a node is shutting down.

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
Our node-setup pods are already running as root, so no extra permissions are required.
Removing only SYN_SENT entries towards ClusterIPs and Scylla ports after a configurable and reasonable timeout should only cause more reconnection attempts.
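That alternative, flushing only the stuck entries instead of lowering the global timeout, would look roughly like this (a sketch, run periodically on the node; restricting it to the Scylla inter-node port keeps the blast radius small):

```bash
# Delete only half-open (SYN_SENT) entries towards the Scylla inter-node
# port; the client side just retries and gets a fresh DNAT to the new PodIP.
sudo conntrack -D -p tcp --dport 7000 --state SYN_SENT
```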
That's what I plan to look into next; maybe there's something else causing disruption.
Nope
I haven't tried different node teardown logic yet. It's something I want to try out later as well.