scylla-operator: During a rollout restart, the operator doesn't wait until the previously restarted pod becomes part of the Scylla cluster

Describe the bug
If we change the number of cores for Scylla (which triggers a rolling restart of the pods), we hit the following race condition: the first pod gets restarted, and if it rejoins the cluster with some unexpected delay, the operator doesn't wait for it and restarts the second pod. In a cluster of 3 nodes this leaves the cluster inoperable, without quorum, for several minutes. It happens on GKE and doesn't happen on EKS, because on EKS a restarted pod rejoins the cluster much faster.

Below are the logs from a run on a GKE cluster. pod-2 (restarted first) logs:

2022-10-27 18:40:55,710 INFO waiting for scylla to stop
INFO  2022-10-27 18:40:55,710 [shard 0] init - Signal received; shutting down
...
INFO  2022-10-27 18:40:57,897 [shard 0] gossip - Gossip is now stopped
...
INFO  2022-10-27 18:40:59,866 [shard 0] init - Scylla version 5.1.0~rc3-0.20221009.9deeeb4db1cd shutdown complete.
2022-10-27 18:40:59,940 INFO stopped: scylla (exit status 0)
...
I1027 18:41:03.305900       1 operator/sidecar.go:438] "Scylla process finished"
rpc error: code = NotFound desc = an error occurred when try to find container "ef77bed95aa5838282bc5d55420a2718d6b111912fc2954dc645c35c7ce87d3f": not found
I1027 18:41:11.371885       1 operator/sidecar.go:147] sidecar version "v1.8.0-alpha.0-133-g97c831e"
...
I1027 18:41:12.363766       1 operator/sidecar.go:360] "Starting scylla"
...
INFO  2022-10-27 18:41:15,764 [shard 0] database - Resharded 7kB for system.compaction_history in 0.54 seconds, 13kB/s
...
INFO  2022-10-27 18:41:18,583 [shard 0] init - loading system_schema sstables
...
INFO  2022-10-27 18:41:19,311 [shard 0] init - setting up system keyspace
...
INFO  2022-10-27 18:41:21,558 [shard 0] database - Resharded 235MB for keyspace1.standard1 in 2.02 seconds, 116MB/s
...
INFO  2022-10-27 18:41:21,621 [shard 0] storage_service - entering STARTING mode
...
INFO  2022-10-27 18:41:21,700 [shard 0] storage_service - Starting up server gossip
...
E1027 18:41:24.931854       1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
...
E1027 18:41:34.932431       1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
...
INFO  2022-10-27 18:41:44,256 [shard 0] storage_service - entering JOINING mode
...
INFO  2022-10-27 18:41:44,828 [shard 0] storage_service - Node 10.80.11.2 state jump to normal
E1027 18:41:44.931320       1 sidecar/probes.go:110] "readyz probe: can't get scylla native transport" err="giving up after 1 attempts: agent [HTTP 404] Not found" Service="scylla/sct-cluster-us-east1-b-us-east1-2" Node="10.80.11.2"
INFO  2022-10-27 18:41:45,035 [shard 0] storage_service - entering NORMAL mode
...
INFO  2022-10-27 18:41:57,051 [shard 0] init - Scylla version 5.1.0~rc3-0.20221009.9deeeb4db1cd initialization completed.
WARN  2022-10-27 18:42:45,043 [shard 0] cdc - Could not update CDC description table with generation (2022/10/27 18:24:22, 69265668-5186-4a22-a2b1-6b6bad5a0f55): exceptions::unavailable_exception (Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1). Will try again.
WARN  2022-10-27 18:43:14,796 [shard 0] gossip - Skip marking node 10.80.4.107 with status = shutdown as UP
INFO  2022-10-27 18:43:14,796 [shard 0] gossip - InetAddress 10.80.14.183 is now UP, status = NORMAL
INFO  2022-10-27 18:43:14,798 [shard 0] storage_service - Node 10.80.14.183 state jump to normal
INFO  2022-10-27 18:43:14,801 [shard 0] storage_service - Node 10.80.4.107 state jump to normal
INFO  2022-10-27 18:43:14,950 [shard 0] gossip - InetAddress 10.80.12.113 is now DOWN, status = LEFT
INFO  2022-10-27 18:43:14,952 [shard 0] gossip - Node 10.80.12.113 will be removed from gossip at [2022-10-30 18:31:53]: (expire = 1667154713831435293, now = 1666896194952348069, diff = 258518 seconds)
...
WARN  2022-10-27 18:43:36,752 [shard 0] gossip - Fail to send EchoMessage to 10.80.4.107: seastar::rpc::closed_error (connection is closed)
INFO  2022-10-27 18:43:36,892 [shard 0] gossip - InetAddress 10.80.4.107 is now UP, status = NORMAL

pod-1 (restarted second) logs:

...
INFO  2022-10-27 18:32:55,041 [shard 0] gossip - 60000 ms elapsed, 10.80.12.113 gossip quarantine over
INFO  2022-10-27 18:40:55,897 [shard 0] gossip - Got shutdown message from 10.80.11.2, received_generation=1666893961, local_generation=1666893961
INFO  2022-10-27 18:40:55,898 [shard 0] gossip - InetAddress 10.80.11.2 is now DOWN, status = shutdown
2022-10-27 18:42:05,276 INFO waiting for scylla to stop
INFO  2022-10-27 18:42:05,291 [shard 0] compaction_manager - Asked to stop
...

So, from the logs above we can see that at the time pod-1 (restarted second) was told to shut down, the first restarted pod was not yet fully back in the cluster. The quorum error from CDC proves it.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy operator
  2. Deploy 3-pod Scylla cluster on GKE
  3. Run some constant load
  4. Change the number of CPUs in the ScyllaCluster spec (see the example after this list)
  5. Wait for the restart of the second pod
  6. See quorum errors on the loader side.
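For illustration, step 4 can be done with something along these lines (a sketch only; the cluster, namespace, and rack names are taken from the logs below, and the exact resource fields depend on how CPUs are configured in the ScyllaCluster spec):

# Hypothetical example: bump the CPU count of the first rack of the "sct-cluster"
# ScyllaCluster in the "scylla" namespace.
kubectl -n scylla patch scyllacluster sct-cluster --type=json -p='[
  {"op": "replace", "path": "/spec/datacenter/racks/0/resources/limits/cpu", "value": "2"},
  {"op": "replace", "path": "/spec/datacenter/racks/0/resources/requests/cpu", "value": "2"}
]'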

Expected behavior
The operator should wait for the previous pod to rejoin the Scylla cluster before restarting the next one.

Logs
K8S logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/kubernetes-0512c157.tar.gz
DB logs: https://cloudius-jenkins-test.s3.amazonaws.com/0512c157-c4d7-4d3f-9b44-2bcce9d34de9/20221027_191640/db-cluster-0512c157.tar.gz

Environment:

  • Platform: GKE
  • Kubernetes version: 1.22
  • Scylla version: 5.0.5, 5.1.0-rc3
  • Scylla-operator version: v1.8.0-alpha.0-133-g97c831e, v1.7.4

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 37 (19 by maintainers)

Most upvoted comments

After my last experiment, I aggregated the errors by type.

Most of the errors were unavailability errors, and I found out that in setups where the total number of errors is low, almost all errors show up while a node is tearing down.

I looked up the Scylla documentation about how nodes should be torn down, and we have a mismatch. Changing the PreStopHook to nodetool drain + supervisorctl stop scylla caused Scylla to start printing lots of messages about operations being aborted on other nodes. I found a related issue (https://github.com/scylladb/scylladb/issues/10447); unfortunately the fix is not easy to backport, so it wasn't backported to recent versions. Apparently master was fixed, but when I tried it I wasn't able to restart any node because Scylla got stuck. I tried older versions and found out that 5.0.13 doesn't print these abort-operation messages, and it also fixed the failures happening during node teardown. On the PodIP setup I was left with only 2-3 EOF failures, which are either a Scylla bug (connections not shut down gracefully) or the gocql driver misbehaving. I left these unresolved to proceed further; we can tackle them later.
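For reference, a minimal sketch of that PreStop sequence (the exact hook wiring in the operator may differ):

# Drain the node so it stops serving traffic and flushes, then stop the Scylla process.
nodetool drain
supervisorctl stop scylla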

This unfortunately didn't solve the traffic disruption on ClusterIP setups, where errors showed more than 1 node being down. This meant there was a split brain in the gossip state; looking at nodetool status on all nodes confirmed it.

I initially thought that maybe kube-proxy lags and iptables are not updated fast enough, but I ruled that out, as experiments showed the Service ClusterIP mappings in iptables are updated right after the Pod is recreated with a new PodIP.
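One way to check this on a node (an illustrative command, not necessarily how it was verified in the experiment; <service-cluster-ip> is a placeholder):

# Dump the nat table and look for the member Service ClusterIP to confirm
# that its DNAT target already points at the new PodIP.
iptables-save -t nat | grep <service-cluster-ip>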

Scylla keeps 4 connections between each shard and node on port 7000, which is used for inter-node communication. One of them is used for gossip. These connections are lazily initialized, so there may be fewer than 4 if one was dropped and nothing has needed it since.

If we look at a stable 2-node ClusterIP cluster where each node has 1 shard, we can see the existing connections and their mapping using conntrack. Brief info about the cluster state:

pod-0: 10.85.0.28
pod-1: 10.85.0.31

svc-0: 10.106.100.33
svc-1: 10.99.34.233
tcp      6 431999 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=61507 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=61507 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=58256 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=58256 [ASSURED] mark=0 use=1
tcp      6 431972 ESTABLISHED src=10.85.0.31 dst=10.106.100.33 sport=60959 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=60959 [ASSURED] mark=0 use=1
tcp      6 431972 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=54112 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=54112 [ASSURED] mark=0 use=1 
tcp      6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=49551 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49551 [ASSURED] mark=0 use=1 
tcp      6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=55568 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=55568 [ASSURED] mark=0 use=1 
tcp      6 431943 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=49506 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49506 [ASSURED] mark=0 use=1 

There are 3 connections from pod-1 to svc-0, NATed to the correct pod-0 address, and 4 connections from pod-0 to svc-1, NATed to the correct pod-1 address.
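Listings like the ones above and below can be obtained on the Kubernetes node with something along these lines (an illustrative invocation; requires conntrack-tools and root):

# List tracked TCP sessions towards the Scylla inter-node port.
conntrack -L -p tcp --dport 7000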

When pod-1 is being deleted, conntrack shows multiple attempts where pod-0 tries to reconnect to svc-1 but fails, which is expected. When pod-1 is recreated with the new IP 10.85.0.32, we can see 3 ongoing attempts to connect to svc-1, but with the old pod-1 IP address, stuck in the SYN_SENT state, meaning they are awaiting a SYN-ACK:

tcp      6 75 SYN_SENT src=10.85.0.28 dst=10.99.34.233 sport=53454 dport=7000 [UNREPLIED] src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=53454 mark=0 use=1
tcp      6 74 SYN_SENT src=10.85.0.28 dst=10.99.34.233 sport=57715 dport=7000 [UNREPLIED] src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=57715 mark=0 use=1
tcp      6 75 SYN_SENT src=10.85.0.28 dst=10.99.34.233 sport=64317 dport=7000 [UNREPLIED] src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=64317 mark=0 use=1

Between when pod-1 was deleted and the new one was recreated, pod-0 tried to reconnect, but the traffic was blackholed, meaning the SYN was lost and the session has to time out. At the same time, we can see that the old sessions entered the TIME_WAIT state, which is normal, and that the new pod-1 managed to connect to svc-0:

tcp      6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=54112 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=54112 [ASSURED] mark=0 use=2       
tcp      6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=49551 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49551 [ASSURED] mark=0 use=1       
tcp      6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=55568 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=55568 [ASSURED] mark=0 use=1       
tcp      6 69 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=58760 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=58760 [ASSURED] mark=0 use=1       
tcp      6 65 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=49506 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=49506 [ASSURED] mark=0 use=1       
tcp      6 69 TIME_WAIT src=10.85.0.31 dst=10.106.100.33 sport=61507 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=61507 [ASSURED] mark=0 use=1     
tcp      6 69 TIME_WAIT src=10.85.0.31 dst=10.106.100.33 sport=58256 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=58256 [ASSURED] mark=0 use=1      
tcp      6 69 TIME_WAIT src=10.85.0.28 dst=10.99.34.233 sport=50906 dport=7000 src=10.85.0.31 dst=10.85.0.28 sport=7000 dport=50906 [ASSURED] mark=0 use=1      
tcp      6 69 TIME_WAIT src=10.85.0.31 dst=10.106.100.33 sport=60959 dport=7000 src=10.85.0.28 dst=10.85.0.31 sport=7000 dport=60959 [ASSURED] mark=0 use=1      
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=62421 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=62421 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=55614 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=55614 [ASSURED] mark=0 use=1
tcp      6 431998 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=60774 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=60774 [ASSURED] mark=0 use=1

After the SYN_SENT sessions expired, there were no sessions from pod-0 to svc-1, only from pod-1 to svc-0:

tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=62421 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=62421 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=55614 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=55614 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=60774 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=60774 [ASSURED] mark=0 use=1

Eventually pod-0 tried to reconnect and succeeded with the correct pod-1 IP:

tcp      6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=54055 dport=7000 src=10.85.0.32 dst=10.85.0.28 sport=7000 dport=54055 [ASSURED] mark=0 use=1 
tcp      6 431991 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=55146 dport=7000 src=10.85.0.32 dst=10.85.0.28 sport=7000 dport=55146 [ASSURED] mark=0 use=1 
tcp      6 431999 ESTABLISHED src=10.85.0.28 dst=10.99.34.233 sport=52300 dport=7000 src=10.85.0.32 dst=10.85.0.28 sport=7000 dport=52300 [ASSURED] mark=0 use=1 
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=55614 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=55614 [ASSURED] mark=0 use=1
tcp      6 431991 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=60774 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=60774 [ASSURED] mark=0 use=1
tcp      6 431999 ESTABLISHED src=10.85.0.32 dst=10.106.100.33 sport=62421 dport=7000 src=10.85.0.28 dst=10.85.0.32 sport=7000 dport=62421 [ASSURED] mark=0 use=1

Looks like the stuck SYN_SENT sessions are the root cause of our gossip split brain, as nodes cannot establish a session to the node being restarted. The restarted node reports everyone as UN, since its own outgoing connections work, while the other nodes are stuck on their connection attempts to it. Immediately after these stuck sessions expire, new sessions are established and the gossip state synchronizes.

On GKE, SYN packets are retransmitted at most 6 times, meaning that from the first SYN to failure it can take 127s, and then Scylla needs to reconnect, which can take another couple of seconds, and then exchange gossip.

To solve it we have several options:

  • Introduce a new configuration value into Scylla controlling the timeout on rpc client connect - a new feature, meaning we would only get it in the next version.
  • Remove UNREPLIED conntrack entries for ScyllaCluster Pods (see the sketch after this list). When the next SYN retransmission happens and there's no conntrack entry, the connection attempt fails and Scylla reconnects. This could be a good workaround until the timeout is introduced in Scylla, as we already have privileged nodesetup Pods running in the host network on every Scylla node.
  • Set minReadySeconds to a value high enough to outlast the maximum SYN retransmission time. This has the big downside of increasing cluster bootstrap time.
  • Come back to the initial idea of collecting the state of all (QUORUM?) nodes in the readiness probe.
  • Maybe something else; suggestions are welcome.
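As a sketch of the second option, the cleanup run from a privileged host-network pod could look roughly like this (an assumed invocation, not something implemented today):

# Drop stuck, unanswered connection attempts towards the Scylla inter-node port,
# so the next SYN retransmission fails fast and Scylla reconnects to the new PodIP.
conntrack -D -p tcp --dport 7000 --state SYN_SENT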

I ran experiments on different setups on GKE with minReadySeconds on StatefulSets - in short, it is a delay between updating one node and the next.
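For illustration, minReadySeconds can be set on a StatefulSet like this (a sketch only; the StatefulSet name is inferred from the Service name in the logs, on Kubernetes 1.22/1.23 the StatefulSetMinReadySeconds feature gate may need to be enabled, and the operator may overwrite manual StatefulSet changes):

# Hypothetical: make the rollout wait 60s after a Pod becomes Ready
# before proceeding to the next one.
kubectl -n scylla patch statefulset sct-cluster-us-east1-b-us-east1 \
  --type=merge -p '{"spec":{"minReadySeconds":60}}'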

Each setup consisted of 3 unthrottled loaders sending read requests continuously using the gocql driver with default config. After the cluster was fully up and running, the test triggered a rolling restart. Traffic was stopped 5s after the restart completed.

| setup | requests | failures | success ratio |
|---|---|---|---|
| ClusterIP TLS minReadySeconds=0 | 360008 | 278367 | 0.226 |
| ClusterIP TLS minReadySeconds=60 | 161707 | 18085 | 0.888 |
| ClusterIP TLS minReadySeconds=120 | 293230 | 12554 | 0.957 |
| PodIP TLS minReadySeconds=0 | 80427 | 78 | 0.999 |
| PodIP TLS minReadySeconds=60 | 118925 | 53 | 0.999 |
| PodIP TLS minReadySeconds=120 | 244028 | 74 | 0.999 |

The discrepancy between the ClusterIP and PodIP results suggests that kube-proxy, which provides the ClusterIP overlay, might be causing most of the failures. To verify whether that's the case, I repeated the tests on GKE with Dataplane V2 (Cilium managed by Google), where kube-proxy is not present.

| setup | requests | failures | success ratio |
|---|---|---|---|
| ClusterIP TLS minReadySeconds=0 | 54367 | 979 | 0.982 |
| ClusterIP TLS minReadySeconds=60 | 113112 | 441 | 0.996 |
| ClusterIP TLS minReadySeconds=120 | 201340 | 789 | 0.996 |
| PodIP TLS minReadySeconds=0 | 61446 | 1701 | 0.972 |
| PodIP TLS minReadySeconds=60 | 114669 | 159 | 0.999 |
| PodIP TLS minReadySeconds=120 | 198945 | 638 | 0.997 |

The results show that the Operator is not able to provide a 100% success rate in any setup, even when minReadySeconds is high. Still, setting it to 60s would help a lot on the default ClusterIP configuration while not influencing bootstrap time that much.

Similar results were observed when the traffic was non-TLS.

@vponomaryov btw, does QA have a test that tries to write to Scylla with CL=LOCAL_QUORUM during a rolling restart, as in https://docs.scylladb.com/stable/operating-scylla/procedures/config-change/rolling-restart.html ?

When working with one DC, LOCAL_QUORUM=QUORUM, and all our tests are writing/reading with QUORUM.

Issues in Scylla regarding loss of availability during node shutdown on newer versions and the rpc_client connection timeout: https://github.com/scylladb/scylladb/issues/15899 https://github.com/scylladb/scylladb/issues/15901

I added setting the conntrack TCP timeout for SYN_SENT entries to our node setup DaemonSet; it solved the availability issues on both ClusterIP and PodIP setups without minReadySeconds.
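A minimal sketch of what that setting boils down to when applied from a privileged node-setup container (the actual DaemonSet wiring may differ):

# Shorten how long an unanswered SYN (SYN_SENT) conntrack entry is kept, so that
# connection attempts towards a stale PodIP fail fast and Scylla reconnects.
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_syn_sent=20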

| setup | requests | EOF/request timeout | availability failures | success ratio |
|---|---|---|---|---|
| ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 56621 | 4 | 0 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 56411 | 4 | 0 | 0.999 |

We no longer see any availability errors, meaning Scylla rolls out without losing quorum. The EOFs might be a Scylla or gocql bug, not related to the rollout. Timeouts may happen, as the setups run with low, non-guaranteed resources.

Since we found a configuration where we no longer observe any availability issues, I verified how different Scylla versions behave.

| setup | requests | EOF/request timeout | availability failures | success ratio |
|---|---|---|---|---|
| ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 33926 | 3 | 0 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=20s | 33750 | 4 | 0 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=20s | 31458 | 1 | 19 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=20s | 34919 | 0 | 15 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=20s | 30176 | 2 | 16 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=20s | 30512 | 3 | 13 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=20s | 32213 | 1 | 11 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=20s | 31847 | 2 | 20 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=20s | 40221 | 1 | 5 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=20s | 40892 | 2 | 7 | 0.999 |

Versions >=5.1 cause request failures while a node is shutting down.

Changing net.netfilter.nf_conntrack_tcp_timeout_syn_sent to 60s breaks ClusterIP scenarios:

| setup | requests | EOF/request timeout | availability failures | success ratio |
|---|---|---|---|---|
| ClusterIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=60s | 35792 | 9 | 0 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.0.12 conntrack_timeout_syn_sent=60s | 35688 | 2 | 0 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=60s | 36021 | 1 | 21 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.1.18 conntrack_timeout_syn_sent=60s | 37997 | 9 | 2620 | 0.93 |
| ClusterIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=60s | 36494 | 10 | 3055 | 0.92 |
| PodIP minReadySeconds=0 Scylla=5.2.9 conntrack_timeout_syn_sent=60s | 33809 | 3 | 10 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=60s | 38529 | 12 | 2816 | 0.93 |
| PodIP minReadySeconds=0 Scylla=5.3.0-rc0 conntrack_timeout_syn_sent=60s | 40817 | 2 | 9 | 0.999 |
| PodIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=60s | 41197 | 1 | 3 | 0.999 |
| ClusterIP minReadySeconds=0 Scylla=5.4.0-rc0 conntrack_timeout_syn_sent=60s | 42154 | 12 | 3599 | 0.91 |

Looks like setting net.netfilter.nf_conntrack_tcp_timeout_syn_sent to 20s would fix ClusterIP setups, as it would enforce a shorter timeout on rpc_client connection attempts. Adding a configuration option to Scylla that allows controlling this timeout would let us get rid of this workaround. We also need to fix Scylla, as supported versions have a bug causing availability issues when a node is shutting down.

On GKE, SYN packets are retransmitted at most 6 times, meaning that from the first SYN to failure it can take 127s

Just to understand it, could you explain the calculation here?

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

tcp_syn_retries - INTEGER Number of times initial SYNs for an active TCP connection attempt will be retransmitted. Should not be higher than 127. Default value is 6, which corresponds to 63seconds till the last retransmission with the current initial RTO of 1second. With this the final timeout for an active TCP connection attempt will happen after 127seconds.
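In other words, with an initial RTO of 1 second and exponential backoff, the waits between transmissions are 1+2+4+8+16+32 = 63 seconds until the last (6th) retransmission, and after a final 64-second wait the attempt fails, 127 seconds after the first SYN.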


Remember that such a workaround would require running with additional Linux capabilities.

Our nodesetup pods are already running as root, so no extra permissions are required.

I also wonder if that wouldn't affect the stability of otherwise healthy clusters.

Removing only SYN_SENT entries towards ClusterIPs and Scylla ports, after a configurable and reasonable timeout, should only cause more reconnection attempts.

I recall you mentioned 180s not being a value high enough to completely rid us of errors, which is quite surprising now given the 127s above. Was that caused by a higher syn_sent conntrack timeout?

That's what I plan to look into next; maybe there's something else causing the disruption.

It's worth explicitly stating that the issue comes from running kube-proxy, which depends on netfilter's conntrack. From my understanding, at this point we're not sure if this is specific to running kube-proxy in iptables mode, or if it also occurs in ipvs mode. @zimnx have you tried it?

Nope

Like you said above, you haven't hit the issue while running in GKE with Dataplane V2 (kube-proxy-less Cilium).

I haven't tried with different node teardown logic. It's something I want to try out later as well.