rook: cluster-stretched mode not surviving single zone failure
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: When using cluster-stretched mode, losing a single zone stops writes for all active pods in the remaining zones.
Expected behavior: Writes should continue from pods in the surviving zones when a single zone goes down.
How to reproduce it (minimal and precise): A brand-new cluster was created to test the cluster-stretched option, then the nodes of a single zone (dc7, per the node listing below) were taken down.
File(s) to submit:
- operator-config.yml
- cluster-stretched.yml
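For reference, a minimal sketch of the mon section that cluster-stretched.yml would typically contain for Rook's stretch mode is shown below; the zone names are taken from the node listing further down, and which zone acts as the arbiter is an assumption, since the file itself is not reproduced here.

```yaml
# Sketch of the stretch-relevant part of a CephCluster spec (assumed values).
mon:
  count: 5
  allowMultiplePerNode: false
  stretchCluster:
    failureDomainLabel: topology.kubernetes.io/zone
    zones:
      - name: dca        # assumed to be the arbiter zone
        arbiter: true
      - name: dcc
      - name: dc7
```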
Logs to submit:
Cluster Status to submit:
cluster:
id: cafe52f6-f6d4-44e4-bccd-1db086abc318
health: HEALTH_WARN
insufficient standby MDS daemons available
1 MDSs report slow metadata IOs
1 MDSs report slow requests
2/5 mons down, quorum a,b,c
243 slow ops, oldest one blocked for 817 sec, daemons [osd.0,osd.1,osd.3,mon.b] have slow ops.
services:
mon: 5 daemons, quorum a,b,c (age 13m), out of quorum: d, e
mgr: b(active, since 2h)
mds: 1/1 daemons up
osd: 6 osds: 6 up (since 26m), 6 in (since 2h)
data:
volumes: 1/1 healthy
pools: 6 pools, 160 pgs
objects: 1.08k objects, 3.2 GiB
usage: 13 GiB used, 107 GiB / 120 GiB avail
pgs: 157 active+clean
3 active+clean+laggy
- Output of krew commands, if necessary
To get the health of the cluster, use kubectl rook-ceph health
ceph health
HEALTH_WARN insufficient standby MDS daemons available; 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 2/5 mons down, quorum a,b,c; 111 slow ops, oldest one blocked for 847 sec, daemons [osd.0,osd.1,osd.3,mon.b] have slow ops.
Environment:
- OS (e.g. from /etc/os-release):
AlmaLinux release 9.2 (Turquoise Kodkod)
- Kernel (e.g. uname -a):
Linux master1 5.14.0-284.18.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jun 29 17:06:27 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
- Cloud provider or hardware configuration: onPrem
- Rook version (use rook version inside of a Rook Pod):
rook: v1.12.0
go: go1.20.5
- Storage backend version (e.g. for ceph do ceph -v):
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.5", GitCommit:"890a139214b4de1f01543d15003b5bda71aae9c7", GitTreeState:"clean", BuildDate:"2023-05-17T14:14:46Z", GoVersion:"go1.19.9", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.5", GitCommit:"890a139214b4de1f01543d15003b5bda71aae9c7", GitTreeState:"clean", BuildDate:"2023-05-17T14:08:49Z", GoVersion:"go1.19.9", Compiler:"gc", Platform:"linux/amd64"}
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): kubespray onprem
- kubectl get nodes:
kubectl get nodes -L topology.kubernetes.io/zone
NAME STATUS ROLES AGE VERSION ZONE
master1 Ready control-plane 7h v1.26.5 dcc
master2 NotReady control-plane 6h59m v1.26.5 dc7
master3 Ready control-plane 6h58m v1.26.5 dca
node0 Ready <none> 6h56m v1.26.5 dca
node1 Ready <none> 6h56m v1.26.5 dcc
node2 NotReady <none> 6h57m v1.26.5 dc7
node3 Ready <none> 6h57m v1.26.5 dcc
node4 NotReady <none> 6h57m v1.26.5 dc7
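The ZONE column above comes from the standard topology label on each node; as a minimal illustration (using node1 and zone dcc from the listing above), the relevant node metadata looks like this:

```yaml
# Illustration only: how the zone value shown above is attached to a node.
apiVersion: v1
kind: Node
metadata:
  name: node1
  labels:
    topology.kubernetes.io/zone: dcc   # matches the ZONE column above
```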
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 30 (5 by maintainers)
Try ceph health detail to see which osd has it set, then ceph osd rm-noout osd.<ID>

btw, after a really long time (I will try to get the exact time, but over 30 minutes I am sure) the cluster status changed to:
and the writing resumed.
@jsalatiel The stretch config looks correct, and the apps are running in the datacenters that are still up. Could you also provide:
ceph osd tree
ceph osd pool ls detail

@kamoltat Can you take a look at why the stretch cluster writes might not be working when one dc is down? What logs would help? Thanks