rook: cluster-stretched mode not surviving single zone failure

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

Expected behavior: when using cluster-stretched mode, the cluster should keep serving writes after a single zone is lost. Observed behavior: losing one zone stops writes for all active pods in the remaining zones.

How to reproduce it (minimal and precise): Create a brand-new cluster with the cluster-stretched option, then take one zone (dc7 in this case) offline.

File(s) to submit:

operator-config.yml, cluster-stretched.yml
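
The files themselves are not attached here. For reference, a stretched mon configuration in Rook normally follows the cluster-stretched.yaml example from the Rook repository; the sketch below is illustrative only, with the zone names taken from the node labels further down and dca assumed to be the arbiter zone (it holds no OSDs):

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  dataDirHostPath: /var/lib/rook
  mon:
    count: 5
    allowMultiplePerNode: false
    stretchCluster:
      failureDomainLabel: topology.kubernetes.io/zone
      subFailureDomain: host
      zones:
        - name: dca        # assumed arbiter zone: mons only, no OSDs
          arbiter: true
        - name: dcc
        - name: dc7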

Logs to submit:

Cluster Status to submit:

  cluster:
    id:     cafe52f6-f6d4-44e4-bccd-1db086abc318
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            1 MDSs report slow metadata IOs
            1 MDSs report slow requests
            2/5 mons down, quorum a,b,c
            243 slow ops, oldest one blocked for 817 sec, daemons [osd.0,osd.1,osd.3,mon.b] have slow ops.

  services:
    mon: 5 daemons, quorum a,b,c (age 13m), out of quorum: d, e
    mgr: b(active, since 2h)
    mds: 1/1 daemons up
    osd: 6 osds: 6 up (since 26m), 6 in (since 2h)

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 160 pgs
    objects: 1.08k objects, 3.2 GiB
    usage:   13 GiB used, 107 GiB / 120 GiB avail
    pgs:     157 active+clean
             3   active+clean+laggy

  • Output of krew commands, if necessary

    To get the health of the cluster, use kubectl rook-ceph health

ceph health
HEALTH_WARN insufficient standby MDS daemons available; 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; 2/5 mons down, quorum a,b,c; 111 slow ops, oldest one blocked for 847 sec, daemons [osd.0,osd.1,osd.3,mon.b] have slow ops.

Environment:

  • OS (e.g. from /etc/os-release):
AlmaLinux release 9.2 (Turquoise Kodkod)
  • Kernel (e.g. uname -a):
Linux master1 5.14.0-284.18.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jun 29 17:06:27 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: onPrem
  • Rook version (use rook version inside of a Rook Pod): rook version rook: v1.12.0 go: go1.20.5
  • Storage backend version (e.g. for ceph do ceph -v):
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.5", GitCommit:"890a139214b4de1f01543d15003b5bda71aae9c7", GitTreeState:"clean", BuildDate:"2023-05-17T14:14:46Z", GoVersion:"go1.19.9", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.5", GitCommit:"890a139214b4de1f01543d15003b5bda71aae9c7", GitTreeState:"clean", BuildDate:"2023-05-17T14:08:49Z", GoVersion:"go1.19.9", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): kubespray onprem

  • kubectl get nodes

kubectl get nodes  -L topology.kubernetes.io/zone
NAME      STATUS     ROLES           AGE     VERSION   ZONE
master1   Ready      control-plane   7h      v1.26.5   dcc
master2   NotReady   control-plane   6h59m   v1.26.5   dc7
master3   Ready      control-plane   6h58m   v1.26.5   dca
node0     Ready      <none>          6h56m   v1.26.5   dca
node1     Ready      <none>          6h56m   v1.26.5   dcc
node2     NotReady   <none>          6h57m   v1.26.5   dc7
node3     Ready      <none>          6h57m   v1.26.5   dcc
node4     NotReady   <none>          6h57m   v1.26.5   dc7

Slack initial discussion

About this issue

  • State: closed
  • Created a year ago
  • Comments: 30 (5 by maintainers)

Most upvoted comments

Try ceph health detail to see which osd has it set, then ceph osd rm-noout osd.<ID>
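
A minimal sketch of that sequence, assuming ceph health detail reports the flag on individual OSDs (the ID below is only a placeholder):

ceph health detail | grep -i noout   # shows which OSDs have the NOOUT flag set
ceph osd rm-noout osd.2              # placeholder ID; repeat for each flagged OSD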

By the way, after a really long time (I will try to get the exact time, but I am sure it was over 30 minutes) the cluster status changed to:

  cluster:
    id:     cafe52f6-f6d4-44e4-bccd-1db086abc318
    health: HEALTH_WARN
            We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer
            insufficient standby MDS daemons available
            2/5 mons down, quorum a,b,c
            3 osds down
            1 host (3 osds) down
            1 zone (3 osds) down
            Degraded data redundancy: 2398/4796 objects degraded (50.000%), 94 pgs degraded, 160 pgs undersized

  services:
    mon: 5 daemons, quorum a,b,c (age 15m), out of quorum: d, e
    mgr: b(active, since 3h)
    mds: 1/1 daemons up
    osd: 6 osds: 3 up (since 15m), 6 in (since 3h)

  data:
    volumes: 1/1 healthy
    pools:   6 pools, 160 pgs
    objects: 1.20k objects, 4.0 GiB
    usage:   16 GiB used, 104 GiB / 120 GiB avail
    pgs:     2398/4796 objects degraded (50.000%)
             94 active+undersized+degraded
             66 active+undersized

  io:
    client:   1.9 MiB/s wr, 0 op/s rd, 3 op/s wr

ID  CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
-1         0.11691  root default
-8         0.05846      zone dc7
-7         0.05846          host node2
 2    hdd  0.01949              osd.2     down   1.00000  1.00000
 4    hdd  0.01949              osd.4     down   1.00000  1.00000
 5    hdd  0.01949              osd.5     down   1.00000  1.00000
-4         0.05846      zone dcc
-3         0.05846          host node1
 0    hdd  0.01949              osd.0       up   1.00000  1.00000
 1    hdd  0.01949              osd.1       up   1.00000  1.00000
 3    hdd  0.01949              osd.3       up   1.00000  1.00000

and the writing resumed.
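
The warning "We are missing stretch mode buckets, only requiring 1 of 2 buckets to peer" is the indicator that the monitors switched into degraded stretch mode, the state in which peering and writes are allowed with only one data zone available. A quick way to check for it, assuming the stretch fields are exposed in the mon/OSD maps on this Ceph release:

ceph mon dump | grep -i stretch   # stretch_mode_enabled plus per-mon CRUSH locations
ceph osd dump | grep -i stretch   # degraded/recovering stretch mode fields, if present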

@jsalatiel The stretch config looks correct, and the apps are running in the datacenters that are still up. Could you also provide the output of ceph osd tree and ceph osd pool ls detail?
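
For anyone reproducing this, both can be collected from the Rook toolbox, assuming the standard rook-ceph-tools deployment in the rook-ceph namespace:

kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd pool ls detail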

@kamoltat Can you take a look at why the stretch cluster writes might not be working when one dc is down? What logs would help? thanks