rook: Can't write to a 2-replica hybrid storage pool when I lose one node

I have a 3-node Rook cluster with 2 OSDs per node (one SSD and one spinning disk), for a total of 6 OSDs.

My ceph-filesystem is set up like this via the operator's Helm values:

- name: ceph-filesystem
  spec:
    metadataPool:
      replicated:
        size: 3
    dataPools:
      - failureDomain: host
        parameters:
          min_size: "1"
        replicated:
          size: 2
          hybridStorage:
            primaryDeviceClass: ssd
            secondaryDeviceClass: hdd
    metadataServer:
      activeCount: 1
      activeStandby: true
  storageClass:
    enabled: true
    isDefault: false
    name: ceph-filesystem
    reclaimPolicy: Delete
    parameters:
      # The secrets contain Ceph admin credentials.
      csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
      csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
      csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
      csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
      csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
      csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
      # Specify the filesystem type of the volume. If not specified, csi-provisioner
      # will set default as `ext4`. Note that `xfs` is not recommended due to potential deadlock
      # in hyperconverged settings where the volume is mounted on the same node as the osds.
      csi.storage.k8s.io/fstype: ext4

I would expect that losing one node should not impact cluster operation, but instead I can't write to the RBD pool or to the filesystem. I set min_size to 1 specifically to allow writes to continue if 1 of the 2 replicas fails (please note that this is not a production cluster).
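For reference, here is a minimal sketch of how the effective pool settings and the generated CRUSH rule can be checked from the rook-ceph toolbox, assuming the toolbox is deployed as rook-ceph-tools and the data pool got the name Rook usually derives for the first data pool, ceph-filesystem-data0 (adjust the name if yours differs):

# Open a shell in the toolbox pod
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash

# Confirm that size 2 / min_size 1 were actually applied to the data pool
ceph osd pool ls detail | grep ceph-filesystem-data0
ceph osd pool get ceph-filesystem-data0 min_size

# Inspect the hybrid CRUSH rule; it should place the primary copy on the ssd
# class and the remaining replica on the hdd class
ceph osd pool get ceph-filesystem-data0 crush_rule
ceph osd crush rule dump

# With one node down, see which PGs are degraded or blocked, and why
ceph health detail
ceph pg dump_stuck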

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: losing one node prevents writes to the cluster

Expected behavior: if I lose one node, all operations should continue to work

How to reproduce it (minimal and precise): shut down one of the three nodes, then try to write to the filesystem or to an RBD volume.

Version: rook-ceph-v1.7.8

File(s) to submit:

This is my CephCluster CR:

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  annotations:
    meta.helm.sh/release-name: rook-ceph-cluster
    meta.helm.sh/release-namespace: rook-ceph
  creationTimestamp: "2021-11-08T21:48:34Z"
  finalizers:
  - cephcluster.ceph.rook.io
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
  name: rook-ceph
  namespace: rook-ceph
  resourceVersion: "20950971"
  uid: 631e6ebd-d2ad-4bce-93fa-4bbe55eb343a
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v16.2.5
  cleanupPolicy:
    sanitizeDisks:
      dataSource: zero
      iteration: 1
      method: quick
  crashCollector: {}
  dashboard:
    enabled: true
  dataDirHostPath: /var/lib/rook
  disruptionManagement:
    machineDisruptionBudgetNamespace: openshift-machine-api
    managePodBudgets: true
    osdMaintenanceTimeout: 30
  external: {}
  healthCheck:
    daemonHealth:
      mon:
        interval: 45s
      osd:
        interval: 1m0s
      status:
        interval: 1m0s
    livenessProbe:
      mgr: {}
      mon: {}
      osd: {}
  logCollector: {}
  mgr:
    count: 1
    modules:
    - enabled: true
      name: pg_autoscaler
  mon:
    count: 3
  monitoring:
    enabled: true
    rulesNamespace: rook-ceph
  network: {}
  security:
    kms: {}
  storage:
    nodes:
    - devices:
      - config:
          deviceClass: hdd
        name: /dev/disk/by-id/scsi-SVMware_Virtual_disk_6000c295fd21a00ebf13e859663b32fc
      - config:
          deviceClass: ssd
        name: /dev/disk/by-id/scsi-SVMware_Virtual_disk_6000c29c65a21ff65a047b88b76511ec
      name: dev01
      resources: {}
    - devices:
      - config:
          deviceClass: hdd
        name: /dev/disk/by-id/scsi-SVMware_Virtual_disk_6000c29c1215beb93ca246cc9c581a71
      - config:
          deviceClass: ssd
        name: /dev/disk/by-id/scsi-SVMware_Virtual_disk_6000c29780b978f39f0a05c9e3a2e5c0
      name: dev02
      resources: {}
    - devices:
      - config:
          deviceClass: hdd
        name: /dev/disk/by-id/scsi-SVMware_Virtual_disk_6000c29ad59cc2f161e270430cf7cc91
      - config:
          deviceClass: ssd
        name: /dev/disk/by-id/scsi-SVMware_Virtual_disk_6000c29c21c16ae507a97051610749a9
      name: dev03
      resources: {}
    useAllDevices: false
  waitTimeoutForHealthyOSDInMinutes: 10
status:
  ceph:
    capacity:
      bytesAvailable: 144666574848
      bytesTotal: 257672871936
      bytesUsed: 113006297088
      lastUpdated: "2021-12-02T17:41:03Z"
    health: HEALTH_OK
    lastChanged: "2021-12-02T17:37:56Z"
    lastChecked: "2021-12-02T17:41:03Z"
    previousHealth: HEALTH_WARN
    versions:
      mds:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 2
      mgr:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 1
      mon:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 3
      osd:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 6
      overall:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 12
  conditions:
  - lastHeartbeatTime: "2021-12-02T17:41:04Z"
    lastTransitionTime: "2021-11-08T21:52:27Z"
    message: Cluster created successfully
    reason: ClusterCreated
    status: "True"
    type: Ready
  message: Cluster created successfully
  phase: Ready
  state: Created
  storage:
    deviceClasses:
    - name: hdd
    - name: ssd
  version:
    image: quay.io/ceph/ceph:v16.2.5
    version: 16.2.5-0

This is my CephFilesystem CR:

apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  annotations:
    meta.helm.sh/release-name: rook-ceph-cluster
    meta.helm.sh/release-namespace: rook-ceph
  creationTimestamp: "2021-11-08T21:48:34Z"
  finalizers:
  - cephfilesystem.ceph.rook.io
  generation: 4
  labels:
    app.kubernetes.io/managed-by: Helm
  name: ceph-filesystem
  namespace: rook-ceph
  resourceVersion: "20937676"
  uid: e9364f77-4821-4c99-9a83-4238adbfe785
spec:
  dataPools:
  - failureDomain: host
    parameters:
      min_size: "1"
    replicated:
      hybridStorage:
        primaryDeviceClass: ssd
        secondaryDeviceClass: hdd
      size: 2
  metadataPool:
    erasureCoded:
      codingChunks: 0
      dataChunks: 0
    mirroring: {}
    quotas: {}
    replicated:
      size: 3
    statusCheck:
      mirror: {}
  metadataServer:
    activeCount: 1
    activeStandby: true
    placement: {}
    resources: {}
  statusCheck:
    mirror: {}
status:
  phase: Ready

Environment:

  • OS (e.g. from /etc/os-release): NAME="Ubuntu" VERSION="20.04.3 LTS (Focal Fossa)"
  • Kernel (e.g. uname -a): Linux dev01 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: kubernetes cluster on rke2
  • Rook version (use rook version inside of a Rook Pod): rook: v1.7.3
  • Storage backend version (e.g. for ceph do ceph -v): ceph version 16.2.5
  • Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.6+rke2r1",
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): rke2
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

Generally I’d advise against using cache tiering - it’s not often going to give the performance benefits you might expect (see https://docs.ceph.com/en/latest/rados/operations/cache-tiering/#known-bad-workloads)

In terms of separating SSD and HDD, assuming you're doing this to benefit from the SSDs for read performance, the best approach today would be to use a standard single-level CRUSH rule, e.g. `step chooseleaf firstn 2 type host`, and use primary affinity, which operates above the CRUSH layer.

You can specify that SSDs should be primary, and HDDs should not. For example, you could set primary affinity for all the SSD-based OSDs to 1, and all the HDD-based OSDs to 0.

This isn’t a hard restriction - sometimes HDDs will still be primary, especially in failure conditions, but in normal circumstances you’ll benefit from having reads come from the SSDs.
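As a rough sketch of that approach with the Ceph CLI (the OSD IDs below are illustrative; use the CLASS column of ceph osd tree to see which of your OSDs are SSD-backed and which are HDD-backed):

# Identify SSD- vs HDD-backed OSDs by device class
ceph osd tree

# Prefer the SSD-backed OSDs as primaries (osd.1/3/5 are example IDs)
ceph osd primary-affinity osd.1 1.0
ceph osd primary-affinity osd.3 1.0
ceph osd primary-affinity osd.5 1.0

# Discourage the HDD-backed OSDs from acting as primary; they still hold a
# full replica and can take over as primary if the SSD copy is unavailable
ceph osd primary-affinity osd.0 0
ceph osd primary-affinity osd.2 0
ceph osd primary-affinity osd.4 0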