rook: Can't write to a 2-replica hybrid storage pool when I lose one node
I have a 3-node Rook cluster with 2 OSDs per node, one SSD and one spinning disk, for a total of 6 OSDs.
My ceph-filesystem is set up like this with the operator:
```yaml
- name: ceph-filesystem
  spec:
    metadataPool:
      replicated:
        size: 3
    dataPools:
      - failureDomain: host
        parameters:
          min_size: "1"
        replicated:
          size: 2
          hybridStorage:
            primaryDeviceClass: ssd
            secondaryDeviceClass: hdd
    metadataServer:
      activeCount: 1
      activeStandby: true
  storageClass:
    enabled: true
    isDefault: false
    name: ceph-filesystem
    reclaimPolicy: Delete
    parameters:
      # The secrets contain Ceph admin credentials.
      csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
      csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
      csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
      csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
      csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
      csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
      # Specify the filesystem type of the volume. If not specified, csi-provisioner
      # will set default as `ext4`. Note that `xfs` is not recommended due to potential deadlock
      # in hyperconverged settings where the volume is mounted on the same node as the osds.
      csi.storage.k8s.io/fstype: ext4
```
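For reference, once the operator has reconciled this, the settings it actually applied to the data pool can be checked from the rook-ceph-tools toolbox. A minimal sketch using standard Ceph commands; the pool name `ceph-filesystem-data0` is an assumption based on Rook's default `<filesystem>-data0` naming:

```shell
# Run inside the toolbox pod (deployment name assumes the standard toolbox manifest):
# kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash

# All pools with their size, min_size and crush rule in one view
ceph osd pool ls detail

# Or query the CephFS data pool directly (pool name assumed, see above)
ceph osd pool get ceph-filesystem-data0 size
ceph osd pool get ceph-filesystem-data0 min_size
ceph osd pool get ceph-filesystem-data0 crush_rule
```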
I would expect that losing one node should not impact cluster operation; instead I can't write to the RBD or the filesystem. I set min_size to 1 to allow writes to the cluster if 1 of the 2 replicas fails (please note that this is not a production cluster).
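For context on where writes get stuck while the node is down, the PG and OSD state can be inspected from the toolbox; a small sketch with standard Ceph commands (the pool name is again the assumed Rook default):

```shell
# Why I/O is blocked: down OSDs, inactive/undersized PGs, etc.
ceph health detail

# PGs stuck in a non-clean state and the OSDs they map to
ceph pg dump_stuck

# Per-pool view of PG states and acting sets (pool name assumed)
ceph pg ls-by-pool ceph-filesystem-data0
```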
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: losing one node prevents writes to the cluster
Expected behavior: if i lose one node, all operations should continue to work
How to reproduce it (minimal and precise):
File(s) to submit: version is rook-ceph-v1.7.8
This is my Ceph cluster:
```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  annotations:
    meta.helm.sh/release-name: rook-ceph-cluster
    meta.helm.sh/release-namespace: rook-ceph
  creationTimestamp: "2021-11-08T21:48:34Z"
  finalizers:
  - cephcluster.ceph.rook.io
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
  name: rook-ceph
  namespace: rook-ceph
  resourceVersion: "20950971"
  uid: 631e6ebd-d2ad-4bce-93fa-4bbe55eb343a
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v16.2.5
  cleanupPolicy:
    sanitizeDisks:
      dataSource: zero
      iteration: 1
      method: quick
  crashCollector: {}
  dashboard:
    enabled: true
  dataDirHostPath: /var/lib/rook
  disruptionManagement:
    machineDisruptionBudgetNamespace: openshift-machine-api
    managePodBudgets: true
    osdMaintenanceTimeout: 30
  external: {}
  healthCheck:
    daemonHealth:
      mon:
        interval: 45s
      osd:
        interval: 1m0s
      status:
        interval: 1m0s
    livenessProbe:
      mgr: {}
      mon: {}
      osd: {}
  logCollector: {}
  mgr:
    count: 1
    modules:
    - enabled: true
      name: pg_autoscaler
  mon:
    count: 3
  monitoring:
    enabled: true
    rulesNamespace: rook-ceph
  network: {}
  security:
    kms: {}
  storage:
    nodes:
    - devices:
      - config:
          deviceClass: hdd
        name: /dev/disk/by-id/scsi-SVMware_Virtual_disk_6000c295fd21a00ebf13e859663b32fc
      - config:
          deviceClass: ssd
        name: /dev/disk/by-id/scsi-SVMware_Virtual_disk_6000c29c65a21ff65a047b88b76511ec
      name: dev01
      resources: {}
    - devices:
      - config:
          deviceClass: hdd
        name: /dev/disk/by-id/scsi-SVMware_Virtual_disk_6000c29c1215beb93ca246cc9c581a71
      - config:
          deviceClass: ssd
        name: /dev/disk/by-id/scsi-SVMware_Virtual_disk_6000c29780b978f39f0a05c9e3a2e5c0
      name: dev02
      resources: {}
    - devices:
      - config:
          deviceClass: hdd
        name: /dev/disk/by-id/scsi-SVMware_Virtual_disk_6000c29ad59cc2f161e270430cf7cc91
      - config:
          deviceClass: ssd
        name: /dev/disk/by-id/scsi-SVMware_Virtual_disk_6000c29c21c16ae507a97051610749a9
      name: dev03
      resources: {}
    useAllDevices: false
  waitTimeoutForHealthyOSDInMinutes: 10
status:
  ceph:
    capacity:
      bytesAvailable: 144666574848
      bytesTotal: 257672871936
      bytesUsed: 113006297088
      lastUpdated: "2021-12-02T17:41:03Z"
    health: HEALTH_OK
    lastChanged: "2021-12-02T17:37:56Z"
    lastChecked: "2021-12-02T17:41:03Z"
    previousHealth: HEALTH_WARN
    versions:
      mds:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 2
      mgr:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 1
      mon:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 3
      osd:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 6
      overall:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 12
  conditions:
  - lastHeartbeatTime: "2021-12-02T17:41:04Z"
    lastTransitionTime: "2021-11-08T21:52:27Z"
    message: Cluster created successfully
    reason: ClusterCreated
    status: "True"
    type: Ready
  message: Cluster created successfully
  phase: Ready
  state: Created
  storage:
    deviceClasses:
    - name: hdd
    - name: ssd
  version:
    image: quay.io/ceph/ceph:v16.2.5
    version: 16.2.5-0
```
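To confirm the hdd/ssd device classes above were picked up by the OSDs, the CRUSH tree can be checked from the toolbox; a short sketch with standard Ceph commands:

```shell
# CRUSH hierarchy with each OSD's device class per host
ceph osd tree

# Device classes known to the cluster and the OSDs assigned to each
ceph osd crush class ls
ceph osd crush class ls-osd ssd
ceph osd crush class ls-osd hdd
```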
This is my CephFilesystem custom resource:
```yaml
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  annotations:
    meta.helm.sh/release-name: rook-ceph-cluster
    meta.helm.sh/release-namespace: rook-ceph
  creationTimestamp: "2021-11-08T21:48:34Z"
  finalizers:
  - cephfilesystem.ceph.rook.io
  generation: 4
  labels:
    app.kubernetes.io/managed-by: Helm
  name: ceph-filesystem
  namespace: rook-ceph
  resourceVersion: "20937676"
  uid: e9364f77-4821-4c99-9a83-4238adbfe785
spec:
  dataPools:
  - failureDomain: host
    parameters:
      min_size: "1"
    replicated:
      hybridStorage:
        primaryDeviceClass: ssd
        secondaryDeviceClass: hdd
      size: 2
  metadataPool:
    erasureCoded:
      codingChunks: 0
      dataChunks: 0
    mirroring: {}
    quotas: {}
    replicated:
      size: 3
    statusCheck:
      mirror: {}
  metadataServer:
    activeCount: 1
    activeStandby: true
    placement: {}
    resources: {}
  statusCheck:
    mirror: {}
status:
  phase: Ready
```
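Since hybridStorage makes the operator create a dedicated CRUSH rule for the data pool, dumping that rule shows how the two replicas are split between the ssd and hdd device classes. A sketch; the rule name has to be read from the pool first, and the pool name is the assumed Rook default:

```shell
# Which CRUSH rule the data pool uses (pool name assumed)
ceph osd pool get ceph-filesystem-data0 crush_rule

# Dump the rule's placement steps (replace with the rule name from above)
ceph osd crush rule dump <rule-name>

# Or list all rules if the name is unclear
ceph osd crush rule ls
```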
Environment:
- OS (e.g. from /etc/os-release): NAME="Ubuntu" VERSION="20.04.3 LTS (Focal Fossa)"
- Kernel (e.g. `uname -a`): Linux dev01 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Cloud provider or hardware configuration: kubernetes cluster on rke2
- Rook version (use `rook version` inside of a Rook Pod): rook: v1.7.3
- Storage backend version (e.g. for ceph do `ceph -v`): ceph version 16.2.5
- Kubernetes version (use `kubectl version`): Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.6+rke2r1"}
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): rke2
- Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox): HEALTH_OK
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 16 (8 by maintainers)
Generally I’d advise against using cache tiering - it’s not often going to give the performance benefits you might expect (see https://docs.ceph.com/en/latest/rados/operations/cache-tiering/#known-bad-workloads)
In terms of separating SSD and HDD, assuming you're doing this to benefit from the SSDs for read performance, the best approach today would be to use a standard single-level crush rule, e.g. `step chooseleaf firstn 2 type host`, and use primary affinity, which operates above the crush layer. You can specify that SSDs should be primary, and HDDs should not. For example, you could set primary affinity for all the SSD-based OSDs to 1, and all the HDD-based OSDs to 0.
This isn't a hard restriction - sometimes HDDs will still be primary, especially in failure conditions, but in normal circumstances you'll benefit from having reads come from the SSDs.
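As an illustration of that primary-affinity approach, a minimal sketch; the OSD IDs are placeholders, and which OSDs are SSD- or HDD-backed should be read from `ceph osd tree` first:

```shell
# Keep replicas placed by a plain host-level replicated rule,
# then bias primaries (and therefore reads) towards the SSD-backed OSDs.

# SSD-backed OSDs: preferred as primary (placeholder IDs)
ceph osd primary-affinity osd.1 1.0
ceph osd primary-affinity osd.3 1.0
ceph osd primary-affinity osd.5 1.0

# HDD-backed OSDs: only become primary if no SSD replica is available
ceph osd primary-affinity osd.0 0
ceph osd primary-affinity osd.2 0
ceph osd primary-affinity osd.4 0
```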