rook: Replace or add new OSD device with dmcrypt fails
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior:
Replacing a device on a node with a new device while encryptedDevice: "true" is set fails to create the OSD deployment. Adding a completely new device on a completely new node fails with the same errors.
Expected behavior: The OSD should be created and the device should be usable by Rook.
How to reproduce it (minimal and precise):
- set up the Rook Ceph cluster with 3 devices (one per node), 1 OSD per node, and encryptedDevice: "true"
- remove one device from the cluster CR
- Ceph health goes into HEALTH_WARN because the OSD count is now lower than osd_pool_default_size
- add a new device to the same node
- add the new device to the cluster CR and apply it
- after the first failure, perform a device cleanup as documented here and restart the operator (a sketch of such a cleanup is shown below)
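For reference, a minimal sketch of what such a device cleanup can look like, assuming the device is /dev/sdb and the leftover LVM/dm-crypt mappings follow the ceph-* naming visible in the lsblk output further down; this is not necessarily the exact procedure used here:

# Hypothetical cleanup sketch; adjust DISK to the actual device before running.
DISK="/dev/sdb"

# Remove leftover ceph device-mapper entries and LVM metadata links, if any.
ls /dev/mapper/ceph-* 2>/dev/null | xargs -I% -- dmsetup remove %
rm -rf /dev/ceph-* /dev/mapper/ceph--*

# Wipe the partition table and the start of the disk so the prepare job sees a clean device.
sgdisk --zap-all "$DISK"
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync
wipefs -a "$DISK"

# Let the kernel re-read the now-empty partition table.
partprobe "$DISK"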
File(s) to submit:
- Cluster CR (custom resource), typically called cluster.yaml, if necessary:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18.2
    allowUnsupported: false
  dataDirHostPath: /var/lib/rook
  skipUpgradeChecks: false
  continueUpgradeAfterChecksEvenIfNotHealthy: true
  waitTimeoutForHealthyOSDInMinutes: 10
  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    count: 2
    allowMultiplePerNode: false
    modules:
      - name: pg_autoscaler
        enabled: true
  dashboard:
    enabled: true
    ssl: true
  monitoring:
    enabled: true
  network:
    connections:
      encryption:
        enabled: false
      compression:
        enabled: false
      requireMsgr2: false
  crashCollector:
    disable: false
    daysToRetain: 30
  logCollector:
    enabled: false
    periodicity: daily
    maxLogSize: 500M
  cleanupPolicy:
    confirmation: ""
    sanitizeDisks:
      method: quick
      dataSource: zero
      iteration: 1
    allowUninstallWithVolumes: false
  annotations:
  labels:
  resources:
    mgr:
      limits:
        cpu: "1000m"
        memory: "1024Mi"
      requests:
        cpu: "600m"
        memory: "320Mi"
    mon:
      limits:
        cpu: "300m"
        memory: "1024Mi"
      requests:
        cpu: "100m"
        memory: "512Mi"
    osd:
      limits:
        cpu: "1000m"
        memory: "12Gi"
      requests:
        cpu: "200m"
        memory: "10Gi"
    crashcollector:
      limits:
        cpu: "100m"
        memory: "100Mi"
      requests:
        cpu: "10m"
        memory: "10Mi"
    prepareosd:
      limits:
        cpu: "1000m"
        memory: "100Mi"
      requests:
        cpu: "10m"
        memory: "10Mi"
  removeOSDsIfOutAndSafeToRemove: true
  priorityClassNames:
    mon: system-node-critical
    osd: system-node-critical
    mgr: system-cluster-critical
  storage:
    useAllNodes: false
    useAllDevices: false
    config:
      encryptedDevice: "true" # the default value for this option is "false"
    nodes:
      - devices:
          - name: sdb
        name: node1
        resources: {}
      - devices:
          - name: sdb
        name: node2
        resources: {}
      - devices:
          - name: sdb
        name: node3
        resources: {}
    onlyApplyOSDPlacement: false
  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30
    pgHealthCheckTimeout: 0
  healthCheck:
    daemonHealth:
      mon:
        disabled: false
        interval: 45s
      osd:
        disabled: false
        interval: 60s
      status:
        disabled: false
        interval: 60s
    livenessProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
    startupProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
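For the "add the new device to the cluster CR and apply" step above, the change is limited to spec.storage.nodes. A minimal sketch, assuming the replacement disk on node1 shows up as sdc (the device name is an assumption for illustration, not taken from this cluster):

  storage:
    nodes:
      - devices:
          - name: sdc   # new device (assumed name), replaces the removed sdb
        name: node1
        resources: {}
      # node2 and node3 keep their existing sdb entries unchanged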
Logs to submit:
- Operator's logs, if necessary: pod-rook-ceph-operator.log
- Crashing pod(s) logs, if necessary
- Logs from the rook-ceph-osd-0 activate container that fails with a permission error: pod-rook-ceph-osd-0-1.log
- The mon pods log the following in response to the above error:
  debug 2023-09-23T08:35:01.928+0000 7fe32269d700 0 cephx server client.osd-lockbox.b9b2ee8f-d9fc-482b-ac0b-88afd3f02e98: couldn't find entity name: client.osd-lockbox.b9b2ee8f-d9fc-482b-ac0b-88afd3f02e98
- After manually adding an auth key for client.osd-lockbox.b9b2ee8f-d9fc-482b-ac0b-88afd3f02e98 with the toolbox pod (see the sketch after this list) and adjusting the key in the file /var/lib/ceph/osd/ceph-0/lockbox.keyring of the rook-ceph-osd-0 activate container, I get this error: pod-rook-ceph-osd-0-2.log
- Logs from the prepare pod: pod-rook-ceph-node-prepare.log
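For reference, a minimal sketch of how the missing lockbox entity can be inspected from the toolbox pod. The FSID is the one from the mon log above; the config-key path follows the usual ceph-volume dm-crypt convention and is an assumption, not verified on this cluster:

# Check whether the lockbox auth entity exists at all (this is what the mon reports as missing).
ceph auth get client.osd-lockbox.b9b2ee8f-d9fc-482b-ac0b-88afd3f02e98

# List dm-crypt related entries in the config-key store; ceph-volume normally keeps the
# LUKS passphrase under dm-crypt/osd/<OSD_FSID>/luks (assumed layout).
ceph config-key ls | grep dm-crypt

# Compare against a healthy OSD's lockbox entity to see the expected caps.
ceph auth ls | grep -A 3 osd-lockbox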
Cluster Status to submit:
- from toolbox pod
bash-4.4$ ceph status
  cluster:
    id:     4aa035b9-ef3e-4f74-bca7-e296981022cb
    health: HEALTH_WARN
            Degraded data redundancy: 5/2001671 objects degraded (0.000%), 1 pg degraded, 1 pg undersized
            OSD count 2 < osd_pool_default_size 3

  services:
    mon: 3 daemons, quorum d,j,k (age 13h)
    mgr: a(active, since 68m), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 2 osds: 2 up (since 2d), 2 in (since 26h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 1.00M objects, 43 GiB
    usage:   105 GiB used, 95 GiB / 200 GiB avail
    pgs:     5/2001671 objects degraded (0.000%)
             96 active+clean
             1 active+undersized+degraded

  io:
    client: 684 KiB/s rd, 2.0 MiB/s wr, 4 op/s rd, 55 op/s wr
bash-4.4$ ceph health detail
HEALTH_WARN Degraded data redundancy: 5/2002449 objects degraded (0.000%), 1 pg degraded, 1 pg undersized; OSD count 2 < osd_pool_default_size 3
[WRN] PG_DEGRADED: Degraded data redundancy: 5/2002449 objects degraded (0.000%), 1 pg degraded, 1 pg undersized
pg 1.0 is stuck undersized for 27h, current state active+undersized+degraded, last acting [1,2]
[WRN] TOO_FEW_OSDS: OSD count 2 < osd_pool_default_size 3
bash-4.4$
- pod status (kubectl get pod -n rook-ceph):
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-4h7nz 2/2 Running 7 43d
csi-cephfsplugin-52rc9 2/2 Running 2 26h
csi-cephfsplugin-57vvn 2/2 Running 9 43d
csi-cephfsplugin-provisioner-86788ff996-bvkzw 5/5 Running 0 6d22h
csi-cephfsplugin-provisioner-86788ff996-v9nqd 5/5 Running 0 2d
csi-cephfsplugin-zkk2v 2/2 Running 13 43d
csi-rbdplugin-bhgs6 2/2 Running 2 26h
csi-rbdplugin-fb7ct 2/2 Running 8 43d
csi-rbdplugin-lh6s5 2/2 Running 9 43d
csi-rbdplugin-m95xs 2/2 Running 13 43d
csi-rbdplugin-provisioner-7b5494c7fd-ts4wz 5/5 Running 0 6d22h
csi-rbdplugin-provisioner-7b5494c7fd-vgtf5 5/5 Running 0 6d22h
rook-ceph-crashcollector-node1-7745945cfc-5bgnh 1/1 Running 0 6d22h
rook-ceph-crashcollector-node2-767685cbb4-dw6lc 1/1 Running 0 13h
rook-ceph-crashcollector-node3-67b9b8d77f-w45ch 1/1 Running 0 13h
rook-ceph-crashcollector-node4-778768bcf4-q6qcs 1/1 Running 0 6d22h
rook-ceph-crashcollector-pruner-28254240-npf9n 0/1 Completed 0 2d9h
rook-ceph-crashcollector-pruner-28257120-mrqsb 0/1 Completed 0 9h
rook-ceph-exporter-node1-7d5654c7cd-t82kg 1/1 Running 0 6d22h
rook-ceph-exporter-node2-666b6768cf-xv2kj 1/1 Running 0 13h
rook-ceph-exporter-node3-856fb8f488-h5w7h 1/1 Running 0 13h
rook-ceph-exporter-node4-d88975f45-q8ctz 1/1 Running 0 6d22h
rook-ceph-mds-myfs-a-5d7d795dcf-5wzvt 1/1 Running 0 6d22h
rook-ceph-mds-myfs-b-5c68b69d49-frz76 1/1 Running 0 13h
rook-ceph-mgr-a-66d569dd66-mfhfs 2/2 Running 0 6d22h
rook-ceph-mgr-b-7c4649b9b8-wgqdc 2/2 Running 0 13h
rook-ceph-mon-d-6df45468c8-bq8nz 1/1 Running 0 6d22h
rook-ceph-mon-j-86c7698c85-9vb8v 1/1 Running 0 26h
rook-ceph-mon-k-79fdb8b95d-qv67p 1/1 Running 0 6d22h
rook-ceph-osd-0-746b596d65-shjv8 0/1 Init:CrashLoopBackOff 8 (3m13s ago) 37m
rook-ceph-osd-1-7bf949c575-5dj54 1/1 Running 0 6d22h
rook-ceph-osd-2-54ccb6fc47-skc2s 1/1 Running 0 6d22h
rook-ceph-osd-prepare-node1-gmp84 0/1 Completed 0 73m
rook-ceph-osd-prepare-node2-28ptj 0/1 Completed 0 73m
rook-ceph-osd-prepare-node3-kzg56 0/1 Completed 0 73m
rook-ceph-tools-768c997484-c697l 1/1 Running 0 100m
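The rook-ceph-osd-0 pod above is stuck in its init phase. A minimal sketch of how the activate init container's logs and events can be pulled; the pod name is taken from the listing above, and the container name activate is the one referenced earlier in this report:

# Show which init container is crash-looping and the related events.
kubectl -n rook-ceph describe pod rook-ceph-osd-0-746b596d65-shjv8

# Logs of the activate init container (the one failing with the permission/cephx error).
kubectl -n rook-ceph logs rook-ceph-osd-0-746b596d65-shjv8 -c activate

# Logs of the previous attempt, if the container has already restarted.
kubectl -n rook-ceph logs rook-ceph-osd-0-746b596d65-shjv8 -c activate --previous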
Environment:
- OS (e.g. from /etc/os-release): Ubuntu 20.04.6 LTS
- Kernel (e.g. uname -a): 5.4.0-162-generic 179-Ubuntu
- Cloud provider or hardware configuration: Hetzner Cloud VM
- Rook version (use rook version inside of a Rook Pod): rook: v1.12.4, go: go1.21.1
- Storage backend version (e.g. for ceph do ceph -v): ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
- Kubernetes version (use kubectl version): Client Version: v1.28.2, Server Version: v1.28.2
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): bare-metal setup with kubeadm
- Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_WARN Degraded data redundancy: 5/2009713 objects degraded (0.000%), 1 pg degraded, 1 pg undersized; OSD count 2 < osd_pool_default_size 3
- lsblk:
sdb 8:16 0 200G 0 disk
`-ceph--7a14e7f1--cfda--4938--bf3d--2fda3d889325-osd--block--b9b2ee8f--d9fc--482b--ac0b--88afd3f02e98 253:0 0 200G 0 lvm
- dmsetup info
Name: ceph--7a14e7f1--cfda--4938--bf3d--2fda3d889325-osd--block--b9b2ee8f--d9fc--482b--ac0b--88afd3f02e98
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 0
Event number: 0
Major, minor: 253, 0
Number of targets: 1
UUID: LVM-4NYf3DfACfiq4zF9853msg0dVDqxFJ4seTe4RNaTQ1s6R3gdfsvZgH1VPWu238Ty
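To confirm which OSD the leftover logical volume belongs to, the LVM tags on the host can be checked. A minimal sketch using read-only commands, with the mapper name taken from the dmsetup output above:

# Show ceph-volume's LVM tags (ceph.osd_fsid, ceph.osd_id, ceph.encrypted, ...) for the leftover LV.
sudo lvs -o lv_name,vg_name,lv_tags --noheadings

# Cross-check the active device-mapper entries.
sudo dmsetup ls
sudo dmsetup info ceph--7a14e7f1--cfda--4938--bf3d--2fda3d889325-osd--block--b9b2ee8f--d9fc--482b--ac0b--88afd3f02e98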
About this issue
- State: closed
- Created 9 months ago
- Comments: 18 (11 by maintainers)
The host filesystem is not full. Here is the output from df:
I have also mounted a completely new volume (sdc), with no filesystem and no partition on it, to the node. The problem described above stays the same.
Do you have any suggestions on what I should check on the host system to identify the problem?
No, we don't use this feature.
The same error occurs. The operator does all of its reconciling, but the activate container of osd.0 is again writing the same error.
Attached are the operator logs: pod-rook-ceph-operator-2.log