rook: Replace or add new OSD device with dmcrypt fails
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior:
Replacing a device on a node with a new device while encryptedDevice: "true" is set fails to create the OSD deployment. Adding a completely new device on a completely new node fails with the same errors.
Expected behavior: The OSD should be created and the device should be usable by Rook.
How to reproduce it (minimal and precise):
- set up the Rook Ceph cluster with 3 devices (one per node), 1 OSD per node, and encryptedDevice: "true"
- remove one device from the cluster CR
- Ceph health goes into HEALTH_WARN because the OSD count is now lower than osd_pool_default_size
- add a new device to the same node
- add the new device to the cluster CR and apply it
- after the first failure, perform a device cleanup as documented here and restart the operator (a sketch of such a cleanup is shown below)
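For reference, a minimal sketch of what such a device cleanup can look like, assuming the device is /dev/sdb and the leftover LVM/dm-crypt mappings follow the ceph-* naming visible in the lsblk output further down; this is not necessarily the exact procedure used here:

# Hypothetical cleanup sketch; adjust DISK to the actual device before running.
DISK="/dev/sdb"

# Remove leftover ceph device-mapper entries and LVM metadata links, if any.
ls /dev/mapper/ceph-* 2>/dev/null | xargs -I% -- dmsetup remove %
rm -rf /dev/ceph-* /dev/mapper/ceph--*

# Wipe the partition table and the start of the disk so the prepare job sees a clean device.
sgdisk --zap-all "$DISK"
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync
wipefs -a "$DISK"

# Let the kernel re-read the now-empty partition table.
partprobe "$DISK"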
File(s) to submit:
- Cluster CR (custom resource), typically called cluster.yaml, if necessary:
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18.2
    allowUnsupported: false
  dataDirHostPath: /var/lib/rook
  skipUpgradeChecks: false
  continueUpgradeAfterChecksEvenIfNotHealthy: true
  waitTimeoutForHealthyOSDInMinutes: 10
  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    count: 2
    allowMultiplePerNode: false
    modules:
      - name: pg_autoscaler
        enabled: true
  dashboard:
    enabled: true
    ssl: true
  monitoring:
    enabled: true
  network:
    connections:
      encryption:
        enabled: false
      compression:
        enabled: false
      requireMsgr2: false
  crashCollector:
    disable: false
    daysToRetain: 30
  logCollector:
    enabled: false
    periodicity: daily
    maxLogSize: 500M
  cleanupPolicy:
    confirmation: ""
    sanitizeDisks:
      method: quick
      dataSource: zero
      iteration: 1
    allowUninstallWithVolumes: false
  annotations:
  labels:
  resources:
    mgr:
      limits:
        cpu: "1000m"
        memory: "1024Mi"
      requests:
        cpu: "600m"
        memory: "320Mi"
    mon:
      limits:
        cpu: "300m"
        memory: "1024Mi"
      requests:
        cpu: "100m"
        memory: "512Mi"
    osd:
      limits:
        cpu: "1000m"
        memory: "12Gi"
      requests:
        cpu: "200m"
        memory: "10Gi"
    crashcollector:
      limits:
        cpu: "100m"
        memory: "100Mi"
      requests:
        cpu: "10m"
        memory: "10Mi"
    prepareosd:
      limits:
        cpu: "1000m"
        memory: "100Mi"
      requests:
        cpu: "10m"
        memory: "10Mi"
  removeOSDsIfOutAndSafeToRemove: true
  priorityClassNames:
    mon: system-node-critical
    osd: system-node-critical
    mgr: system-cluster-critical
  storage:
    useAllNodes: false
    useAllDevices: false
    config:
      encryptedDevice: "true" # the default value for this option is "false"
    nodes:
      - devices:
          - name: sdb
        name: node1
        resources: {}
      - devices:
          - name: sdb
        name: node2
        resources: {}
      - devices:
          - name: sdb
        name: node3
        resources: {}
    onlyApplyOSDPlacement: false
  disruptionManagement:
    managePodBudgets: true
    osdMaintenanceTimeout: 30
    pgHealthCheckTimeout: 0
  healthCheck:
    daemonHealth:
      mon:
        disabled: false
        interval: 45s
      osd:
        disabled: false
        interval: 60s
      status:
        disabled: false
        interval: 60s
    livenessProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
    startupProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
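For the "add the new device to the cluster CR and apply" step above, the change is limited to spec.storage.nodes. A minimal sketch, assuming the replacement disk on node1 shows up as sdc (the device name is an assumption for illustration, not taken from this cluster):

  storage:
    nodes:
      - devices:
          - name: sdc   # new device (assumed name), replaces the removed sdb
        name: node1
        resources: {}
      # node2 and node3 keep their existing sdb entries unchanged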
Logs to submit:
- Operator's logs, if necessary: pod-rook-ceph-operator.log
- Crashing pod(s) logs, if necessary
- Logs from the rook-ceph-osd-0 activate container that fails with a permission error: pod-rook-ceph-osd-0-1.log
- The mon pods log the following in response to the above error:
  debug 2023-09-23T08:35:01.928+0000 7fe32269d700 0 cephx server client.osd-lockbox.b9b2ee8f-d9fc-482b-ac0b-88afd3f02e98: couldn't find entity name: client.osd-lockbox.b9b2ee8f-d9fc-482b-ac0b-88afd3f02e98
- After manually adding an auth key for client.osd-lockbox.b9b2ee8f-d9fc-482b-ac0b-88afd3f02e98 with the toolbox pod (see the sketch after this list) and adjusting the key in the file /var/lib/ceph/osd/ceph-0/lockbox.keyring of the rook-ceph-osd-0 activate container, I get this error: pod-rook-ceph-osd-0-2.log
- Logs from the prepare pod: pod-rook-ceph-node-prepare.log
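For reference, a minimal sketch of how the missing lockbox entity can be inspected from the toolbox pod. The FSID is the one from the mon log above; the config-key path follows the usual ceph-volume dm-crypt convention and is an assumption, not verified on this cluster:

# Check whether the lockbox auth entity exists at all (this is what the mon reports as missing).
ceph auth get client.osd-lockbox.b9b2ee8f-d9fc-482b-ac0b-88afd3f02e98

# List dm-crypt related entries in the config-key store; ceph-volume normally keeps the
# LUKS passphrase under dm-crypt/osd/<OSD_FSID>/luks (assumed layout).
ceph config-key ls | grep dm-crypt

# Compare against a healthy OSD's lockbox entity to see the expected caps.
ceph auth ls | grep -A 3 osd-lockbox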
Cluster Status to submit:
- from toolbox pod
bash-4.4$ ceph status
  cluster:
    id:     4aa035b9-ef3e-4f74-bca7-e296981022cb
    health: HEALTH_WARN
            Degraded data redundancy: 5/2001671 objects degraded (0.000%), 1 pg degraded, 1 pg undersized
            OSD count 2 < osd_pool_default_size 3

  services:
    mon: 3 daemons, quorum d,j,k (age 13h)
    mgr: a(active, since 68m), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 2 osds: 2 up (since 2d), 2 in (since 26h)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 1.00M objects, 43 GiB
    usage:   105 GiB used, 95 GiB / 200 GiB avail
    pgs:     5/2001671 objects degraded (0.000%)
             96 active+clean
             1 active+undersized+degraded

  io:
    client: 684 KiB/s rd, 2.0 MiB/s wr, 4 op/s rd, 55 op/s wr
bash-4.4$ ceph health detail
HEALTH_WARN Degraded data redundancy: 5/2002449 objects degraded (0.000%), 1 pg degraded, 1 pg undersized; OSD count 2 < osd_pool_default_size 3
[WRN] PG_DEGRADED: Degraded data redundancy: 5/2002449 objects degraded (0.000%), 1 pg degraded, 1 pg undersized
pg 1.0 is stuck undersized for 27h, current state active+undersized+degraded, last acting [1,2]
[WRN] TOO_FEW_OSDS: OSD count 2 < osd_pool_default_size 3
bash-4.4$
- pod status (kubectl get pod -n rook-ceph):
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-4h7nz 2/2 Running 7 43d
csi-cephfsplugin-52rc9 2/2 Running 2 26h
csi-cephfsplugin-57vvn 2/2 Running 9 43d
csi-cephfsplugin-provisioner-86788ff996-bvkzw 5/5 Running 0 6d22h
csi-cephfsplugin-provisioner-86788ff996-v9nqd 5/5 Running 0 2d
csi-cephfsplugin-zkk2v 2/2 Running 13 43d
csi-rbdplugin-bhgs6 2/2 Running 2 26h
csi-rbdplugin-fb7ct 2/2 Running 8 43d
csi-rbdplugin-lh6s5 2/2 Running 9 43d
csi-rbdplugin-m95xs 2/2 Running 13 43d
csi-rbdplugin-provisioner-7b5494c7fd-ts4wz 5/5 Running 0 6d22h
csi-rbdplugin-provisioner-7b5494c7fd-vgtf5 5/5 Running 0 6d22h
rook-ceph-crashcollector-node1-7745945cfc-5bgnh 1/1 Running 0 6d22h
rook-ceph-crashcollector-node2-767685cbb4-dw6lc 1/1 Running 0 13h
rook-ceph-crashcollector-node3-67b9b8d77f-w45ch 1/1 Running 0 13h
rook-ceph-crashcollector-node4-778768bcf4-q6qcs 1/1 Running 0 6d22h
rook-ceph-crashcollector-pruner-28254240-npf9n 0/1 Completed 0 2d9h
rook-ceph-crashcollector-pruner-28257120-mrqsb 0/1 Completed 0 9h
rook-ceph-exporter-node1-7d5654c7cd-t82kg 1/1 Running 0 6d22h
rook-ceph-exporter-node2-666b6768cf-xv2kj 1/1 Running 0 13h
rook-ceph-exporter-node3-856fb8f488-h5w7h 1/1 Running 0 13h
rook-ceph-exporter-node4-d88975f45-q8ctz 1/1 Running 0 6d22h
rook-ceph-mds-myfs-a-5d7d795dcf-5wzvt 1/1 Running 0 6d22h
rook-ceph-mds-myfs-b-5c68b69d49-frz76 1/1 Running 0 13h
rook-ceph-mgr-a-66d569dd66-mfhfs 2/2 Running 0 6d22h
rook-ceph-mgr-b-7c4649b9b8-wgqdc 2/2 Running 0 13h
rook-ceph-mon-d-6df45468c8-bq8nz 1/1 Running 0 6d22h
rook-ceph-mon-j-86c7698c85-9vb8v 1/1 Running 0 26h
rook-ceph-mon-k-79fdb8b95d-qv67p 1/1 Running 0 6d22h
rook-ceph-osd-0-746b596d65-shjv8 0/1 Init:CrashLoopBackOff 8 (3m13s ago) 37m
rook-ceph-osd-1-7bf949c575-5dj54 1/1 Running 0 6d22h
rook-ceph-osd-2-54ccb6fc47-skc2s 1/1 Running 0 6d22h
rook-ceph-osd-prepare-node1-gmp84 0/1 Completed 0 73m
rook-ceph-osd-prepare-node2-28ptj 0/1 Completed 0 73m
rook-ceph-osd-prepare-node3-kzg56 0/1 Completed 0 73m
rook-ceph-tools-768c997484-c697l 1/1 Running 0 100m
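The rook-ceph-osd-0 pod above is stuck in its init phase. A minimal sketch of how the activate init container's logs and events can be pulled; the pod name is taken from the listing above, and the container name activate is the one referenced earlier in this report:

# Show which init container is crash-looping and the related events.
kubectl -n rook-ceph describe pod rook-ceph-osd-0-746b596d65-shjv8

# Logs of the activate init container (the one failing with the permission/cephx error).
kubectl -n rook-ceph logs rook-ceph-osd-0-746b596d65-shjv8 -c activate

# Logs of the previous attempt, if the container has already restarted.
kubectl -n rook-ceph logs rook-ceph-osd-0-746b596d65-shjv8 -c activate --previous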
Environment:
- OS (e.g. from /etc/os-release): Ubuntu 20.04.6 LTS
- Kernel (e.g. uname -a): 5.4.0-162-generic 179-Ubuntu
- Cloud provider or hardware configuration: Hetzner Cloud VM
- Rook version (use rook version inside of a Rook Pod): rook: v1.12.4, go: go1.21.1
- Storage backend version (e.g. for ceph do ceph -v): ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
- Kubernetes version (use kubectl version): Client Version: v1.28.2, Server Version: v1.28.2
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): bare-metal setup with kubeadm
- Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_WARN Degraded data redundancy: 5/2009713 objects degraded (0.000%), 1 pg degraded, 1 pg undersized; OSD count 2 < osd_pool_default_size 3
- lsblk:
sdb 8:16 0 200G 0 disk
`-ceph--7a14e7f1--cfda--4938--bf3d--2fda3d889325-osd--block--b9b2ee8f--d9fc--482b--ac0b--88afd3f02e98 253:0 0 200G 0 lvm
- dmsetup info
Name: ceph--7a14e7f1--cfda--4938--bf3d--2fda3d889325-osd--block--b9b2ee8f--d9fc--482b--ac0b--88afd3f02e98
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 0
Event number: 0
Major, minor: 253, 0
Number of targets: 1
UUID: LVM-4NYf3DfACfiq4zF9853msg0dVDqxFJ4seTe4RNaTQ1s6R3gdfsvZgH1VPWu238Ty
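To confirm which OSD the leftover logical volume belongs to, the LVM tags on the host can be checked. A minimal sketch using read-only commands, with the mapper name taken from the dmsetup output above:

# Show ceph-volume's LVM tags (ceph.osd_fsid, ceph.osd_id, ceph.encrypted, ...) for the leftover LV.
sudo lvs -o lv_name,vg_name,lv_tags --noheadings

# Cross-check the active device-mapper entries.
sudo dmsetup ls
sudo dmsetup info ceph--7a14e7f1--cfda--4938--bf3d--2fda3d889325-osd--block--b9b2ee8f--d9fc--482b--ac0b--88afd3f02e98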
About this issue
- State: closed
- Created 9 months ago
- Comments: 18 (11 by maintainers)
The host filesystem is not full. Here is the output from df:
I have also mounted a completely new volume (sdc), with no filesystem and no partition on it, to the node. The problem described above stays the same.
Do you have any suggestions on what I should check on the host system to identify the problem?
No, we don't use this feature.
The same error occurs. The operator does all of its reconciling, but the activate container of osd.0 is again writing the same error.
Attached are the operator logs: pod-rook-ceph-operator-2.log