rook: OSD prepare job fails with KeyError: 'KNAME'

I removed a broken OSD following the OSD removal guide, then deleted the corresponding block-mode PV, PVC, and OSD Deployment, closed the OSD’s LUKS volume with cryptsetup close, and wiped the disk with wipefs -a. Afterwards I recreated the PV and expected Rook to recreate the OSD automatically. However, the OSD prepare job fails, and its log contains the following backtrace:

[2022-11-13 18:27:04,778][ceph_volume.devices.raw.prepare][ERROR ] raw prepare was unable to complete
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 91, in safe_prepare
    self.prepare()
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 134, in prepare
    tmpfs,
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 51, in prepare_bluestore
    block = prepare_dmcrypt(key, block, 'block', fsid)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 23, in prepare_dmcrypt
    kname = disk.lsblk(device)['KNAME']
KeyError: 'KNAME'
[2022-11-13 18:27:04,780][ceph_volume.devices.raw.prepare][INFO  ] will rollback OSD ID creation
[2022-11-13 18:27:04,781][ceph_volume.process][INFO  ] Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.4 --yes-i-really-mean-it
[2022-11-13 18:27:05,553][ceph_volume.process][INFO  ] stderr purged osd.4
[2022-11-13 18:27:05,571][ceph_volume.process][INFO  ] Running command: /usr/bin/systemctl is-active ceph-osd@4
[2022-11-13 18:27:05,584][ceph_volume.process][INFO  ] stderr System has not been booted with systemd as init system (PID 1). Can't operate.
[2022-11-13 18:27:05,585][ceph_volume.process][INFO  ] stderr Failed to connect to bus: Host is down
[2022-11-13 18:27:05,589][ceph_volume.util.system][WARNING] Executable lvs not found on the host, will return lvs as-is
[2022-11-13 18:27:05,590][ceph_volume.process][INFO  ] Running command: lvs --noheadings --readonly --separator=";" -a --units=b --nosuffix -S tags={ceph.osd_id=4} -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
[2022-11-13 18:27:05,969][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 91, in safe_prepare
    self.prepare()
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 134, in prepare
    tmpfs,
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 51, in prepare_bluestore
    block = prepare_dmcrypt(key, block, 'block', fsid)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 23, in prepare_dmcrypt
    kname = disk.lsblk(device)['KNAME']
KeyError: 'KNAME'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 169, in main
    self.safe_prepare(self.args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 95, in safe_prepare
    rollback_osd(self.args, self.osd_id)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/common.py", line 35, in rollback_osd
    Zap(['--destroy', '--osd-id', osd_id]).main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", line 404, in main
    self.zap_osd()
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", line 301, in zap_osd
    devices = find_associated_devices(self.args.osd_id, self.args.osd_fsid)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", line 88, in find_associated_devices
    '%s' % osd_id or osd_fsid)
RuntimeError: Unable to find any LV for zapping OSD: 4

The full log is available in the GitHub Gist linked below.
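
The immediate trigger is visible in the first frames of the backtrace: prepare_dmcrypt indexes the lsblk result without checking whether it is empty. A minimal stand-alone illustration of that pattern (this is not ceph-volume’s code beyond the call shown; /mnt/block stands in for the mapped raw device from the PVC, and the empty result matches what the comments below demonstrate for v17.2.4+):

>>> from ceph_volume.util import disk
>>> info = disk.lsblk("/mnt/block")   # comes back empty on the affected images
>>> info
{}
>>> info['KNAME']                     # same pattern as prepare.py line 23 above
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'KNAME'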

Is this a bug report or feature request? Bug Report

Deviation from expected behavior: OSD preparation fails; the OSD is not recreated.

File(s) to submit: Cluster CR: https://gist.github.com/haslersn/57251739d58ee88dd643237cc847e16e#file-cluster-yaml

Logs to submit: Crashing rook-ceph-osd-prepare pod logs: https://gist.github.com/haslersn/57251739d58ee88dd643237cc847e16e#file-rook-ceph-osd-prepare-ssd-sata-0-data-8mqrmn

Cluster Status to submit:

$ ceph status
  cluster:
    id:     e6e99116-5ed6-4b09-b6cd-47b989beb3dd
    health: HEALTH_WARN
            342 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum a,e,f (age 49m)
    mgr: a(active, since 4h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 19 osds: 19 up (since 49m), 19 in (since 49m)
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 97 pgs
    objects: 14.12k objects, 2.7 GiB
    usage:   8.1 GiB used, 25 TiB / 25 TiB avail
    pgs:     97 active+clean
 
  io:
    client:   2.7 KiB/s rd, 1.2 KiB/s wr, 2 op/s rd, 0 op/s wr

(I think the “342 daemons have recently crashed” warning relates to the former OSD that I purged from the cluster: its Pod repeatedly crashed immediately after startup until it was throttled by CrashLoopBackOff. The other 19 OSDs are not affected.)

Environment:

  • OS: Debian GNU/Linux 11 (bullseye)
  • Kernel: 5.10.0-15-amd64, Debian 5.10.120-1 (2022-06-09), x86_64 GNU/Linux
  • Cloud provider or hardware configuration: bare metal
  • Rook version: v1.10.5
  • Storage backend version: 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
  • Kubernetes version: v1.21.7
  • Kubernetes cluster type: kubeadm

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 19 (11 by maintainers)


Most upvoted comments

I think I found the underlying issue:

In Ceph v17.2.3 the following worked:

root@kone03:~# ctr run --rm -t quay.io/ceph/ceph:v17.2.3 bash
[root@kone03 /]# mknod /mnt/block b 8 32
[root@kone03 /]# python3
Python 3.6.8 (default, Jun 23 2022, 19:01:59) 
[GCC 8.5.0 20210514 (Red Hat 8.5.0-13)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from ceph_volume.util import disk
>>> disk.lsblk("/mnt/block")
{'NAME': 'sdc', 'KNAME': 'sdc', 'MAJ:MIN': '8:32', 'FSTYPE': '', 'MOUNTPOINT': '', 'LABEL': '', 'UUID': '', 'RO': '0', 'RM': '0', 'MODEL': 'SanDisk SDSSDH3', 'SIZE': '1.8T', 'STATE': 'running', 'OWNER': '', 'GROUP': '', 'MODE': '', 'ALIGNMENT': '0', 'PHY-SEC': '512', 'LOG-SEC': '512', 'ROTA': '0', 'SCHED': 'mq-deadline', 'TYPE': 'disk', 'DISC-ALN': '0', 'DISC-GRAN': '512B', 'DISC-MAX': '2G', 'DISC-ZERO': '0', 'PKNAME': '', 'PARTLABEL': ''}

But in Ceph v17.2.4 and later, I get an empty result:

root@kone03:~# ctr run --rm -t quay.io/ceph/ceph:v17.2.4 bash
[root@kone03 /]# mknod /mnt/block b 8 32
[root@kone03 /]# python3
Python 3.6.8 (default, Jun 23 2022, 19:01:59) 
[GCC 8.5.0 20210514 (Red Hat 8.5.0-13)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from ceph_volume.util import disk
>>> disk.lsblk("/mnt/block")
{}

But it does work when specifying the udev-created block device file:

>>> disk.lsblk("/dev/sdc")
{'NAME': 'sdc', 'KNAME': 'sdc', 'PKNAME': '', 'MAJ:MIN': '8:32', 'FSTYPE': '', 'MOUNTPOINT': '', 'LABEL': '', 'UUID': '', 'RO': '0', 'RM': '0', 'MODEL': 'SanDisk SDSSDH3', 'SIZE': '1.8T', 'STATE': 'running', 'OWNER': '', 'GROUP': '', 'MODE': '', 'ALIGNMENT': '0', 'PHY-SEC': '512', 'LOG-SEC': '512', 'ROTA': '0', 'SCHED': 'mq-deadline', 'TYPE': 'disk', 'DISC-ALN': '0', 'DISC-GRAN': '512B', 'DISC-MAX': '2G', 'DISC-ZERO': '0', 'PARTLABEL': ''}
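
For reference, the kernel device name that prepare_dmcrypt is after can also be derived from the device node itself via its major:minor numbers. The session below is only a sketch illustrating the information lsblk no longer reports for the /mnt/block path (it assumes /sys is available in the container); it is not the upstream fix:

>>> import os
>>> st = os.stat("/mnt/block")
>>> major, minor = os.major(st.st_rdev), os.minor(st.st_rdev)
>>> (major, minor)
(8, 32)
>>> # /sys/dev/block/<maj>:<min> is a symlink whose target ends in the kernel name
>>> os.path.basename(os.path.realpath("/sys/dev/block/%d:%d" % (major, minor)))
'sdc'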

The fix is merged but not released.

Yes, but what I’m saying is that it’s merged in ‘main’ but not in quincy/pacific.

@guits Could you take a look?

sure