rook: Ceph v17.2.7 fails to create or start OSDs configured on PVs or LVs

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: Ceph v17.2.7 is failing to start OSDs in certain configurations.

Expected behavior: All OSD configurations should be created and upgraded successfully.

How to reproduce it (minimal and precise):

Create a v17.2.7 cluster or upgrade to Ceph v17.2.7 with OSDs on LVs or PVs.

The symptom is that OSD creation fails, as seen in the OSD prepare log:

[2023-10-31 17:29:08,090][ceph_volume.devices.raw.list][DEBUG ] inspecting devices: ['/mnt/set1-data-0bv5xb']
[2023-10-31 17:29:08,090][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 166, in main
    self.list(args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 122, in list
    report = self.generate(args.device)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 91, in generate
    info_device = [info for info in info_devices if info['NAME'] == dev][0]
IndexError: list index out of range
2023-10-31 17:29:08.150895 C | rookcmd: failed to configure devices: failed to get device already provisioned by ceph-volume raw: failed to retrieve ceph-volume raw list results: failed ceph-volume call (see ceph-volume log above for more details): exit status 1
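For context, the IndexError in the traceback comes from an unguarded `[0]` index in `ceph_volume/devices/raw/list.py`: when the queried device path (e.g. an LV mapper path) does not appear under `NAME` in the parsed `lsblk` output, the list comprehension is empty and indexing it crashes. A minimal, illustrative sketch of the failure mode (the function names and sample data below are made up for the illustration, only the lookup pattern is from the traceback):

```python
# info_devices mimics parsed `lsblk` output. When the queried device path
# (e.g. an LV mapper path) has no matching NAME entry, the comprehension
# is empty and [0] raises IndexError, as in the traceback above.

def lookup_unguarded(info_devices, dev):
    # Pattern from the traceback: crashes when no entry matches.
    return [info for info in info_devices if info['NAME'] == dev][0]

def lookup_guarded(info_devices, dev):
    # Defensive variant: return None instead of crashing.
    return next((info for info in info_devices if info['NAME'] == dev), None)

info_devices = [{'NAME': '/dev/sda'}]  # no entry for the mapper path

try:
    lookup_unguarded(info_devices, '/dev/mapper/vg_rook-lv_rook3')
except IndexError:
    print('IndexError: list index out of range')  # matches the log above

print(lookup_guarded(info_devices, '/dev/mapper/vg_rook-lv_rook3'))  # None
```

This only illustrates why the crash happens; the actual fix in Ceph is the revert discussed in the comments below, not simply guarding the index.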

Alternatively, when an existing LV-based OSD starts, the issue appears in the activate init container, where the disk is not found:

activate + OSD_ID=0
activate + CEPH_FSID=dda6c867-4047-4fb9-a744-1575da352c5d
activate + OSD_UUID=447c4553-d301-457a-b48d-69c1a6afe74a
activate + OSD_STORE_FLAG=--bluestore
activate + OSD_DATA_DIR=/var/lib/ceph/osd/ceph-0
activate + CV_MODE=raw
activate + DEVICE=/dev/mapper/vg_rook-lv_rook3
activate + cp --no-preserve=mode /etc/temp-ceph/ceph.conf /etc/ceph/ceph.conf
activate + python3 -c '
activate import configparser
activate 
activate config = configparser.ConfigParser()
activate config.read('\''/etc/ceph/ceph.conf'\'')
activate 
activate if not config.has_section('\''global'\''):
activate     config['\''global'\''] = {}
activate 
activate if not config.has_option('\''global'\'','\''fsid'\''):
activate     config['\''global'\'']['\''fsid'\''] = '\''dda6c867-4047-4fb9-a744-1575da352c5d'\''
activate 
activate with open('\''/etc/ceph/ceph.conf'\'', '\''w'\'') as configfile:
activate     config.write(configfile)
activate '
activate + ceph -n client.admin auth get-or-create osd.0 mon 'allow profile osd' mgr 'allow profile osd' osd 'allow *' -k /etc/ceph/admin-keyring-store/keyring
activate [osd.0]
activate     key = AQBCJEJlDLPvGxAAfuSxVKFq6S/sR/EzuGvbcg==
activate + [[ raw == \l\v\m ]]
activate ++ mktemp
activate + OSD_LIST=/tmp/tmp.RnodTkKWVr
activate + ceph-volume raw list /dev/mapper/vg_rook-lv_rook3
activate + cat /tmp/tmp.RnodTkKWVr
activate {}
activate + find_device
activate + python3 -c '
activate import sys, json
activate for _, info in json.load(sys.stdin).items():
activate     if info['\''osd_id'\''] == 0:
activate         print(info['\''device'\''], end='\'''\'')
activate         print('\''found device: '\'' + info['\''device'\''], file=sys.stderr) # log the disk we found to stderr
activate         sys.exit(0)  # don'\''t keep processing once the disk is found
activate sys.exit('\''no disk found with OSD ID 0'\'')
activate '
activate no disk found with OSD ID 0
activate + ceph-volume raw list
activate + cat /tmp/tmp.RnodTkKWVr
activate {}
activate ++ find_device
activate ++ python3 -c '
activate import sys, json
activate for _, info in json.load(sys.stdin).items():
activate     if info['\''osd_id'\''] == 0:
activate         print(info['\''device'\''], end='\'''\'')
activate         print('\''found device: '\'' + info['\''device'\''], file=sys.stderr) # log the disk we found to stderr
activate         sys.exit(0)  # don'\''t keep processing once the disk is found
activate sys.exit('\''no disk found with OSD ID 0'\'')
activate '
activate no disk found with OSD ID 0
activate + DEVICE=
Stream closed EOF for rook-ceph/rook-ceph-osd-0-d7bdb7874-snkx6 (activate)
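The `find_device` step in the activate log is an inline Python script that scans the JSON from `ceph-volume raw list` for the entry whose `osd_id` matches. With this regression the command returns `{}`, so the loop matches nothing, the script exits with "no disk found", and `DEVICE` ends up empty. A self-contained sketch of that logic (the healthy sample data is illustrative, built from the IDs visible in the log):

```python
import json

def find_device(raw_list_json, osd_id):
    # Mimics the activate container's inline find_device script: scan the
    # `ceph-volume raw list` JSON for the device with the given OSD ID.
    for _, info in json.loads(raw_list_json).items():
        if info['osd_id'] == osd_id:
            return info['device']
    return None  # the real script exits with 'no disk found with OSD ID ...'

# With the regression, `ceph-volume raw list` returns an empty object:
print(find_device('{}', 0))  # None -> the log's 'no disk found with OSD ID 0'

# With a healthy listing the device is found:
healthy = json.dumps({
    '447c4553-d301-457a-b48d-69c1a6afe74a': {
        'osd_id': 0,
        'device': '/dev/mapper/vg_rook-lv_rook3',
    }
})
print(find_device(healthy, 0))  # /dev/mapper/vg_rook-lv_rook3
```

So the OSD pod fails not because the LV is gone, but because `ceph-volume raw list` no longer reports it.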

About this issue

  • Original URL
  • State: open
  • Created 8 months ago
  • Comments: 18 (9 by maintainers)

Most upvoted comments

@travisn any chance rook CI could test against latest-<ceph-release>-devel instead of just stable/released ceph tags? If we could test with that tag, we wouldn’t have to wait until a regression is introduced and released in Ceph to catch it in Rook CI.

@rkachach is opening an issue for the CI improvements to run our canary tests in the daily CI against those tags. We have other test suites that run against those tags, just not those canary tests with more combinations of OSDs.

Note that we expect v18.2.1 to have the fix. A test PR #13203 showed that the canary tests were all successful with the tag quay.io/ceph/daemon-base:latest-reef-devel.

Now we just need a v17.2.8 release with the fix as well.

Independent of the crash, the revert in https://github.com/ceph/ceph/pull/54392 is still needed to undo the change for LVs; otherwise Rook won’t find the expected devices for the OSDs. With the revert applied, the device is found again, which also avoids the crash.

This reverts the change from https://github.com/ceph/ceph/pull/52429, which introduced the regression.

Agreed, this is also a blocker for v18.2.1. I was just chatting with @guits; he is looking into it, and we should have a better understanding of the fix by tomorrow. Thus far there is no workaround other than rolling back to v17.2.6.