rook: Ceph v17.2.7 fails to create or start OSDs configured on PVs or LVs
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: Ceph v17.2.7 is failing to start OSDs in certain configurations:
- OSDs on LVs, as seen here
- OSDs backed by PVs created with storageClassDeviceSets (the cluster-on-pvc example), as seen in the attempt to update the CI to v17.2.7 in #13127, where the CI is failing against the PVC configuration of the canary tests
Expected behavior: OSDs should be created and upgraded successfully in all supported configurations.
How to reproduce it (minimal and precise):
Create a v17.2.7 cluster or upgrade to Ceph v17.2.7 with OSDs on LVs or PVs.
The symptoms are that the OSDs fail to be created, as seen in the OSD prepare log:
[2023-10-31 17:29:08,090][ceph_volume.devices.raw.list][DEBUG ] inspecting devices: ['/mnt/set1-data-0bv5xb']
[2023-10-31 17:29:08,090][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
return f(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
terminal.dispatch(self.mapper, subcommand_args)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
terminal.dispatch(self.mapper, self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 166, in main
self.list(args)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 122, in list
report = self.generate(args.device)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 91, in generate
info_device = [info for info in info_devices if info['NAME'] == dev][0]
IndexError: list index out of range
2023-10-31 17:29:08.150895 C | rookcmd: failed to configure devices: failed to get device already provisioned by ceph-volume raw: failed to retrieve ceph-volume raw list results: failed ceph-volume call (see ceph-volume log above for more details): exit status 1
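For context, the IndexError comes from the lookup shown at devices/raw/list.py line 91 in the traceback: ceph-volume filters the device entries it collected for one whose NAME equals the path it was asked to inspect, and when nothing matches, indexing [0] on the empty list throws. Below is a minimal sketch of that failure mode; the device path is the one from the prepare log, while the NAME value is a hypothetical stand-in for what the enumeration reports for the same block device.

```python
# Sketch of the failing lookup from ceph_volume/devices/raw/list.py
# (line 91 in the traceback above). The NAME value is hypothetical.
dev = '/mnt/set1-data-0bv5xb'            # path Rook passes to `ceph-volume raw list`
info_devices = [{'NAME': '/dev/xvdbv'}]  # assumed entry collected for the same device

try:
    # When no collected entry has a NAME equal to the path that was passed in,
    # the comprehension is empty and [0] raises IndexError -- the crash seen
    # in the OSD prepare log.
    info_device = [info for info in info_devices if info['NAME'] == dev][0]
except IndexError:
    print('no entry matched', dev)
```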
Alternatively, when an existing LV-based OSD starts, the issue is observed in the activate init container, where the disk is not found:
activate + OSD_ID=0
stream logs failed container "expand-bluefs" in pod "rook-ceph-osd-0-d7bdb7874-snkx6" is waiting to start: PodInitializing for rook-ceph/rook-ceph-osd-0-d7bdb7874-snkx6 (expand-bluefs)
stream logs failed container "chown-container-data-dir" in pod "rook-ceph-osd-0-d7bdb7874-snkx6" is waiting to start: PodInitializing for rook-ceph/rook-ceph-osd-0-d7bdb7874-snkx6 (chown-container-data-dir)
stream logs failed container "osd" in pod "rook-ceph-osd-0-d7bdb7874-snkx6" is waiting to start: PodInitializing for rook-ceph/rook-ceph-osd-0-d7bdb7874-snkx6 (osd)
activate + CEPH_FSID=dda6c867-4047-4fb9-a744-1575da352c5d
activate + OSD_UUID=447c4553-d301-457a-b48d-69c1a6afe74a
activate + OSD_STORE_FLAG=--bluestore
activate + OSD_DATA_DIR=/var/lib/ceph/osd/ceph-0
activate + CV_MODE=raw
activate + DEVICE=/dev/mapper/vg_rook-lv_rook3
activate + cp --no-preserve=mode /etc/temp-ceph/ceph.conf /etc/ceph/ceph.conf
activate + python3 -c '
activate import configparser
activate
activate config = configparser.ConfigParser()
activate config.read('\''/etc/ceph/ceph.conf'\'')
stream logs failed container "log-collector" in pod "rook-ceph-osd-0-d7bdb7874-snkx6" is waiting to start: PodInitializing for rook-ceph/rook-ceph-osd-0-d7bdb7874-snkx6 (log-collector)
activate
activate if not config.has_section('\''global'\''):
activate config['\''global'\''] = {}
activate
activate if not config.has_option('\''global'\'','\''fsid'\''):
activate config['\''global'\'']['\''fsid'\''] = '\''dda6c867-4047-4fb9-a744-1575da352c5d'\''
activate
activate with open('\''/etc/ceph/ceph.conf'\'', '\''w'\'') as configfile:
activate config.write(configfile)
activate '
activate + ceph -n client.admin auth get-or-create osd.0 mon 'allow profile osd' mgr 'allow profile osd' osd 'allow *' -k /etc/ceph/admin-keyring-store/keyring
activate [osd.0]
activate key = AQBCJEJlDLPvGxAAfuSxVKFq6S/sR/EzuGvbcg==
activate + [[ raw == \l\v\m ]]
activate ++ mktemp
activate + OSD_LIST=/tmp/tmp.RnodTkKWVr
activate + ceph-volume raw list /dev/mapper/vg_rook-lv_rook3
activate + cat /tmp/tmp.RnodTkKWVr
activate {}
activate + find_device
activate + python3 -c '
activate import sys, json
activate for _, info in json.load(sys.stdin).items():
activate if info['\''osd_id'\''] == 0:
activate print(info['\''device'\''], end='\'''\'')
activate print('\''found device: '\'' + info['\''device'\''], file=sys.stderr) # log the disk we found to stderr
activate sys.exit(0) # don'\''t keep processing once the disk is found
activate sys.exit('\''no disk found with OSD ID 0'\'')
activate '
activate no disk found with OSD ID 0
activate + ceph-volume raw list
activate + cat /tmp/tmp.RnodTkKWVr
activate {}
activate ++ find_device
activate ++ python3 -c '
activate import sys, json
activate for _, info in json.load(sys.stdin).items():
activate if info['\''osd_id'\''] == 0:
activate print(info['\''device'\''], end='\'''\'')
activate print('\''found device: '\'' + info['\''device'\''], file=sys.stderr) # log the disk we found to stderr
activate sys.exit(0) # don'\''t keep processing once the disk is found
activate sys.exit('\''no disk found with OSD ID 0'\'')
activate '
activate no disk found with OSD ID 0
activate + DEVICE=
Stream closed EOF for rook-ceph/rook-ceph-osd-0-d7bdb7874-snkx6 (activate)
stream logs failed container "osd" in pod "rook-ceph-osd-0-d7bdb7874-snkx6" is waiting to start: PodInitializing for rook-ceph/rook-ceph-osd-0-d7bdb7874-snkx6 (osd)
@travisn FYI https://github.com/ceph/ceph/pull/54514
@rkachach is opening an issue for the CI improvements to run our canary tests in the daily CI against those tags. We have other test suites that run against those tags, just not those canary tests with more combinations of OSDs.
Note that we expect v18.2.1 to have the fix. A test PR #13203 showed that the canary tests were all successful with the tag quay.io/ceph/daemon-base:latest-reef-devel. Now we just need a v17.2.8 release with the fix as well.
Independent of the crash, the revert in https://github.com/ceph/ceph/pull/54392 is still needed to undo the change for LVs; otherwise Rook won't find the expected devices for the OSDs. With that change reverted, the device is found again, which in turn avoids the crash.
Reverting here the change from https://github.com/ceph/ceph/pull/52429, which introduced this regression.
Agreed, this is also a blocker for v18.2.1. I was just chatting with @guits; he is looking into it, and we should have a better understanding of the fix by tomorrow. Thus far there is no workaround other than rolling back to v17.2.6.