rook: OSD Prepare fails due to "unparsable uuid"

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: rook-ceph-osd-prepare fails with “unparsable uuid”, leaves the drives dirty and ignores them on retry, and the CephCluster ends up with no OSDs.

Expected behavior: OSDs get prepared and CephCluster is alive and well.

How to reproduce it (minimal and precise):

Single-node k3s cluster on a dedicated server with 2 HDDs dedicated to Ceph; in my case this is managed by a rook-ceph-cluster Helm release.

Environment:

  • OS (e.g. from /etc/os-release): Ubuntu 22.04 LTS
  • Kernel (e.g. uname -a): Linux ... 5.15.0-27-generic #28-Ubuntu SMP Thu Apr 14 04:55:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Cloud provider or hardware configuration: Hetzner AX-51 /w 2 NVMEs for system & 2 HDDs for Ceph
  • Rook version (use rook version inside of a Rook Pod): v1.9.2
  • Storage backend version (e.g. for ceph do ceph -v): ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
  • Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6+k3s1", GitCommit:"418c3fa858b69b12b9cefbcff0526f666a6236b9", GitTreeState:"clean", BuildDate:"2022-04-28T22:16:18Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): k3s
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_WARN Reduced data availability: 32 pgs inactive; OSD count 0 < osd_pool_default_size 1

I’ve been pointed to the following issues that seem similar: #9646 and #8023. However, both have different errors.

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 31 (18 by maintainers)

Most upvoted comments

I’ve just had the same issue; fixed it by raising the prepareosd memory resource limit from 400Mi to 800Mi.

And yeah, “_read_fsid unparsable uuid” is not an error on its own.

rook-ceph 1.9.9
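The workaround above can be expressed in the CephCluster CR roughly like this (a sketch: `prepareosd` is the Rook resource key for the OSD prepare jobs; the 800Mi limit is just the value reported above, and the request value is an assumption, not a tuned recommendation):

```yaml
# Sketch only -- the limit is the value that worked for this commenter,
# not a recommendation; the request value is an assumption.
spec:
  resources:
    prepareosd:
      requests:
        memory: "400Mi"
      limits:
        memory: "800Mi"
```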

@travisn Many users have hit this problem, so it seems better to add a `resources` field for the prepare pods to the CephCluster CR examples by default. Does that make sense?

Shared it with the email address on your profile. Thanks.

@frittentheke

I was about to raise an issue on the ceph issue tracker about this when I found this here …

Have you already opened an issue?

No, but I will contribute to https://tracker.ceph.com/issues/54019 which is apparently tracking this.

@logan2211 Could you show me the operator log which indicates why OSD creation failed?

rook-ceph-osd-prepare-host-1-46fjf provision Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/keyring
rook-ceph-osd-prepare-host-1-46fjf provision Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-1/
rook-ceph-osd-prepare-host-1-46fjf provision Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 1 --monmap /var/lib/ceph/osd/ceph-1/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-1/ --osd-uuid 2ed38de9-367f-4883-9ad1-d5a0221b971f --setuser ceph --setgroup ceph
rook-ceph-osd-prepare-host-1-46fjf provision  stderr: 2022-05-16T00:32:14.961+0000 7f3a76bb5080 -1 bluestore(/var/lib/ceph/osd/ceph-1//block) _read_bdev_label bad crc on label, expected 608112536 != actual 946372990
rook-ceph-osd-prepare-host-1-46fjf provision  stderr: 2022-05-16T00:32:14.961+0000 7f3a76bb5080 -1 bluestore(/var/lib/ceph/osd/ceph-1//block) _read_bdev_label bad crc on label, expected 608112536 != actual 946372990
rook-ceph-osd-prepare-host-1-46fjf provision  stderr: 2022-05-16T00:32:14.961+0000 7f3a76bb5080 -1 bluestore(/var/lib/ceph/osd/ceph-1//block) _read_bdev_label bad crc on label, expected 608112536 != actual 946372990
rook-ceph-osd-prepare-host-1-46fjf provision  stderr: 2022-05-16T00:32:14.961+0000 7f3a76bb5080 -1 bluestore(/var/lib/ceph/osd/ceph-1/) _read_fsid unparsable uuid
rook-ceph-osd-prepare-host-1-46fjf provision --> Was unable to complete a new OSD, will rollback changes
rook-ceph-osd-prepare-host-1-46fjf provision Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.1 --yes-i-really-mean-it
rook-ceph-osd-prepare-host-1-46fjf provision  stderr: purged osd.1
rook-ceph-osd-prepare-host-1-46fjf provision Traceback (most recent call last):
rook-ceph-osd-prepare-host-1-46fjf provision   File "/usr/sbin/ceph-volume", line 11, in <module>
rook-ceph-osd-prepare-host-1-46fjf provision     load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
rook-ceph-osd-prepare-host-1-46fjf provision   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 40, in __init__
rook-ceph-osd-prepare-host-1-46fjf provision     self.main(self.argv)
rook-ceph-osd-prepare-host-1-46fjf provision   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
rook-ceph-osd-prepare-host-1-46fjf provision     return f(*a, **kw)
rook-ceph-osd-prepare-host-1-46fjf provision   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 152, in main
rook-ceph-osd-prepare-host-1-46fjf provision     terminal.dispatch(self.mapper, subcommand_args)
rook-ceph-osd-prepare-host-1-46fjf provision   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
rook-ceph-osd-prepare-host-1-46fjf provision     instance.main()
rook-ceph-osd-prepare-host-1-46fjf provision   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
rook-ceph-osd-prepare-host-1-46fjf provision     terminal.dispatch(self.mapper, self.argv)
rook-ceph-osd-prepare-host-1-46fjf provision   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
rook-ceph-osd-prepare-host-1-46fjf provision     instance.main()
rook-ceph-osd-prepare-host-1-46fjf provision   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 169, in main
rook-ceph-osd-prepare-host-1-46fjf provision     self.safe_prepare(self.args)
rook-ceph-osd-prepare-host-1-46fjf provision   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 91, in safe_prepare
rook-ceph-osd-prepare-host-1-46fjf provision     self.prepare()
rook-ceph-osd-prepare-host-1-46fjf provision   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
rook-ceph-osd-prepare-host-1-46fjf provision     return func(*a, **kw)
rook-ceph-osd-prepare-host-1-46fjf provision   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 134, in prepare
rook-ceph-osd-prepare-host-1-46fjf provision     tmpfs,
rook-ceph-osd-prepare-host-1-46fjf provision   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/prepare.py", line 68, in prepare_bluestore
rook-ceph-osd-prepare-host-1-46fjf provision     db=db
rook-ceph-osd-prepare-host-1-46fjf provision   File "/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line 481, in osd_mkfs_bluestore
rook-ceph-osd-prepare-host-1-46fjf provision     raise RuntimeError('Command failed with exit code %s: %s' % (returncode, ' '.join(command)))
rook-ceph-osd-prepare-host-1-46fjf provision RuntimeError: Command failed with exit code 250: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 1 --monmap /var/lib/ceph/osd/ceph-1/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-1/ --osd-uuid 2ed38de9-367f-4883-9ad1-d5a0221b971f --setuser ceph --setgroup ceph: exit status 1

I confirmed that v1.8.9 works normally. I tested v1.9.0, 1.9.1, 1.9.2, and 1.9.3 and all produce the error above when preparing the OSD.

“_read_fsid unparsable uuid” itself is not a bug; it is also shown when OSD preparation succeeds:

...
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/keyring
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/
Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid a8c30bd5-acae-49fa-8cba-07ab9f51ea90 --setuser ceph --setgroup ceph
 stderr: 2022-05-15T13:17:30.971+0000 7f58c22d0080 -1 bluestore(/var/lib/ceph/osd/ceph-0/) _read_fsid unparsable uuid
--> ceph-volume raw clear prepare successful for: /dev/sdc1
2022-05-15 13:17:33.261204 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log lvm list  --format json
2022-05-15 13:17:33.610800 D | cephosd: {}
2022-05-15 13:17:33.610826 I | cephosd: 0 ceph-volume lvm osd devices configured on this node
2022-05-15 13:17:33.610873 D | exec: Running command: stdbuf -oL ceph-volume --log-path /tmp/ceph-log raw list --format json
2022-05-15 13:17:34.225929 D | cephosd: {
    "a8c30bd5-acae-49fa-8cba-07ab9f51ea90": {
        "ceph_fsid": "f0c6110a-e1b3-4a7b-b14f-54d3d67fdde1",
        "device": "/dev/sdc1",
        "osd_id": 0,
        "osd_uuid": "a8c30bd5-acae-49fa-8cba-07ab9f51ea90",
        "type": "bluestore"
    }
}
2022-05-15 13:17:34.226077 D | exec: Running command: lsblk /dev/sdc1 --bytes --nodeps --pairs --paths --output SIZE,ROTA,RO,TYPE,PKNAME,NAME,KNAME
2022-05-15 13:17:34.227808 I | cephosd: setting device class "hdd" for device "/dev/sdc1"
2022-05-15 13:17:34.227911 I | cephosd: 1 ceph-volume raw osd devices configured on this node
2022-05-15 13:17:34.227990 I | cephosd: devices = [{ID:0 Cluster:ceph UUID:a8c30bd5-acae-49fa-8cba-07ab9f51ea90 DevicePartUUID: DeviceClass:hdd BlockPath:/dev/sdc1 MetadataPath: WalPath: SkipLVRelease:true Location:root=default host=ubuntu2004 LVBackedPV:false CVMode:raw Store:bluestore TopologyAffinity:}]
...
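Why the message is benign in the success log above can be mimicked with plain shell (a sketch under assumptions: this imitates the fsid check that `ceph-osd --mkfs` performs on a fresh osd-data directory, it does not run any Ceph code): there is no parsable fsid yet, so the first read fails with the warning, and a fresh uuid is then generated and written.

```shell
# Sketch only (assumption: imitates BlueStore's fsid handling on a fresh
# data dir; does not run any Ceph code).
osd_data="$(mktemp -d)"
touch "$osd_data/fsid"                      # fresh dir: fsid file is empty
uuid_re='^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$'
if ! grep -Eq "$uuid_re" "$osd_data/fsid"; then
  # this is the situation logged as "_read_fsid unparsable uuid"
  echo "unparsable uuid -- generating a new one"
  cat /proc/sys/kernel/random/uuid > "$osd_data/fsid"
fi
grep -Eq "$uuid_re" "$osd_data/fsid" && echo "fsid written"
```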

The messages indicating the actual problem are in the lines just before it:

 stderr: 2022-05-05T22:42:46.413+0000 7fb83eaf8080 -1 bluestore(/var/lib/ceph/osd/ceph-0//block) _read_bdev_label bad crc on label, expected 2962224517 != actual 1963126265
 stderr: 2022-05-05T22:42:46.413+0000 7fb83eaf8080 -1 bluestore(/var/lib/ceph/osd/ceph-0//block) _read_bdev_label bad crc on label, expected 2962224517 != actual 1963126265
 stderr: 2022-05-05T22:42:46.413+0000 7fb83eaf8080 -1 bluestore(/var/lib/ceph/osd/ceph-0//block) _read_bdev_label bad crc on label, expected 2962224517 != actual 1963126265
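The bad-crc lines above mean a stale BlueStore label survives at the start of a drive left dirty by a failed prepare; zeroing the first few MiB removes it so the next prepare run sees a clean device. A sketch (assumption: `/dev/sdX` is illustrative, and the demo below operates on a scratch file so nothing real is touched; on a real drive the equivalent is roughly `sgdisk --zap-all /dev/sdX` followed by `dd if=/dev/zero of=/dev/sdX bs=1M count=100 oflag=direct`):

```shell
# Demo on a scratch file (stand-in for /dev/sdX -- nothing real is wiped).
disk="$(mktemp)"
dd if=/dev/urandom of="$disk" bs=1M count=4 2>/dev/null  # fake stale label
dd if=/dev/zero of="$disk" bs=1M count=4 conv=notrunc 2>/dev/null
head -c 4194304 /dev/zero | cmp -s - "$disk" && echo "label cleared"
```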

However, I couldn’t reproduce this problem yet. My test conditions were as follows.

In all combinations, OSD started without any problems.

@alyti Do you remember the contents of the target device before creating the Rook/Ceph cluster?

@travisn There seem to be multiple problems in OSD creation. At least it doesn’t come only from #10212, because that change is not in v1.9.2, yet @alyti’s problem and the problem in #10160 happened in v1.9.2 or older. I’m not sure whether the root cause is in Rook or Ceph. In any case, I’ll handle both this issue and #10160.

@logan2211 Thanks for the data point that this is a regression. This seems to be related to #10212, not sure there is a workaround other than sticking with v1.9.2. @satoru-takeuchi Can you take a look? Sounds like a number of users are hitting this.

I have the same “unparsable uuid” error when deploying new OSDs on 1.9.3. I’ve been deploying Rook clusters for years with:

      storage:
        devicePathFilter: "^/dev/disk/by-partlabel/OSD[0-9]+"

The particular cluster I’m experiencing this issue on was deployed with 1.8.9. Then, after upgrading to 1.9.3, it no longer adds new OSDs because of the unparsable uuid error. Existing OSDs work fine. It doesn’t seem feasible to completely rebuild using PVCs. I’m using k3s on bare metal @travisn.