rook: ceph: failed to initialize OSD
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior:
The OSD prepare pod sometimes fails when creating a new OSD in a PVC-based cluster.
It is caused by a random failure of `ceph-volume raw prepare`.
I submitted an issue about this problem in the Ceph issue tracker because I suspect it's a Ceph problem:
https://tracker.ceph.com/issues/51034
For more details, please refer to the above-mentioned issue.
Expected behavior:
OSD initialization succeeds.
How to reproduce it (minimal and precise):
Create OSDs on PVC repeatedly; the failure reproduces roughly 10% of the time. A sketch of a PVC-based device set is shown below for reference.
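For illustration only (the set name, count, storage class, and size are hypothetical, not taken from the reporter's cluster), OSDs on PVC are created through a `storageClassDeviceSets` entry in the CephCluster spec, roughly like this:

```yaml
# Hypothetical excerpt of a CephCluster spec that creates OSDs on PVC.
# Set name, count, storage class, and size are placeholders.
spec:
  storage:
    storageClassDeviceSets:
      - name: set1
        count: 3
        portable: false
        volumeClaimTemplates:
          - metadata:
              name: data
            spec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 100Gi
              storageClassName: local-block   # placeholder block storage class
              volumeMode: Block
```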
**Workaround**:
Specify `bluefs_buffered_io = false` in the rook-config-override ConfigMap, as sketched below.
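A minimal sketch of that ConfigMap, assuming the cluster namespace is `rook-ceph`:

```yaml
# Sketch of the rook-config-override ConfigMap carrying the workaround.
# The namespace is an assumption; adjust it to the namespace of your CephCluster.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [osd]
    bluefs_buffered_io = false
```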
File(s) to submit:
- Ceph-related files
  - ceph.conf
  - ceph-volume.log
  - osd-mkfs-1622521334.log: a failure log of `ceph-osd --mkfs`
- Rook-specific files
  - rook-ceph-osd-prepare-set1-data-5.yaml: the problematic OSD's prepare deployment
  - osd-prepare-set1-data-5.pod.log: the log of the OSD prepare pod
  - osd-prepare-event.log: the prepare pod's Kubernetes event log
  - cephcluster yaml: the CephCluster resource
Environment:
- OS: flatcar
- Kernel: 5.10.38-flatcar
- Rook version (use `rook version` inside of a Rook Pod): v1.6.3
- Storage backend version (e.g. for Ceph, `ceph -v`): Ceph 16.2.4
- Kubernetes version (use `kubectl version`): v1.20.7
Additional information:
- This problem didn’t happen in Ceph 15.2.8.
- I haven't confirmed whether this problem happens on a host-based cluster.
Getting this error on a fresh test cluster. The weird thing is that this uses the same version of Rook (1.9.1), Ceph (v17.2.0-20220420), almost the same cluster YAML, and the same storage class (deploy/examples/csi/rbd/storageclass.yml) as my production cluster, which has been running for a while (though it wasn't stood up directly on 1.9.1 / v17.2.0-20220420; it was upgraded to that from an older version). Compared to the cluster.yml I used in my production cluster setup, I added memory limits. Everything else is the same.
Here’s the diff between 1.9.1’s cluster.yaml and my file:
I created three new Debian 11.3 VMs from the netinst image with just standard system utilities and SSH server, and partitioned without swap, then set up the same kubespray inventory settings as my production cluster, deployed with kubespray 2.19.0. This is the kubespray inventory:
inventory.zip
My intent was to set up the same versions on my test cluster as in my production cluster so that I can apply upgrades to the test cluster first so that I know what to expect and how to not screw up production. (Yeah, I did that in the wrong order, should’ve had the test cluster up first…oof)
Will I need to upgrade the ceph version to get this to work?
edit: ok, nevermind that, I removed the memory limits, and it’s working ok for now (I guess I’ll see what happens when I deploy other stuff to the cluster)
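For context (values are illustrative, not the reporter's actual diff), OSD memory limits in a CephCluster spec live under `spec.resources.osd`; the stanza that was removed would have looked something like:

```yaml
# Hypothetical OSD resources stanza in a CephCluster spec.
# A too-low memory limit can get OSD pods OOM-killed during initialization.
spec:
  resources:
    osd:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        memory: 4Gi
```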
This problem was fixed in v16.2.6.
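For anyone who needs the fixed release, the Ceph container image is selected via `spec.cephVersion.image` in the CephCluster resource; a sketch with the fixed tag:

```yaml
# Example of pointing the cluster at a Ceph release containing the fix.
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v16.2.6
```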
I saw this in the CI, which runs v16.2.
I can try?
I'll ask the Ceph team again for more input.
@satoru-takeuchi I tried four or five times with different device sets of four devices (WD Red Pro, if that matters), and once with three devices, which worked right away; adding a single device afterwards also worked.
@satoru-takeuchi I'm on 16.2.4. I'm not using PVC-backed OSDs; these are raw devices with BlueStore. I also don't have RBD or CephFS deployed here; this setup only serves RGW.
Edit: Provisioning four OSDs on four devices failed; three devices worked.
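For contrast with the PVC-based case above, a raw-device (non-PVC) storage section of a CephCluster spec looks roughly like this; node and device names are placeholders:

```yaml
# Hypothetical raw-device storage section (no PVCs involved).
spec:
  storage:
    useAllNodes: false
    useAllDevices: false
    nodes:
      - name: node1        # placeholder node name
        devices:
          - name: sdb      # placeholder raw block device
          - name: sdc
```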
@leseb I’ll answer next week because today is paid holiday.