rook: ceph-volume command hangs while adding OSDs to the cluster
Is this a bug report or feature request?
- Bug Report
Hi, I was trying to add some OSDs to my Ceph cluster, but the osd-prepare job got stuck in the Running state. While scanning the job's log, I found that it hung while executing a ceph-volume command. We used the lvmconfig command and found that udev_sync and udev_rules were both set to 1; I've posted the content of lvm.conf from my device below.
```
2023-10-20 11:27:49.613068 D | cephosd: &{Name:sdd Parent: HasChildren:false DevLinks:/dev/disk/by-id/scsi-SATA_QEMU_HARDDISK_QM00007 /dev/disk/by-path/pci-0000:00:1f.2-ata-4 /dev/disk/by-path/pci-0000:00:1f.2-ata-4.0 /dev/disk/by-uuid/2023-10-20-09-59-11-00 /dev/disk/by-id/ata-QEMU_HARDDISK_QM00007 /dev/disk/by-id/scsi-0ATA_QEMU_HARDDISK_QM00007 /dev/disk/by-label/config-2 /dev/disk/by-diskseq/4 /dev/disk/by-id/scsi-1ATA_QEMU_HARDDISK_QM00007 Size:1048576 UUID:8a8532c9-2fdb-43ed-a5c4-8331065ed8d1 Serial:QEMU_HARDDISK_QM00007 Type:disk Rotational:true Readonly:false Partitions:[] Filesystem:iso9660 Mountpoint: Vendor:ATA Model:QEMU_HARDDISK WWN: WWNVendorExtension: Empty:false CephVolumeData: RealPath:/dev/sdd KernelName:sdd Encrypted:false}
2023-10-20 11:27:49.613080 I | cephosd: skipping device "sda1" with mountpoint "boot"
2023-10-20 11:27:49.613085 I | cephosd: skipping device "sda2" with mountpoint "rootfs"
2023-10-20 11:27:49.613089 I | cephosd: old lsblk can't detect bluestore signature, so try to detect here
2023-10-20 11:27:49.614402 D | exec: Running command: lsblk /dev/sdb --bytes --nodeps --pairs --paths --output SIZE,ROTA,RO,TYPE,PKNAME,NAME,KNAME,MOUNTPOINT,FSTYPE
2023-10-20 11:27:49.620276 D | sys: lsblk output: "SIZE=\"214748364800\" ROTA=\"1\" RO=\"0\" TYPE=\"disk\" PKNAME=\"\" NAME=\"/dev/sdb\" KNAME=\"/dev/sdb\" MOUNTPOINT=\"\" FSTYPE=\"\""
2023-10-20 11:27:49.620308 D | exec: Running command: ceph-volume inventory --format json /dev/sdb
```
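To confirm where things hang, I put together a small Go sketch that runs the same inventory command Rook runs, but with a deadline. This is not Rook's code; it assumes `ceph-volume` is on the PATH and uses `/dev/sdb` from the logs above:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Give ceph-volume a generous deadline; in the healthy case it
	// returns within seconds, but in my case it never returns at all.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// The same command the osd-prepare job was stuck on.
	cmd := exec.CommandContext(ctx, "ceph-volume", "inventory", "--format", "json", "/dev/sdb")
	out, err := cmd.CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		fmt.Println("ceph-volume hung past the deadline (reproduces the issue)")
		return
	}
	if err != nil {
		fmt.Printf("ceph-volume failed: %v\n%s", err, out)
		return
	}
	fmt.Printf("ceph-volume finished normally:\n%s", out)
}
```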
I found a similar issue here, but it didn't help.
```
# Configuration option activation/udev_sync.
# Use udev notifications to synchronize udev and LVM.
# The --noudevsync option overrides this setting.
# When disabled, LVM commands will not wait for notifications from
# udev, but continue irrespective of any possible udev processing in
# the background. Only use this if udev is not running or has rules
# that ignore the devices LVM creates. If enabled when udev is not
# running, and LVM processes are waiting for udev, run the command
# 'dmsetup udevcomplete_all' to wake them up.
# This configuration option has an automatic default value.
# udev_sync = 1

# Configuration option activation/udev_rules.
# Use udev rules to manage LV device nodes and symlinks.
# When disabled, LVM will manage the device nodes and symlinks for
# active LVs itself. Manual intervention may be required if this
# setting is changed while LVs are active.
# This configuration option has an automatic default value.
# udev_rules = 1
```
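Worth noting: the comment block above itself points at `dmsetup udevcomplete_all` as the way to wake up LVM commands stuck waiting on udev. A minimal Go sketch of invoking it, purely illustrative; on a node you would just run the command directly in a shell:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// As the lvm.conf comment suggests, wake up LVM processes that
	// are blocked waiting for udev notifications.
	cmd := exec.Command("dmsetup", "udevcomplete_all")
	// The command may ask for confirmation; answer "y" just in case.
	cmd.Stdin = strings.NewReader("y\n")
	out, err := cmd.CombinedOutput()
	if err != nil {
		fmt.Printf("dmsetup failed: %v\n%s", err, out)
		return
	}
	fmt.Printf("%s", out)
}
```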
From the above we can see that the default values of udev_rules and udev_sync are 1. After searching Rook's source code, we found the following code:
```go
func UpdateLVMConfig(context *clusterd.Context, onPVC, lvBackedPV bool) error {
	input, err := os.ReadFile(lvmConfPath)
	if err != nil {
		return errors.Wrapf(err, "failed to read lvm config file %q", lvmConfPath)
	}

	// Flip the udev-related settings by plain substring replacement.
	output := bytes.Replace(input, []byte("udev_sync = 1"), []byte("udev_sync = 0"), 1)
	output = bytes.Replace(output, []byte("allow_changes_with_duplicate_pvs = 0"), []byte("allow_changes_with_duplicate_pvs = 1"), 1)
	output = bytes.Replace(output, []byte("udev_rules = 1"), []byte("udev_rules = 0"), 1)
	output = bytes.Replace(output, []byte("use_lvmetad = 1"), []byte("use_lvmetad = 0"), 1)
	output = bytes.Replace(output, []byte("obtain_device_list_from_udev = 1"), []byte("obtain_device_list_from_udev = 0"), 1)
	// ... (snippet truncated; the function goes on to write output back to lvmConfPath)
```
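I suspect, and this is my assumption rather than something I have confirmed, that the replacement has no effect here because the settings in my lvm.conf exist only as commented-out lines like `# udev_sync = 1`. The substring replacement then rewrites the comment, the line stays commented, and LVM keeps its automatic default of 1. A minimal sketch demonstrating this:

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	// Excerpt from the lvm.conf posted above: the setting is only a comment.
	input := []byte("# This configuration option has an automatic default value.\n# udev_sync = 1\n")

	// The same substring replacement Rook performs.
	output := bytes.Replace(input, []byte("udev_sync = 1"), []byte("udev_sync = 0"), 1)

	fmt.Printf("%s", output)
	// Prints:
	//   # This configuration option has an automatic default value.
	//   # udev_sync = 0
	// The line is still commented out, so LVM falls back to its
	// automatic default of udev_sync = 1 and the change has no effect.
}
```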
Obviously these settings didn't take effect in my case. Have you faced situations like this, and can you push a PR to fix this on the version (11.4) we are using?

How to reproduce it (minimal and precise):
File(s) to submit:
- Cluster CR (custom resource), typically called cluster.yaml, if necessary
Logs to submit:
- Operator's logs, if necessary
- Crashing pod(s) logs, if necessary

To get logs, use `kubectl -n <namespace> logs <pod name>`. When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI. Read GitHub documentation if you need help.
Cluster Status to submit:
- Output of kubectl commands, if necessary

To get the health of the cluster, use `kubectl rook-ceph health`. To get the status of the cluster, use `kubectl rook-ceph ceph status`. For more details, see the Rook kubectl Plugin.
Environment:
- OS (e.g. from /etc/os-release): RHEL 9.2
- Kernel (e.g. `uname -a`): 5.14.0-284.11.1
- Cloud provider or hardware configuration:
- Rook version (use `rook version` inside of a Rook Pod): 11.4
- Storage backend version (e.g. for ceph do `ceph -v`): 17.2.6
- Kubernetes version (use `kubectl version`): 1.22.6
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): RKE2
- Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox):
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Comments: 27 (13 by maintainers)
yes
Yes, I've tested Rook on CentOS 7.9, Ubuntu 22.04, and RedHat 7.9 with the same images, and they were working well.
The `blkid` command got stuck while executing.

Not yet. I'm trying to reproduce this problem as a first step.