rook: Cluster unavailable after node reboot, symlink already exists
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: I'm using Rook Ceph with specific devices, identified by their IDs:
```yaml
helm_cephrook_nodes_devices:
  - name: "vm-kube-slave-1"
    devices:
      - name: "/dev/disk/by-id/scsi-36000c29d381154d5114acf6c54b09ab5"
[.......]
```
The Linux disk letter (sdX) can change when the node reboots, and this should not break the application.
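For illustration, a minimal check run directly on the node (using the device ID from the config above) shows that the by-id path stays stable while the sdX name it resolves to can differ between boots:

```bash
# The by-id symlink is stable across reboots...
ls -l /dev/disk/by-id/scsi-36000c29d381154d5114acf6c54b09ab5
# ...but the kernel name it resolves to (sdX) can change, e.g. /dev/sdg after this reboot
readlink -f /dev/disk/by-id/scsi-36000c29d381154d5114acf6c54b09ab5
```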
In practice, when the OSD starts, the activate init container detects the correct new disk, but a symlink to the old device is already present (a quick way to confirm this is sketched after the log below):
```
found device: /dev/sdg
+ DEVICE=/dev/sdg
+ [[ -z /dev/sdg ]]
+ ceph-volume raw activate --device /dev/sdg --no-systemd --no-tmpfs
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-3
Running command: /usr/bin/ceph-bluestore-tool prime-osd-dir --path /var/lib/ceph/osd/ceph-3 --no-mon-config --dev /dev/sdg
Running command: /usr/bin/chown -R ceph:ceph /dev/sdg
Running command: /usr/bin/ln -s /dev/sdg /var/lib/ceph/osd/ceph-3/block
stderr: ln: failed to create symbolic link '/var/lib/ceph/osd/ceph-3/block': File exists
Traceback (most recent call last):
File "/usr/sbin/ceph-volume", line 11, in <module>
load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__
self.main(self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
return f(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
terminal.dispatch(self.mapper, subcommand_args)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
terminal.dispatch(self.mapper, self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/activate.py", line 166, in main
systemd=not self.args.no_systemd)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/activate.py", line 88, in activate
systemd=systemd)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/activate.py", line 48, in activate_bluestore
prepare_utils.link_block(meta['device'], osd_id)
File "/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line 371, in link_block
_link_device(block_device, 'block', osd_id)
File "/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line 339, in _link_device
process.run(command)
File "/usr/lib/python3.6/site-packages/ceph_volume/process.py", line 147, in run
raise RuntimeError(msg)
RuntimeError: command returned non-zero exit status: 1
```
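A quick way to confirm the stale link, assuming you can exec into the failing OSD's activate init container (the OSD id and device are taken from the log above):

```bash
# Run inside the activate init container of the failing OSD (osd.3 per the log).
# The existing "block" link still refers to the device name from before the reboot...
ls -l /var/lib/ceph/osd/ceph-3/block
# ...while the by-id path now resolves to the newly assigned name (e.g. /dev/sdg)
readlink -f /dev/disk/by-id/scsi-36000c29d381154d5114acf6c54b09ab5
```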

Expected behavior: Rook Ceph should detect the correct disk when the node reboots; even if the sdX letter changes, the symlink should be recreated.
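A rough sketch of what that recreation could look like — not Rook's actual activate script; the OSD id, device variable, and ceph-volume flags are copied from the log above, and the dangling-link check mirrors what the fix commits listed further down describe:

```bash
# Sketch only, under the assumptions stated above.
OSD_DATA_DIR=/var/lib/ceph/osd/ceph-3
BLOCK_LINK="$OSD_DATA_DIR/block"

# If "block" is a symlink whose target no longer exists after the sdX rename,
# remove it so the plain `ln -s` inside ceph-volume does not fail with "File exists".
if [ -L "$BLOCK_LINK" ] && [ ! -e "$BLOCK_LINK" ]; then
  rm "$BLOCK_LINK"
fi

# DEVICE is the freshly detected device, e.g. /dev/sdg in the log above.
ceph-volume raw activate --device "$DEVICE" --no-systemd --no-tmpfs
```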
How to reproduce it (minimal and precise):
File(s) to submit:
- Cluster CR (custom resource), typically called cluster.yaml, if necessary

Logs to submit:
- Operator's logs, if necessary
- Crashing pod(s) logs, if necessary

To get logs, use kubectl -n <namespace> logs <pod name>. When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI. Read GitHub documentation if you need help.
Cluster Status to submit:
HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 21 pgs inactive; 570 slow ops, oldest one blocked for 125431 sec, daemons [osd.1,osd.2,osd.4] have slow ops.
```
sh-4.4$ ceph status
  cluster:
    id:     ecf8035e-5899-4327-9a70-b86daac1f642
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            Reduced data availability: 21 pgs inactive
            570 slow ops, oldest one blocked for 125447 sec, daemons [osd.1,osd.2,osd.4] have slow ops.

  services:
    mon: 1 daemons, quorum a (age 3d)
    mgr: a(active, since 114m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 5 osds: 3 up (since 66m), 3 in (since 8h)

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 49 pgs
    objects: 186 objects, 45 MiB
    usage:   347 MiB used, 150 GiB / 150 GiB avail
    pgs:     42.857% pgs unknown
             28 active+clean
             21 unknown
```
Environment:
- OS (e.g. from /etc/os-release): NAME="Red Hat Enterprise Linux" VERSION="8.6 (Ootpa)"
- Kernel (e.g. uname -a): Linux vm-kube-slave-6 4.18.0-372.19.1.el8_6.x86_64 #1 SMP Mon Jul 18 11:14:02 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
- Cloud provider or hardware configuration:
- Rook version (use rook version inside of a Rook Pod): 1.9.7
- Storage backend version (e.g. for ceph do ceph -v): filesystem
- Kubernetes version (use kubectl version): 1.23
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): RKE
- Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 18 (14 by maintainers)
Commits related to this issue
- osd: handle device name change and device removel correctly TBD Closes: https://github.com/rook/rook/issues/10860 Signed-off-by: Satoru Takeuchi <satoru.takeuchi@gmail.com> — committed to cybozu-go/rook by satoru-takeuchi a year ago
- osd: handle device name change and device removel correctly If a kernel device name change happens and a block device file in the OSD directory becomes missing, this OSD fails to start continuously. ... — committed to cybozu-go/rook by satoru-takeuchi a year ago
- osd: handle device name change and device removel correctly If a kernel device name change happens and a block device file in the OSD directory becomes dangling link, this OSD fails to start continuo... — committed to cybozu-go/rook by satoru-takeuchi a year ago
- osd: handle device name change and device removel correctly If a kernel device name change happens and a block device file in the OSD directory becomes dangling link, this OSD fails to start continuo... — committed to rook/rook by satoru-takeuchi a year ago
- osd: handle device name change and device removel correctly If a kernel device name change happens and a block device file in the OSD directory becomes dangling link, this OSD fails to start continuo... — committed to rook/rook by satoru-takeuchi a year ago
- osd: handle device name change and device removel correctly If a kernel device name change happens and a block device file in the OSD directory becomes dangling link, this OSD fails to start continuo... — committed to koor-tech/koor by satoru-takeuchi a year ago
@travisn I'm testing #11567, which resolves this issue. A few tests remain; I'll finish them today.
It is taking a long time because I have little spare time and there are many test cases.
Thanks a lot for the fix, I’m just waiting for the next release 😃
I’m still investigating this issue. This problem might be in ceph…