rook: Ceph PG repair stuck forever
[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph -s
  cluster:
    id:     c40d82d5-3193-457d-a628-a3db67839a37
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent

  services:
    mon: 3 daemons, quorum g,j,l (age 2w)
    mgr: a(active, since 10m)
    osd: 10 osds: 10 up (since 17h), 10 in (since 3w)

  data:
    pools:   4 pools, 193 pgs
    objects: 8.31M objects, 32 TiB
    usage:   95 TiB used, 33 TiB / 127 TiB avail
    pgs:     192 active+clean
             1   active+clean+scrubbing+deep+inconsistent+repair

  io:
    client: 682 B/s rd, 6.9 MiB/s wr, 0 op/s rd, 162 op/s wr
and
[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 1.50 is active+clean+scrubbing+deep+inconsistent+repair, acting [5,17,2]
[root@rook-ceph-tools-68958dbb7f-klmcn /]#
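The acting set shows osd.5 is the primary for pg 1.50, so the scrub and repair requests below are handed to that OSD. Not part of the original report, but a common next step is to check the primary OSD's log for scrub/repair messages; in a default Rook install the OSD deployments are named rook-ceph-osd-<id> in the rook-ceph namespace (adjust if yours differ):

kubectl -n rook-ceph logs deploy/rook-ceph-osd-5 | grep -iE 'scrub|repair'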
and
[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph pg deep-scrub 1.50
instructing pg 1.50 on osd.5 to deep-scrub
[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph pg repair 1.50
instructing pg 1.50 on osd.5 to repair
[root@rook-ceph-tools-68958dbb7f-klmcn /]#
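A note for context (not from the thread): ceph pg repair only queues a repair-flavoured deep scrub on the primary, so if scrubbing is currently blocked, for example by the noscrub/nodeep-scrub flags or by the per-OSD scrub limit, the PG can sit in this state indefinitely. A quick sanity check, assuming a reasonably recent Ceph release:

ceph osd dump | grep flags            # any noscrub / nodeep-scrub set?
ceph config get osd osd_max_scrubs    # concurrent scrubs allowed per OSD
ceph pg 1.50 query | grep -i scrub    # scrub timestamps and scrubber state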
and
[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph pg dump_stuck unclean
ok
[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph pg dump_stuck stale
ok
[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph pg dump_stuck inactive
ok
[root@rook-ceph-tools-68958dbb7f-klmcn /]#
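dump_stuck coming back empty is expected here: the PG is active+clean, so it is not stuck in the unclean/stale/inactive sense; it only carries the inconsistent flag. On recent releases the PGs in a given state can be listed directly, which may be a more useful view (a sketch, not something run in this thread):

ceph pg ls inconsistent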
and
[root@rook-ceph-tools-68958dbb7f-klmcn /]# rados list-inconsistent-obj 1.50 --format=json-pretty
{
    "epoch": 667024,
    "inconsistents": []
}
[root@rook-ceph-tools-68958dbb7f-klmcn /]#
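An empty inconsistents array with only an epoch does not necessarily mean the object is fine; it usually means the detailed inconsistency report from the failing scrub is not (or no longer) available, while the PG's error counter still keeps the cluster in HEALTH_ERR. One way to see whether the requested deep scrub ever actually ran is to compare the scrub timestamps in the PG query output (a sketch, not something run in this thread):

ceph pg 1.50 query | grep -E 'last_(deep_)?scrub_stamp|num_scrub_errors'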
Still, it has been more than 3 days and the status is the same:
[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph -s
  cluster:
    id:     c40d82d5-3193-457d-a628-a3db67839a37
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent
and the same PG is still in question:
[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 1.50 is active+clean+scrubbing+deep+inconsistent+repair, acting [5,17,2]
Restarting all the OSDs and the underlying nodes does not help either.
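One workaround that is sometimes suggested for a repair that never makes progress (not confirmed as the fix in this thread) is to quiesce scrubbing, bounce only the primary OSD so any wedged scrub state is dropped, and then re-issue the repair. A sketch, assuming the default rook-ceph namespace and the usual rook-ceph-osd-<id> deployment names:

ceph osd set noscrub
ceph osd set nodeep-scrub
kubectl -n rook-ceph rollout restart deploy/rook-ceph-osd-5   # primary of pg 1.50
ceph osd unset nodeep-scrub   # once osd.5 is back up and in
ceph osd unset noscrub
ceph pg repair 1.50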
Yes, these configurations are global. You can set the debug level for a specific OSD with the following command: ceph tell osd.<id> config set debug_osd 20. As @sp98 mentioned, please open an issue in the tracker and we will continue the investigation from there.

@zhucan, we scaled up by adding a few more OSDs and waited for the data rebalancing to complete. It took a few days, but eventually the error was gone. (Then we scaled down by removing the additional OSDs.)
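For anyone taking the same route, the rebalance after adding OSDs can be followed with the standard commands (nothing specific to this issue):

ceph -s            # misplaced/backfilling counts should drain to zero
ceph osd df tree   # per-OSD utilization once the new OSDs are in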
@nirav-chotai not really sure if the behavior you mentioned in the above comment is related to the Health Err you mentioned in the ticket description.