rook: Ceph PG repair stuck forever

[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph -s
  cluster:
    id:     c40d82d5-3193-457d-a628-a3db67839a37
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent
 
  services:
    mon: 3 daemons, quorum g,j,l (age 2w)
    mgr: a(active, since 10m)
    osd: 10 osds: 10 up (since 17h), 10 in (since 3w)
 
  data:
    pools:   4 pools, 193 pgs
    objects: 8.31M objects, 32 TiB
    usage:   95 TiB used, 33 TiB / 127 TiB avail
    pgs:     192 active+clean
             1   active+clean+scrubbing+deep+inconsistent+repair
 
  io:
    client:   682 B/s rd, 6.9 MiB/s wr, 0 op/s rd, 162 op/s wr

and

[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 1.50 is active+clean+scrubbing+deep+inconsistent+repair, acting [5,17,2]
[root@rook-ceph-tools-68958dbb7f-klmcn /]# 

and

[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph pg deep-scrub 1.50
instructing pg 1.50 on osd.5 to deep-scrub
[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph pg repair 1.50
instructing pg 1.50 on osd.5 to repair
[root@rook-ceph-tools-68958dbb7f-klmcn /]# 
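
As far as I understand, ceph pg repair only asks the primary (osd.5 here) to schedule a deep scrub with repair, so a "stuck" repair can also mean the scrub never gets scheduled at all. A few things that could be worth checking at this point (just a sketch for this PG):

    # no cluster-wide flags like noscrub / nodeep-scrub set?
    ceph osd dump | grep flags
    # which OSD is primary for this PG
    ceph pg map 1.50
    # per-OSD limit on concurrent scrubs
    ceph config get osd osd_max_scrubs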

and

[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph pg dump_stuck unclean
ok
[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph pg dump_stuck stale
ok
[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph pg dump_stuck inactive
ok
[root@rook-ceph-tools-68958dbb7f-klmcn /]# 

and

[root@rook-ceph-tools-68958dbb7f-klmcn /]# rados list-inconsistent-obj 1.50 --format=json-pretty
{
    "epoch": 667024,
    "inconsistents": []
}
[root@rook-ceph-tools-68958dbb7f-klmcn /]# 
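
Interestingly, the inconsistents list comes back empty even though the scrub error is still counted, which (if I read it right) suggests the deep scrub that would refresh this information never completed. The PG's own scrub history can be checked directly (again just a sketch for PG 1.50):

    # full PG state, including last_scrub / last_deep_scrub stamps
    ceph pg 1.50 query
    # or just this PG's row from the PG dump
    ceph pg dump pgs | grep ^1.50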

Still, it’s been more than 3 days and the status is the same.

[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph -s
  cluster:
    id:     c40d82d5-3193-457d-a628-a3db67839a37
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent

and the same PG is still in question

[root@rook-ceph-tools-68958dbb7f-klmcn /]# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 1.50 is active+clean+scrubbing+deep+inconsistent+repair, acting [5,17,2]

Restarting all the OSDs and the underlying nodes does not help either.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 20 (9 by maintainers)

Most upvoted comments

@sp98, and one more thing: is it possible to run a specific OSD with debug_osd = 20? I’m using Rook/Ceph on K8s, so I know about using the rook-config-override ConfigMap, but that applies at the cluster, i.e. global, level, right?

[osd]
        debug_osd = 20

Yes, these configurations are global. You can set the debug level for a specific OSD with the following command: ceph tell osd.<id> config set debug_osd 20. As @sp98 mentioned, please open an issue in the tracker and we will continue the investigation from there.
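
A minimal sequence for turning the debug level up and back down on a single OSD might look like this (a sketch, assuming osd.5 from the health output and a release with the centralized config database, i.e. Nautilus or later):

    # runtime only (lost when the OSD restarts), as suggested above
    ceph tell osd.5 config set debug_osd 20
    # or persist it for just this daemon in the mon config database
    ceph config set osd.5 debug_osd 20
    # check what the daemon is actually running with
    ceph config show osd.5 debug_osd
    # revert when done
    ceph config rm osd.5 debug_osd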

@zhucan, we scaled up by adding a few more OSDs and waited for the data rebalancing to complete; it took a few days, but eventually the error was gone. (Then we scaled back down by removing the additional OSDs.)
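
For anyone trying the same workaround, the rebalance can be watched from the toolbox with the usual status commands (the scale-up itself is done by adding OSDs through the Rook CephCluster spec); just a sketch:

    # watch backfill / rebalance progress after the new OSDs join
    ceph -s
    ceph osd df tree
    # scale back down only once every PG is active+clean again
    ceph pg stat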

@nirav-chotai I’m not really sure if the behavior you mentioned in the above comment is related to the HEALTH_ERR you mentioned in the ticket description.