cassandra-reaper: repairs repeatedly postponed and stuck
Reaper v2.0.3, Cassandra v3.11.5.1
Every day I run a repair on a single keyspace, and for the past few weeks the repairs have never finished. The following information table is taken from the Reaper dashboard:
ID | 00000000-0000-0177-0000-000000000000
-- | --
Owner | g
Cause | g
Last event | postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
Start time | March 9, 2020 10:45 AM
End time |
Pause time |
Duration | 22 hours 17 minutes 10 seconds
Segment count | 136
Segment repaired | 67
Intensity | 0.8999999761581421
Repair parallelism | PARALLEL
Incremental repair | false
Repair threads | 1
Nodes |
Datacenters | DC1
Blacklist |
Creation time | March 9, 2020 10:45 AM
Available metrics (can require a full run before appearing) | io.cassandrareaper.service.RepairRunner.repairProgress.mycluster.mkphistory.00000000000000070000000000000000, io.cassandrareaper.service.RepairRunner.segmentsDone.mycluster.mkphistory.00000000000000070000000000000000, io.cassandrareaper.service.RepairRunner.segmentsTotal.mycluster.mkphistory.00000000000000070000000000000000, io.cassandrareaper.service.RepairRunner.millisSinceLastRepair.mycluster.mkphistory.00000000000000070000000000000000
I also noticed the very same message repeated endlessly in the Reaper log:
INFO [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
INFO [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
INFO [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
INFO [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
INFO [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
INFO [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
INFO [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
INFO [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
INFO [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
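To quantify how often a given segment is being postponed, and which host is blocking it, log lines of the shape shown above could be summarized with a short script. This is only a sketch: the function name `count_postponements` is made up for illustration, and the regular expression assumes the message wording is exactly as it appears in this excerpt.

```python
import re
from collections import Counter

# Matches the "postponed repair segment" message as it appears in the excerpt above.
# Assumption: segment IDs are 36-character UUID-style strings.
POSTPONED = re.compile(
    r"postponed repair segment (?P<segment>[0-9a-fA-F-]{36}) "
    r"because one of the hosts \((?P<host>[^)]+)\) was already involved in a repair"
)


def count_postponements(lines):
    """Count postponement messages per (segment ID, blocking host) pair."""
    counts = Counter()
    for line in lines:
        match = POSTPONED.search(line)
        if match:
            counts[(match["segment"], match["host"])] += 1
    return counts


def format_report(counts):
    """Render the counts as text lines, most frequently postponed first."""
    return [
        f"{n} postponements: segment={segment} host={host}"
        for (segment, host), n in counts.most_common()
    ]
```

Passing an open log file (or any iterable of lines) to `count_postponements` and printing `format_report(...)` gives a quick view of whether a single segment and host account for all the postponements, as the excerpt suggests.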
A few weeks ago this repair took just a couple of hours when launched with 4 threads. I tried to decrease the number of threads used by the repair, but the result hasn’t changed and the repair still gets stuck.
I also tried a rolling restart (including a restart of Reaper itself), without success.
Do you have any idea what could cause this behavior?
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 15 (5 by maintainers)
I’ll be looking into this shortly. It does seem that some segments can get stuck in the STARTED state and subsequently block repairs. If that is indeed what’s happening, I think this PR should fix it: https://github.com/thelastpickle/cassandra-reaper/pull/851
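As a rough illustration of the suspected failure mode, the check below flags segments that have been sitting in the STARTED state longer than a chosen threshold. It is a sketch under stated assumptions, not Reaper's own logic: the record shape (`id`, `state`, `start_time` keys), the function name `find_stuck_segments`, and the one-hour default are all hypothetical, and how the segment records are obtained is left open.

```python
from datetime import datetime, timedelta, timezone


def find_stuck_segments(segments, now, max_age=timedelta(hours=1)):
    """Return segments in STARTED state whose start time is older than max_age.

    `segments` is an iterable of dicts with hypothetical keys:
    "id" (str), "state" (str) and "start_time" (timezone-aware datetime or None).
    """
    stuck = []
    for segment in segments:
        if segment["state"] != "STARTED":
            continue
        started = segment["start_time"]
        # A STARTED segment without a start time cannot be aged, so skip it.
        if started is not None and now - started > max_age:
            stuck.append(segment)
    return stuck


# Example with made-up data: one long-running STARTED segment, one finished one.
now = datetime(2020, 3, 10, 9, 0, tzinfo=timezone.utc)
example_segments = [
    {"id": "seg-a", "state": "STARTED", "start_time": now - timedelta(hours=20)},
    {"id": "seg-b", "state": "DONE", "start_time": now - timedelta(hours=21)},
]
stuck_ids = [s["id"] for s in find_stuck_segments(example_segments, now)]
```

A segment that stays in this list across several checks while the log keeps reporting postponements for the same host would be consistent with the explanation above.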