cassandra-reaper: repairs repeatedly postponed and stuck

Reaper v2.0.3, Cassandra v3.11.5.1

Every day I run a repair on a single keyspace, and for the past few weeks the repairs have never finished. The following information table is taken from the Reaper dashboard:


ID | 00000000-0000-0177-0000-000000000000
-- | --
Owner | g
Cause | g
Last event | postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
Start time | March 9, 2020 10:45 AM
End time |  
Pause time |  
Duration | 22 hours 17 minutes 10 seconds
Segment count | 136
Segment repaired | 67
Intensity | 0.8999999761581421
Repair parallelism | PARALLEL
Incremental repair | false
Repair threads | 1
Nodes |  
Datacenters | DC1
Blacklist |  
Creation time | March 9, 2020 10:45 AM
Available metrics (can require a full run before appearing) | io.cassandrareaper.service.RepairRunner.repairProgress.mycluster.mkphistory.00000000000000070000000000000000, io.cassandrareaper.service.RepairRunner.segmentsDone.mycluster.mkphistory.00000000000000070000000000000000, io.cassandrareaper.service.RepairRunner.segmentsTotal.mycluster.mkphistory.00000000000000070000000000000000, io.cassandrareaper.service.RepairRunner.millisSinceLastRepair.mycluster.mkphistory.00000000000000070000000000000000

I also noticed the very same message repeated endlessly in the Reaper log:

INFO   [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
INFO   [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
INFO   [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
INFO   [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
INFO   [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
INFO   [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
INFO   [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
INFO   [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
INFO   [ mycluster:00000000-0000-0177-0000-000000000000:00000000-0000-c696-0000-000000000000] i.c.s.RepairRunner - postponed repair segment 00000000-0000-c696-0000-000000000000 because one of the hosts (xx.xx.xx.xx) was already involved in a repair
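
To check whether it is always the same segment and host being postponed, the log can be summarized with a small script. This is only a rough sketch: it assumes the message format shown in the lines above, and the function name `summarize_postponed` is made up for illustration (it is not part of Reaper).

```python
import re
from collections import Counter

# Matches the "postponed repair segment" message format quoted in the log above.
POSTPONED = re.compile(
    r"postponed repair segment (?P<segment>\S+) "
    r"because one of the hosts \((?P<host>[^)]+)\) was already involved in a repair"
)


def summarize_postponed(lines):
    """Count how often each (segment, host) pair was postponed.

    `lines` is any iterable of log lines, e.g. an open file handle.
    Hypothetical helper for illustration, not a Reaper API.
    """
    counts = Counter()
    for line in lines:
        match = POSTPONED.search(line)
        if match:
            counts[(match.group("segment"), match.group("host"))] += 1
    return counts
```

Passing an open handle on the Reaper log file to `summarize_postponed` and printing `counts.most_common()` shows which segment and host dominate. If a single pair accounts for nearly all entries, that points to one blocked segment rather than general contention.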

A few weeks ago this repair took just a couple of hours when launched with 4 threads. I tried to decrease the number of threads used by the repair, but the result hasn't changed and the repair is still stuck.

I also tried a rolling restart (I restarted Reaper as well), without success.

Do you have any idea about this behavior?

About this issue

State: closed
Created 4 years ago
Reactions: 1
Comments: 15 (5 by maintainers)

Most upvoted comments

I’ll be looking into this shortly. Seems indeed like some segments could get stuck in STARTED state, subsequently blocking repairs. I think this PR should fix it if indeed it’s what’s happening: https://github.com/thelastpickle/cassandra-reaper/pull/851

adejanovski on Mar 20, 2020
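
If segments stuck in STARTED state are indeed the cause, one way to spot them is to look for segments that have been in that state for an unusually long time. The sketch below works on plain dictionaries. The field names `id`, `state` and `start_time`, the one-hour threshold and the function `find_stuck_segments` are all assumptions for illustration, not Reaper's actual data model or API.

```python
from datetime import datetime, timedelta


def find_stuck_segments(segments, now, max_age=timedelta(hours=1)):
    """Return ids of segments in STARTED state for longer than `max_age`.

    Each segment is a dict with hypothetical keys:
      "id" (str), "state" (str), "start_time" (datetime or None).
    """
    stuck = []
    for segment in segments:
        if segment["state"] != "STARTED":
            continue
        started = segment["start_time"]
        if started is not None and now - started > max_age:
            stuck.append(segment["id"])
    return stuck
```

The threshold should be well above the normal duration of a single segment, otherwise healthy long-running segments would be flagged too.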