foca: BUG! Probe cycle finished without running its full course

I noticed this happening a lot in my new project using foca:

2023-06-26T17:33:00Z app[5918571e9b3e83] waw [info]2023-06-26T17:33:00.162462Z ERROR corro_agent::broadcast: foca: error handling timer: BUG! Probe cycle finished without running its full course
2023-06-26T17:33:00Z app[5918571e9b3e83] waw [info]2023-06-26T17:33:00.162586Z  WARN handle_timer{event=SendIndirectProbe { probed_id: Actor { id: ActorId(0c5dce8c-fd9b-41a1-9b7c-419629aa3e2b), addr: [fdaa:2:4742:a7b:188:5f1f:9d5f:2]:8787, bump: 5213 }, token: 0 }}: foca: SendIndirectProbe: Member not being probed probed_id=Actor { id: ActorId(0c5dce8c-fd9b-41a1-9b7c-419629aa3e2b), addr: [fdaa:2:4742:a7b:188:5f1f:9d5f:2]:8787, bump: 5213 }
2023-06-26T17:34:31Z app[178199dc492228] cdg [info]2023-06-26T17:34:31.239548Z ERROR corro_agent::broadcast: foca: error handling timer: BUG! Probe cycle finished without running its full course
2023-06-26T17:34:31Z app[178199dc492228] cdg [info]2023-06-26T17:34:31.239671Z  WARN handle_timer{event=SendIndirectProbe { probed_id: Actor { id: ActorId(0c5dce8c-fd9b-41a1-9b7c-419629aa3e2b), addr: [fdaa:2:4742:a7b:188:5f1f:9d5f:2]:8787, bump: 5213 }, token: 0 }}: foca: SendIndirectProbe: Member not being probed probed_id=Actor { id: ActorId(0c5dce8c-fd9b-41a1-9b7c-419629aa3e2b), addr: [fdaa:2:4742:a7b:188:5f1f:9d5f:2]:8787, bump: 5213 }

I noticed that on the cdg instance, this had happened a few seconds before:

2023-06-26T17:34:10Z app[178199dc492228] cdg [info]2023-06-26T17:34:10.570726Z  INFO handle_timer{event=ProbeRandomMember(0)}: foca: Member failed probe, will declare it down if it doesn't react member_id=Actor { id: ActorId(23ce24c0-4ddf-4e00-8679-eeb40087af20), addr: [fdaa:2:4742:a7b:d5a7:25d0:75f7:2]:8787, bump: 50696 } timeout=30s
2023-06-26T17:34:13Z app[178199dc492228] cdg [info]2023-06-26T17:34:13.369396Z  WARN handle_data{len=167 header=Header { src: Actor { id: ActorId(c5450e0c-52ac-4c33-9e71-fbc64a1c1f77), addr: [fdaa:2:4742:a7b:18e:91b5:604a:2]:8787, bump: 17046 }, src_incarnation: 0, dst: Actor { id: ActorId(341fe755-52b3-4808-888a-0cb45c1ff189), addr: [fdaa:2:4742:a7b:aebf:9821:21ff:2]:8787, bump: 43456 }, message: ForwardedAck { origin: Actor { id: ActorId(23ce24c0-4ddf-4e00-8679-eeb40087af20), addr: [fdaa:2:4742:a7b:d5a7:25d0:75f7:2]:8787, bump: 50696 }, probe_number: 51 } } num_updates=1}: foca: Unexpected ForwardedAck sender
2023-06-26T17:34:13Z app[178199dc492228] cdg [info]2023-06-26T17:34:13.425022Z  WARN handle_data{len=167 header=Header { src: Actor { id: ActorId(339568fc-e676-48eb-9387-c67e381515f6), addr: [fdaa:2:4742:a7b:144:4161:8932:2]:8787, bump: 54737 }, src_incarnation: 0, dst: Actor { id: ActorId(341fe755-52b3-4808-888a-0cb45c1ff189), addr: [fdaa:2:4742:a7b:aebf:9821:21ff:2]:8787, bump: 43456 }, message: ForwardedAck { origin: Actor { id: ActorId(23ce24c0-4ddf-4e00-8679-eeb40087af20), addr: [fdaa:2:4742:a7b:d5a7:25d0:75f7:2]:8787, bump: 50696 }, probe_number: 51 } } num_updates=1}: foca: Unexpected ForwardedAck sender
2023-06-26T17:34:13Z app[178199dc492228] cdg [info]2023-06-26T17:34:13.470475Z  WARN handle_data{len=167 header=Header { src: Actor { id: ActorId(90501bcc-b720-4327-961f-c2fb7f74b01c), addr: [fdaa:2:4742:a7b:104:f3da:d4ad:2]:8787, bump: 9228 }, src_incarnation: 0, dst: Actor { id: ActorId(341fe755-52b3-4808-888a-0cb45c1ff189), addr: [fdaa:2:4742:a7b:aebf:9821:21ff:2]:8787, bump: 43456 }, message: ForwardedAck { origin: Actor { id: ActorId(23ce24c0-4ddf-4e00-8679-eeb40087af20), addr: [fdaa:2:4742:a7b:d5a7:25d0:75f7:2]:8787, bump: 50696 }, probe_number: 51 } } num_updates=1}: foca: Unexpected ForwardedAck sender

(the cdg node’s actor_id is 341fe755-52b3-4808-888a-0cb45c1ff189)

I’m not sure exactly what is happening, but I assume it’s a bug in foca 😃 I’m using it in a fairly straightforward way.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (9 by maintainers)

Most upvoted comments

I’ve released v0.13.0 with the changes I mentioned on https://github.com/caio/foca/issues/26#issuecomment-1614602079

Now you’ll be able to simply sort_unstable() the Vec of piled-up timer events; the first index will be the first one to process.
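As a rough sketch of what that could look like, assuming you buffer the pending events into a Vec before handling them (the variable names, the identity type, and the helper are illustrative, not foca API):

// `pending` holds the foca::Timer<MyIdentity> values that piled up
// while the loop was busy handling other work.
pending.sort_unstable(); // with v0.13.0, the earliest event sorts to the front
for event in pending {
    // hand each event back to foca in order, exactly as you would for a single one
    handle_one_timer(event); // hypothetical helper wrapping your existing handling
}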

If you’d rather keep your current solution, I’d just make sure to adjust seq to be Instant::now() + duration; the reason it works right now is that foca emits the events in the order they need to be fired, but that’s an implementation detail.
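Concretely, mirroring the per-task timer snippet quoted later in this thread, the only change would be ordering by the deadline rather than by the scheduling time (a sketch; timer, duration and timer_tx are the same names used in that snippet):

tokio::spawn(async move {
    // order by when the timer should fire, not by when it was scheduled
    let seqno = Instant::now() + duration;
    tokio::time::sleep(duration).await;
    timer_tx.send((timer, seqno)).await.ok();
});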

I couldn’t figure out a way to trigger this scenario that didn’t involve messing with the timer, so I’ll assume that was the cause. Closing this ticket; as always, feel free to reopen 😄

Thank you! At a glance it does look like things run normally until suddenly the two timer events arrive out of order. I’ll be digging further to see if I can spot a bug in foca, but here’s my current hand-wavy guess:

You’re using a task-based timer (I do it too; it’s very easy). Even if tokio’s timer guarantees scheduling wakes in order after a lag (say the VM gets suspended for a few seconds and thread time advances), it can’t guarantee that the tasks you spawned will be polled in the order in which they went from pending to ready.

To prevent this from happening, I think we’d need every Timer event to arrive and be handled by a single task (so, likely a timer impl running on top of an OS thread, like what I assume tokio does internally).

Not sure it’s worth the trouble, though: since the first time you encountered this error, we’ve made sure foca can recover from bad sequencing in the probe; so long as things go back to normal, this error popping up under stress is OK.

It is, however, quite annoying. You’d want less noise when dealing with load issues, especially this kind of noise, since it screams BUG at the logs (I’m very proud of this little logic check 😬).

If it were me, since I know nothing about implementing timers reliably, I’d take a few steps to reduce the chances of this happening. I think these can help:

  • Use a single-threaded event loop if you aren’t already (assuming there’s a timer per thread in the runtime)
  • Rudimentary out-of-order delivery mitigation:

You now submit timers with a sequence number (a timestamp is easy, but it could just as well be an atomic counter):

tokio::spawn(async move {
    let seqno = Instant::now(); // important: acquired before the sleep
    tokio::time::sleep(duration).await;
    timer_tx.send((timer, seqno)).await.ok();
});

Then, in the receive loop, you can always drain timer_rx: if there is more than one event in the buffer, you sort them by seqno before submitting.
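A rough sketch of that receive loop, assuming a tokio mpsc channel carrying the (timer, seqno) tuples from the snippet above; the actual submission step is whatever your app already does with a single timer event:

while let Some(first) = timer_rx.recv().await {
    // grab anything else that has already piled up in the channel
    let mut batch = vec![first];
    while let Ok(next) = timer_rx.try_recv() {
        batch.push(next);
    }
    // order by seqno before handing anything to foca
    batch.sort_unstable_by_key(|&(_, seqno)| seqno);
    for (timer, _seqno) in batch {
        submit_to_foca(timer); // placeholder for your existing handling
    }
}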

This way the window of opportunity for a sequencing issue to appear is much smaller than it currently is (it can still happen, though). Notice that the real impact of these mitigations is reducing log noise and avoiding the skip of a single probe round (we could make foca recover and probe immediately too).


I’ll spend some more time staring at the logs and state in the near future to see if I’m missing something; I haven’t ruled out the possibility of a bug in foca. With my ad-hoc local simulation I haven’t triggered anything yet, but I did find a case where I can speed up suspicion refutation, minimizing false positives, so it’s already a win in my book 😃