connectedhomeip: Reliable messaging crashes if a standalone ack is lost

Problem

Consider the following sequence of events:

Initiator sends a reliable message to responder.
Responder takes a while to process the message, hits standalone ack timeout, sends standalone ack.
Standalone ack is lost.
Responder sends an application-level response. No ack is piggybacked, because the standalone ack was already sent.
Initiator processes application-level response, sends another application-level reliable message on the same exchange.

This crashes the initiator in ReliableMessageMgr::AddToRetransTable because of this line:

VerifyOrDie(rc != nullptr && !rc->IsOccupied());

In this case, rc->IsOccupied() is true: it’s waiting for an ack to that first message it sent.

Proposed Solution

We have a few options:

Allow a given ReliableMessageContext to be waiting for more than one ack.
Don’t clear ack information completely when sending a standalone ack; piggyback that same ack on the next application-level message as well. This will require fixing the behavior we have now where processing of a message that acks something we think does not need acking is completely aborted.
Other ideas?

@turon @kghost @yufengwangca @pan-apple @msandstedt

Unit test exercising this scenario, which crashes on ToT:

diff --git a/src/messaging/tests/TestReliableMessageProtocol.cpp b/src/messaging/tests/TestReliableMessageProtocol.cpp
index 34d6ef108..ba23cd414 100644
--- a/src/messaging/tests/TestReliableMessageProtocol.cpp
+++ b/src/messaging/tests/TestReliableMessageProtocol.cpp
@@ -1111,6 +1111,135 @@ void CheckMessageAfterClosed(nlTestSuite * inSuite, void * inContext)
     NL_TEST_ASSERT(inSuite, rm->TestGetCountRetransTable() == 0);
 }
 
+void CheckLostStandaloneAck(nlTestSuite * inSuite, void * inContext)
+{
+    /**
+     * This tests the following scenario:
+     * 1) A reliable message is sent from initiator to responder.
+     * 2) The responder sends a standalone ack, which is lost.
+     * 3) The responder sends an application-level response.
+     * 4) The initiator sends a reliable response to the app-level response.
+     */
+    TestContext & ctx = *reinterpret_cast<TestContext *>(inContext);
+
+    chip::System::PacketBufferHandle buffer = chip::MessagePacketBuffer::NewWithData(PAYLOAD, sizeof(PAYLOAD));
+    NL_TEST_ASSERT(inSuite, !buffer.IsNull());
+
+    CHIP_ERROR err = CHIP_NO_ERROR;
+
+    MockAppDelegate mockReceiver;
+    err = ctx.GetExchangeManager().RegisterUnsolicitedMessageHandlerForType(Echo::MsgType::EchoRequest, &mockReceiver);
+    NL_TEST_ASSERT(inSuite, err == CHIP_NO_ERROR);
+
+    mockReceiver.mTestSuite = inSuite;
+
+    MockAppDelegate mockSender;
+    ExchangeContext * exchange = ctx.NewExchangeToAlice(&mockSender);
+    NL_TEST_ASSERT(inSuite, exchange != nullptr);
+
+    mockSender.mTestSuite = inSuite;
+
+    ReliableMessageMgr * rm = ctx.GetExchangeManager().GetReliableMessageMgr();
+    NL_TEST_ASSERT(inSuite, rm != nullptr);
+
+    // Ensure the retransmit table is empty right now
+    NL_TEST_ASSERT(inSuite, rm->TestGetCountRetransTable() == 0);
+
+    // We send a message, the other side sends a standalone ack first (which is
+    // lost), then an application response, then we respond to that response.
+    // We need to keep both exchanges alive for that (so we can send the
+    // response from the receiver and so the initial sender exchange can send a
+    // response to that).
+    gLoopback.mSentMessageCount    = 0;
+    gLoopback.mNumMessagesToDrop   = 0;
+    gLoopback.mDroppedMessageCount = 0;
+    mockReceiver.mRetainExchange   = true;
+    mockSender.mRetainExchange     = true;
+
+    // And ensure the ack heading back our way is dropped.
+    mockReceiver.mDropAckResponse = true;
+
+    err = exchange->SendMessage(Echo::MsgType::EchoRequest, std::move(buffer), SendFlags(SendMessageFlags::kExpectResponse));
+    NL_TEST_ASSERT(inSuite, err == CHIP_NO_ERROR);
+
+    // Ensure the message was sent.
+    NL_TEST_ASSERT(inSuite, gLoopback.mSentMessageCount == 1);
+    NL_TEST_ASSERT(inSuite, gLoopback.mDroppedMessageCount == 0);
+
+    // And that it was received.
+    NL_TEST_ASSERT(inSuite, mockReceiver.IsOnMessageReceivedCalled);
+
+    // And that we have not gotten any app-level responses or acks so far.
+    NL_TEST_ASSERT(inSuite, !mockSender.IsOnMessageReceivedCalled);
+    NL_TEST_ASSERT(inSuite, rm->TestGetCountRetransTable() == 1);
+
+    ReliableMessageContext * receiverRc = mockReceiver.mExchange->GetReliableMessageContext();
+    // Ack should have been dropped.
+    NL_TEST_ASSERT(inSuite, !receiverRc->IsAckPending());
+
+    // Don't drop any more acks.
+    mockReceiver.mDropAckResponse = false;
+
+    // Now send a message from the other side.
+    buffer = chip::MessagePacketBuffer::NewWithData(PAYLOAD, sizeof(PAYLOAD));
+    NL_TEST_ASSERT(inSuite, !buffer.IsNull());
+
+    mockReceiver.mExchange->SendMessage(Echo::MsgType::EchoResponse, std::move(buffer),
+                                        SendFlags(SendMessageFlags::kExpectResponse));
+
+    // Ensure the response was sent.
+    NL_TEST_ASSERT(inSuite, gLoopback.mSentMessageCount == 2);
+    NL_TEST_ASSERT(inSuite, gLoopback.mDroppedMessageCount == 0);
+
+    // Ensure that we have received that response and it did not have a
+    // piggyback ack.
+    NL_TEST_ASSERT(inSuite, mockSender.IsOnMessageReceivedCalled);
+    NL_TEST_ASSERT(inSuite, !mockSender.mReceivedPiggybackAck);
+    // We now have our un-acked message still waiting to retransmit and the
+    // message we just received is waiting for an ack.
+    NL_TEST_ASSERT(inSuite, rm->TestGetCountRetransTable() == 2);
+
+    // Reset various state so we can measure things again.
+    mockReceiver.IsOnMessageReceivedCalled = false;
+    mockReceiver.mReceivedPiggybackAck     = false;
+    mockSender.IsOnMessageReceivedCalled   = false;
+    mockSender.mReceivedPiggybackAck       = false;
+
+    // Stop retaining the recipient exchange.
+    mockReceiver.mRetainExchange = false;
+
+    // Now send a new message to the other side.
+    buffer = chip::MessagePacketBuffer::NewWithData(PAYLOAD, sizeof(PAYLOAD));
+    NL_TEST_ASSERT(inSuite, !buffer.IsNull());
+
+    // Claim (falsely) that we expect a response, so exchange stays open and we
+    // can use it to clear out reliable message state later.
+    err = exchange->SendMessage(Echo::MsgType::EchoRequest, std::move(buffer),
+                                SendFlags(SendMessageFlags::kExpectResponse));
+    NL_TEST_ASSERT(inSuite, err == CHIP_NO_ERROR);
+
+    // Ensure the message and the standalone ack to it were sent.
+    NL_TEST_ASSERT(inSuite, gLoopback.mSentMessageCount == 4);
+    NL_TEST_ASSERT(inSuite, gLoopback.mDroppedMessageCount == 0);
+
+    // And that it was received.
+    NL_TEST_ASSERT(inSuite, mockReceiver.IsOnMessageReceivedCalled);
+    NL_TEST_ASSERT(inSuite, mockReceiver.mReceivedPiggybackAck);
+
+    // And that receiver has no ack pending.
+    NL_TEST_ASSERT(inSuite, !receiverRc->IsAckPending());
+
+    // And that we are still expecting an ack for the one message an ack was
+    // lost for.
+    NL_TEST_ASSERT(inSuite, rm->TestGetCountRetransTable() == 1);
+
+    // Clear out our one pending retransmission entry.
+    rm->ClearRetransTable(exchange);
+
+    // And we should have no pending work left now.
+    NL_TEST_ASSERT(inSuite, rm->TestGetCountRetransTable() == 0);
+}
+
 // Test Suite
 
 /**
@@ -1134,6 +1263,7 @@ const nlTest sTests[] =
     NL_TEST_DEF("Test ReliableMessageMgr::CheckSendStandaloneAckMessage", CheckSendStandaloneAckMessage),
     NL_TEST_DEF("Test command, response, default response, with receiver closing exchange after sending response", CheckMessageAfterClosed),
     NL_TEST_DEF("Test that unencrypted message is dropped if exchange requires encryption", CheckUnencryptedMessageReceiveFailure),
+    NL_TEST_DEF("Test that an application-level response-to-response after a lost standalone ack to the initial message works", CheckLostStandaloneAck),
 
     NL_TEST_SENTINEL()
 };

About this issue

Original URL
State: closed
Created 3 years ago
Comments: 16 (16 by maintainers)

Commits related to this issue

Handle a lost standalone ack correctly. If a standalone ack gets lost and then there is an application-level reply to the message that triggered the standalone ack, we want to piggyback an ack on tha... — committed to bzbarsky-apple/connectedhomeip by bzbarsky-apple 3 years ago
Handle a lost standalone ack correctly. If a standalone ack gets lost and then there is an application-level reply to the message that triggered the standalone ack, we want to piggyback an ack on tha... — committed to bzbarsky-apple/connectedhomeip by bzbarsky-apple 3 years ago
Handle a lost standalone ack correctly. (#9895) * Handle a lost standalone ack correctly. If a standalone ack gets lost and then there is an application-level reply to the message that triggered ... — committed to project-chip/connectedhomeip by bzbarsky-apple 3 years ago

Most upvoted comments

I’ll do the second bullet thing.

bzbarsky-apple on Sep 18, 2021