connectedhomeip: Reliable messaging crashes if a standalone ack is lost
Problem
Consider the following sequence of events:
- Initiator sends a reliable message to responder.
- Responder takes a while to process the message, hits standalone ack timeout, sends standalone ack.
- Standalone ack is lost.
- Responder sends an application-level response. No ack is piggybacked, because the standalone ack was already sent.
- Initiator processes application-level response, sends another application-level reliable message on the same exchange.
This crashes the initiator in ReliableMessageMgr::AddToRetransTable because of this line:
VerifyOrDie(rc != nullptr && !rc->IsOccupied());
In this case, rc->IsOccupied() is true: it’s waiting for an ack to that first message it sent.
Proposed Solution
We have a few options:
- Allow a given
ReliableMessageContextto be waiting for more than one ack. - Don’t clear ack information completely when sending a standalone ack; piggyback that same ack on the next application-level message as well. This will require fixing the behavior we have now where processing of a message that acks something we think does not need acking is completely aborted.
- Other ideas?
@turon @kghost @yufengwangca @pan-apple @msandstedt
Unit test exercising this scenario, which crashes on ToT:
diff --git a/src/messaging/tests/TestReliableMessageProtocol.cpp b/src/messaging/tests/TestReliableMessageProtocol.cpp
index 34d6ef108..ba23cd414 100644
--- a/src/messaging/tests/TestReliableMessageProtocol.cpp
+++ b/src/messaging/tests/TestReliableMessageProtocol.cpp
@@ -1111,6 +1111,135 @@ void CheckMessageAfterClosed(nlTestSuite * inSuite, void * inContext)
NL_TEST_ASSERT(inSuite, rm->TestGetCountRetransTable() == 0);
}
+void CheckLostStandaloneAck(nlTestSuite * inSuite, void * inContext)
+{
+ /**
+ * This tests the following scenario:
+ * 1) A reliable message is sent from initiator to responder.
+ * 2) The responder sends a standalone ack, which is lost.
+ * 3) The responder sends an application-level response.
+ * 4) The initiator sends a reliable response to the app-level response.
+ */
+ TestContext & ctx = *reinterpret_cast<TestContext *>(inContext);
+
+ chip::System::PacketBufferHandle buffer = chip::MessagePacketBuffer::NewWithData(PAYLOAD, sizeof(PAYLOAD));
+ NL_TEST_ASSERT(inSuite, !buffer.IsNull());
+
+ CHIP_ERROR err = CHIP_NO_ERROR;
+
+ MockAppDelegate mockReceiver;
+ err = ctx.GetExchangeManager().RegisterUnsolicitedMessageHandlerForType(Echo::MsgType::EchoRequest, &mockReceiver);
+ NL_TEST_ASSERT(inSuite, err == CHIP_NO_ERROR);
+
+ mockReceiver.mTestSuite = inSuite;
+
+ MockAppDelegate mockSender;
+ ExchangeContext * exchange = ctx.NewExchangeToAlice(&mockSender);
+ NL_TEST_ASSERT(inSuite, exchange != nullptr);
+
+ mockSender.mTestSuite = inSuite;
+
+ ReliableMessageMgr * rm = ctx.GetExchangeManager().GetReliableMessageMgr();
+ NL_TEST_ASSERT(inSuite, rm != nullptr);
+
+ // Ensure the retransmit table is empty right now
+ NL_TEST_ASSERT(inSuite, rm->TestGetCountRetransTable() == 0);
+
+ // We send a message, the other side sends a standalone ack first (which is
+ // lost), then an application response, then we respond to that response.
+ // We need to keep both exchanges alive for that (so we can send the
+ // response from the receiver and so the initial sender exchange can send a
+ // response to that).
+ gLoopback.mSentMessageCount = 0;
+ gLoopback.mNumMessagesToDrop = 0;
+ gLoopback.mDroppedMessageCount = 0;
+ mockReceiver.mRetainExchange = true;
+ mockSender.mRetainExchange = true;
+
+ // And ensure the ack heading back our way is dropped.
+ mockReceiver.mDropAckResponse = true;
+
+ err = exchange->SendMessage(Echo::MsgType::EchoRequest, std::move(buffer), SendFlags(SendMessageFlags::kExpectResponse));
+ NL_TEST_ASSERT(inSuite, err == CHIP_NO_ERROR);
+
+ // Ensure the message was sent.
+ NL_TEST_ASSERT(inSuite, gLoopback.mSentMessageCount == 1);
+ NL_TEST_ASSERT(inSuite, gLoopback.mDroppedMessageCount == 0);
+
+ // And that it was received.
+ NL_TEST_ASSERT(inSuite, mockReceiver.IsOnMessageReceivedCalled);
+
+ // And that we have not gotten any app-level responses or acks so far.
+ NL_TEST_ASSERT(inSuite, !mockSender.IsOnMessageReceivedCalled);
+ NL_TEST_ASSERT(inSuite, rm->TestGetCountRetransTable() == 1);
+
+ ReliableMessageContext * receiverRc = mockReceiver.mExchange->GetReliableMessageContext();
+ // Ack should have been dropped.
+ NL_TEST_ASSERT(inSuite, !receiverRc->IsAckPending());
+
+ // Don't drop any more acks.
+ mockReceiver.mDropAckResponse = false;
+
+ // Now send a message from the other side.
+ buffer = chip::MessagePacketBuffer::NewWithData(PAYLOAD, sizeof(PAYLOAD));
+ NL_TEST_ASSERT(inSuite, !buffer.IsNull());
+
+ mockReceiver.mExchange->SendMessage(Echo::MsgType::EchoResponse, std::move(buffer),
+ SendFlags(SendMessageFlags::kExpectResponse));
+
+ // Ensure the response was sent.
+ NL_TEST_ASSERT(inSuite, gLoopback.mSentMessageCount == 2);
+ NL_TEST_ASSERT(inSuite, gLoopback.mDroppedMessageCount == 0);
+
+ // Ensure that we have received that response and it did not have a
+ // piggyback ack.
+ NL_TEST_ASSERT(inSuite, mockSender.IsOnMessageReceivedCalled);
+ NL_TEST_ASSERT(inSuite, !mockSender.mReceivedPiggybackAck);
+ // We now have our un-acked message still waiting to retransmit and the
+ // message we just received is waiting for an ack.
+ NL_TEST_ASSERT(inSuite, rm->TestGetCountRetransTable() == 2);
+
+ // Reset various state so we can measure things again.
+ mockReceiver.IsOnMessageReceivedCalled = false;
+ mockReceiver.mReceivedPiggybackAck = false;
+ mockSender.IsOnMessageReceivedCalled = false;
+ mockSender.mReceivedPiggybackAck = false;
+
+ // Stop retaining the recipient exchange.
+ mockReceiver.mRetainExchange = false;
+
+ // Now send a new message to the other side.
+ buffer = chip::MessagePacketBuffer::NewWithData(PAYLOAD, sizeof(PAYLOAD));
+ NL_TEST_ASSERT(inSuite, !buffer.IsNull());
+
+ // Claim (falsely) that we expect a response, so exchange stays open and we
+ // can use it to clear out reliable message state later.
+ err = exchange->SendMessage(Echo::MsgType::EchoRequest, std::move(buffer),
+ SendFlags(SendMessageFlags::kExpectResponse));
+ NL_TEST_ASSERT(inSuite, err == CHIP_NO_ERROR);
+
+ // Ensure the message and the standalone ack to it were sent.
+ NL_TEST_ASSERT(inSuite, gLoopback.mSentMessageCount == 4);
+ NL_TEST_ASSERT(inSuite, gLoopback.mDroppedMessageCount == 0);
+
+ // And that it was received.
+ NL_TEST_ASSERT(inSuite, mockReceiver.IsOnMessageReceivedCalled);
+ NL_TEST_ASSERT(inSuite, mockReceiver.mReceivedPiggybackAck);
+
+ // And that receiver has no ack pending.
+ NL_TEST_ASSERT(inSuite, !receiverRc->IsAckPending());
+
+ // And that we are still expecting an ack for the one message an ack was
+ // lost for.
+ NL_TEST_ASSERT(inSuite, rm->TestGetCountRetransTable() == 1);
+
+ // Clear out our one pending retransmission entry.
+ rm->ClearRetransTable(exchange);
+
+ // And we should have no pending work left now.
+ NL_TEST_ASSERT(inSuite, rm->TestGetCountRetransTable() == 0);
+}
+
// Test Suite
/**
@@ -1134,6 +1263,7 @@ const nlTest sTests[] =
NL_TEST_DEF("Test ReliableMessageMgr::CheckSendStandaloneAckMessage", CheckSendStandaloneAckMessage),
NL_TEST_DEF("Test command, response, default response, with receiver closing exchange after sending response", CheckMessageAfterClosed),
NL_TEST_DEF("Test that unencrypted message is dropped if exchange requires encryption", CheckUnencryptedMessageReceiveFailure),
+ NL_TEST_DEF("Test that an application-level response-to-response after a lost standalone ack to the initial message works", CheckLostStandaloneAck),
NL_TEST_SENTINEL()
};
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 16 (16 by maintainers)
Commits related to this issue
- Handle a lost standalone ack correctly. If a standalone ack gets lost and then there is an application-level reply to the message that triggered the standalone ack, we want to piggyback an ack on tha... — committed to bzbarsky-apple/connectedhomeip by bzbarsky-apple 3 years ago
- Handle a lost standalone ack correctly. If a standalone ack gets lost and then there is an application-level reply to the message that triggered the standalone ack, we want to piggyback an ack on tha... — committed to bzbarsky-apple/connectedhomeip by bzbarsky-apple 3 years ago
- Handle a lost standalone ack correctly. (#9895) * Handle a lost standalone ack correctly. If a standalone ack gets lost and then there is an application-level reply to the message that triggered ... — committed to project-chip/connectedhomeip by bzbarsky-apple 3 years ago
I’ll do the second bullet thing.