openthread: MbedTls failure during commissioning on slow platforms.

We’re seeing issues when running the OpenThread stack on an older platform, where the crypto computations are done on the CPU and are not fast enough to avoid DTLS timeouts on the other side.

The issues are similar to, but not the same as, what is reported in: https://github.com/openthread/openthread/pull/1207

In our case the code (including the fix “0001-Fix-commissioning-problem-with-retransmissions.patch”) behaves as follows:

Note that in my setup the Joiner is the slow platform which takes more than 8 seconds to perform it’s computations, whereas the Commissioner is on a faster platform (therefor there are no repeats of the ‘Client Hello’ messages).

   Joiner                               Commissioner
(1)  |              Client Hello              |
     |--------------------------------------->|
     |                                        |
(2)  |          Hello Verify Request          |
     |<---------------------------------------|
     |                                        |
(3)  |         Client Hello + cookie          |
     |--------------------------------------->|
     |                                        |
(4)  |              Server Hello              |
     |          Server Key Exchange           |
     |           Server Hello Done            |
     |<---------------------------------------|
     |                                        | |
(5)  |              Server Hello              | |  DTLS timeout
     |          Server Key Exchange           | |
     |           Server Hello Done            | v
     |<---------------------------------------| 
     |                                        |
(6)  |          Client Key Exchange           |
     |           Change Cipher Spec           |
     |      Encrypted Handshake Message       |
     |        Client Key Exchange (DUP)       |
     |        Change Cipher Spec (DUP)        |
     |   Encrypted Handshake Message (DUP)    |
     |--------------------------------------->|
     |                                        |
(7)  |           Change Cipher Spec           |
     |      Encrypted Handshake Message       |
     |<---------------------------------------|
     |                                        |
(8)  |            Application Data            |
     |--------------------------------------->|
     |                                        |
(9)  |        Change Cipher Spec (DUP)        |
     |   Encrypted Handshake Message (DUP)    |
     |         Application Data (DUP)         |
     |<---------------------------------------|
     |                                        |

Because the Joiner takes a long time to compute the data for (6), one sees that on the Commissioner side the 8s timeout gets triggered, and the previous message (4) is retransmitted as (5).

After creating the data for (6) the TLS code notices that there are repeat messages, and decides (as per spec) to reply to both (4) and (5). In our case both replies end up in the same message (6) !

Now, the Commissioner replies with (7), but only removes the first half of (6) out of its rx queue.

When the Commissioner receives (8), it notices the remaining part of (6) is still in its rx queue, and replies to this first with the “Change Cipher Spec (DUP)”. Additionally, it also replies to the recently received Application Data (8). Both replies end up in (9).

At this point, the Joiner code has moved to a state where it’s looking for “Application Data”, but there is unexpected data (Change Cipher Spec (DUP)) in front of it. The current DTLS code doesn’t handle this situation: it sees an incorrect record and discards the complete UDP packet (and not only the offending record).

That’s the current situation. It is clear that the current patch does only fixes a single variation of the problem (again, see PR1207), and not the bigger problem of the DTLS state machine ending up in an incorrect state.

There are multiple paths to fix the underlying problem:

one could adapt the code to also handle the “Change Cipher Spec” by trplying to it (I’ve tried it and it works). But this requires again multiple fixes, and I don’t think this is the correct way forward: in fact it’s like handling multiple parallel handshakes. The sniffer logs end up looking very messy with all those (unnecessary) retransmission. Also, it spins out of control, with multiple retransmissions on both sides when two older platforms start a DTLS session between one another.
we should reconsider sending the initial retransmit of the “Client Key Exchange”. The spec clearly states it should be done:

"2. The implementation reads a retransmitted flight from the peer: the
  implementation transitions to the SENDING state, where it
  retransmits the flight, resets the retransmit timer, and returns
  to the WAITING state.  The rationale here is that the receipt of a
  duplicate message is the likely result of timer expiry on the peer
  and therefore suggests that part of one's previous flight was
  lost."

But, it also mentions that the reasoning behind this rule is to retransmit a flight that was previously lost. In the outlined case there is no previous flight. When one runs a slow platform, it’s obvious that the retransmits are NOT caused by a lost previous flight, so why send multiple replies in that case ?

The code that triggers this type of retransmit is located in third_party/mbedtls/repo/library/ssl_tls.c:

int mbedtls_ssl_prepare_handshake_record( mbedtls_ssl_context *ssl ) {
...
            if( recv_msg_seq == ssl->handshake->in_flight_start_seq - 1 &&
                ssl->in_msg[0] != MBEDTLS_SSL_HS_HELLO_VERIFY_REQUEST )

To still benefit from the speed improvements, it is probably not required to entirely remove this retransmission mechanism. Only remove the retransmissions

@ compile time: only for slow platforms
@ run time: only for the state(s) with known long computational delays.

Note: disabling this code will not create issues in case were a message really gets lost, because there is always the local timer which will timeout when no reply is received. This, in turn will (locally) trigger a retransmit.

About this issue

Original URL
State: closed
Created 7 years ago
Comments: 44 (27 by maintainers)

Most upvoted comments

It would be great to have this next version of mbedtls in OpenThread

thcu-gp on Apr 17, 2018

Update: https://github.com/ARMmbed/mbedtls/pull/946 has been merged and will be included in the next release of Mbed TLS. Thanks for your patience and help!

hanno-becker on Apr 5, 2018

@hubertmis Ok, the merge conflicts will be resolved eventually. Knowing that #1141 together with https://github.com/ARMmbed/mbedtls/commit/52c6dc6 works fine provides enough confidence to go forward with it for now. Thanks again for all the input and the report. Let me know if you encounter any other problems.

hanno-becker on Oct 13, 2017

@hubertmis That’s great news. So let me summarize the outcome of this:

PR https://github.com/ARMmbed/mbedtls/pull/946 fixes that sometimes datagrams are unrightfully dropped when they contain a stale ChangeCipherSpec. That’s what initially triggered this issue.
Patch https://github.com/mpg/mbedtls/commit/2694afbb56639f2135f1dd48b3ef903df6ac7ecd fixes that handshake messages are treated as fatal in ssl_read if MBEDTLS_SSL_RENEGOTIATION is disabled, a bug that makes the scenario in this issue continue to fail even after (1) has been fixed. I will follow up on this and give you a note once a decision for the proper long-term solution has been made and a PR has potentially been opened. Please note that the patch shouldn’t be used in production code as it hasn’t undergone thorough study and testing.
As a general comment, WANT_READ must be expected even when calling mbedtls_ssl_read in response to an incoming message. It must also be expected in response to mbedtls_ssl_write provided renegotiation is used or the initial handshake is potentially still in progress. As the latter doesn’t seem to be the case for your code, at the moment it seems that asserting res != WANT_READ should be fine.

hanno-becker on Oct 12, 2017