openthread: MbedTls failure during commissioning on slow platforms.
We’re seeing issues when running the OpenThread stack on an older platform, where the crypto computations are done on the CPU and are not fast enough to avoid DTLS timeouts on the other side.
The issues are similar to, but not the same as, what is reported in: https://github.com/openthread/openthread/pull/1207
In our case the code (including the fix “0001-Fix-commissioning-problem-with-retransmissions.patch”) behaves as follows:
Note that in my setup the Joiner is the slow platform which takes more than 8 seconds to perform it’s computations, whereas the Commissioner is on a faster platform (therefor there are no repeats of the ‘Client Hello’ messages).
Joiner Commissioner
(1) | Client Hello |
|--------------------------------------->|
| |
(2) | Hello Verify Request |
|<---------------------------------------|
| |
(3) | Client Hello + cookie |
|--------------------------------------->|
| |
(4) | Server Hello |
| Server Key Exchange |
| Server Hello Done |
|<---------------------------------------|
| | |
(5) | Server Hello | | DTLS timeout
| Server Key Exchange | |
| Server Hello Done | v
|<---------------------------------------|
| |
(6) | Client Key Exchange |
| Change Cipher Spec |
| Encrypted Handshake Message |
| Client Key Exchange (DUP) |
| Change Cipher Spec (DUP) |
| Encrypted Handshake Message (DUP) |
|--------------------------------------->|
| |
(7) | Change Cipher Spec |
| Encrypted Handshake Message |
|<---------------------------------------|
| |
(8) | Application Data |
|--------------------------------------->|
| |
(9) | Change Cipher Spec (DUP) |
| Encrypted Handshake Message (DUP) |
| Application Data (DUP) |
|<---------------------------------------|
| |
Because the Joiner takes a long time to compute the data for (6), one sees that on the Commissioner side the 8s timeout gets triggered, and the previous message (4) is retransmitted as (5).
After creating the data for (6) the TLS code notices that there are repeat messages, and decides (as per spec) to reply to both (4) and (5). In our case both replies end up in the same message (6) !
Now, the Commissioner replies with (7), but only removes the first half of (6) out of its rx queue.
When the Commissioner receives (8), it notices the remaining part of (6) is still in its rx queue, and replies to this first with the “Change Cipher Spec (DUP)”. Additionally, it also replies to the recently received Application Data (8). Both replies end up in (9).
At this point, the Joiner code has moved to a state where it’s looking for “Application Data”, but there is unexpected data (Change Cipher Spec (DUP)) in front of it. The current DTLS code doesn’t handle this situation: it sees an incorrect record and discards the complete UDP packet (and not only the offending record).
That’s the current situation. It is clear that the current patch does only fixes a single variation of the problem (again, see PR1207), and not the bigger problem of the DTLS state machine ending up in an incorrect state.
There are multiple paths to fix the underlying problem:
-
one could adapt the code to also handle the “Change Cipher Spec” by trplying to it (I’ve tried it and it works). But this requires again multiple fixes, and I don’t think this is the correct way forward: in fact it’s like handling multiple parallel handshakes. The sniffer logs end up looking very messy with all those (unnecessary) retransmission. Also, it spins out of control, with multiple retransmissions on both sides when two older platforms start a DTLS session between one another.
-
we should reconsider sending the initial retransmit of the “Client Key Exchange”. The spec clearly states it should be done:
"2. The implementation reads a retransmitted flight from the peer: the implementation transitions to the SENDING state, where it retransmits the flight, resets the retransmit timer, and returns to the WAITING state. The rationale here is that the receipt of a duplicate message is the likely result of timer expiry on the peer and therefore suggests that part of one's previous flight was lost."
But, it also mentions that the reasoning behind this rule is to retransmit a flight that was previously lost. In the outlined case there is no previous flight. When one runs a slow platform, it’s obvious that the retransmits are NOT caused by a lost previous flight, so why send multiple replies in that case ?
The code that triggers this type of retransmit is located in third_party/mbedtls/repo/library/ssl_tls.c:
int mbedtls_ssl_prepare_handshake_record( mbedtls_ssl_context *ssl ) {
...
if( recv_msg_seq == ssl->handshake->in_flight_start_seq - 1 &&
ssl->in_msg[0] != MBEDTLS_SSL_HS_HELLO_VERIFY_REQUEST )
To still benefit from the speed improvements, it is probably not required to entirely remove this retransmission mechanism. Only remove the retransmissions
- @ compile time: only for slow platforms
- @ run time: only for the state(s) with known long computational delays.
Note: disabling this code will not create issues in case were a message really gets lost, because there is always the local timer which will timeout when no reply is received. This, in turn will (locally) trigger a retransmit.
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 44 (27 by maintainers)
It would be great to have this next version of mbedtls in OpenThread
Update: https://github.com/ARMmbed/mbedtls/pull/946 has been merged and will be included in the next release of Mbed TLS. Thanks for your patience and help!
@hubertmis Ok, the merge conflicts will be resolved eventually. Knowing that #1141 together with https://github.com/ARMmbed/mbedtls/commit/52c6dc6 works fine provides enough confidence to go forward with it for now. Thanks again for all the input and the report. Let me know if you encounter any other problems.
@hubertmis That’s great news. So let me summarize the outcome of this:
ChangeCipherSpec. That’s what initially triggered this issue.ssl_readifMBEDTLS_SSL_RENEGOTIATIONis disabled, a bug that makes the scenario in this issue continue to fail even after (1) has been fixed. I will follow up on this and give you a note once a decision for the proper long-term solution has been made and a PR has potentially been opened. Please note that the patch shouldn’t be used in production code as it hasn’t undergone thorough study and testing.WANT_READmust be expected even when callingmbedtls_ssl_readin response to an incoming message. It must also be expected in response tombedtls_ssl_writeprovided renegotiation is used or the initial handshake is potentially still in progress. As the latter doesn’t seem to be the case for your code, at the moment it seems that assertingres != WANT_READshould be fine.