pubsubclient: Potential memory corruption
I had random crashes that I could pin to PubSubClient. Further testing made it possible to narrow them down to a specific part of the code. Incoming and outgoing messages use the same buffer. If a publish is in progress while a callback is triggered, an incorrect value can be read from the buffer for the topic length, resulting in an out-of-bounds memmove which causes heap corruption.
uint16_t tl = (this->buffer[llen+1]<<8)+this->buffer[llen+2]; /* topic length in bytes */
memmove(this->buffer+llen+2,this->buffer+llen+3,tl); /* move topic inside buffer 1 byte to front */
Shouldn’t we at least have a simple check here that tl < bufferSize? That should never happen, but it might in the described situation, e.g.:
uint16_t tl = (this->buffer[llen+1]<<8)+this->buffer[llen+2]; /* topic length in bytes */
if(!(tl < bufferSize)) return false;
memmove(this->buffer+llen+2,this->buffer+llen+3,tl); /* move topic inside buffer 1 byte to front */
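One refinement worth considering (just a sketch, using the same member names as above): checking tl alone still lets the move run past the end of the buffer once the llen offset is added, so bounding the whole accessed range is safer:

uint16_t tl = (this->buffer[llen+1]<<8)+this->buffer[llen+2]; /* topic length in bytes */
if ((uint32_t)llen + 3 + tl > this->bufferSize) return false; /* the move would read past the buffer end */
memmove(this->buffer+llen+2,this->buffer+llen+3,tl); /* move topic inside buffer 1 byte to front */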
About this issue
- State: open
- Created 3 years ago
- Reactions: 3
- Comments: 49 (2 by maintainers)
Commits related to this issue
- Version 2.2-beta HW Rev 2: - Split program into multiple tasks, this makes it easier to spot issues - Stale tasks at least should result in a log entry - Include reset reason in human readable format ... — committed to arjenhiemstra/ithowifi by arjenhiemstra 3 years ago
- [MQTT] PubSubClient checks when callback + publish happen simultaneous After discussion with @arjenhiemstra on issue: https://github.com/knolleary/pubsubclient/issues/832 — committed to TD-er/ESPEasy by TD-er 3 years ago
- Resolve issue #832 (Potential memory corruption) The memmove command is executed without any sanity checking on used values causing random reboots when an MQTT package gets processed with an impossib... — committed to arjenhiemstra/pubsubclient by arjenhiemstra 3 years ago
I have been fighting with this issue as well for my ESP32 projects (random crashes and hanging code). Finally I boiled it down to getting extra 0xff bytes in the incoming TCP stream from my WiFiClient from time to time. The problem is on the TCP layer or even below, not with PubSubClient. I successfully reproduced this just with a TCP connection w/o any functionality above, but I have no idea yet where the extra bytes are coming from.
As a consequence, PubSubClient runs into trouble when decoding such packets, especially when it comes to packet lengths. I identified four points where that happens:
Case 1 is already handled by disconnecting the MQTT connection. I guess this is the best way to deal with a TCP stream that is out of sync; after reconnecting, we are back to normal. I modified the PubSubClient code to handle the other 3 cases similarly by disconnecting. And voilà: lots of disconnects, but no more random crashes or hanging code so far! (Tested by hammering the client, just like Arjen did, and by running it in production code for half a year.)
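The change is the same in each spot: when a decoded length cannot possibly be valid, drop the connection instead of trusting it. A minimal sketch of that pattern (illustrative helper name, assuming PubSubClient's private _client and _state members; not the literal code from my PR):

/* Sketch: treat an impossible length as a desynced TCP stream and disconnect */
bool PubSubClient::lengthIsSane(uint32_t claimed, uint32_t maxAllowed) {
    if (claimed > maxAllowed) {
        _client->stop();               /* drop the TCP connection */
        _state = MQTT_CONNECTION_LOST; /* reconnect logic resyncs the stream */
        return false;                  /* caller aborts processing this packet */
    }
    return true;
}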
These issues do not happen with correct packets. However, I think this should be addressed anyhow. Bad guys might even exploit this by intentionally sending malformed packets…
I prepared PR #843 including some debugging code. Anybody willing to crosscheck, maybe @genotix or @TD-er ?
@arjenhiemstra : Funny! Same ideas, same timing. I like that!
The Arduino ESP32 framework uses FreeRTOS, see: https://github.com/espressif/arduino-esp32/blob/8dc70e0add06e75da5a255ba94fe86df98ae5ef1/cores/esp32/main.cpp#L32 where the main loop task is created. If you want to measure other tasks, you can replace the NULL in this statement:
uint32_t memHighLevelMark = uxTaskGetStackHighWaterMark( NULL );
with the task’s handle. For the Arduino task this handle is defined here: https://github.com/espressif/arduino-esp32/blob/8dc70e0add06e75da5a255ba94fe86df98ae5ef1/cores/esp32/main.cpp#L10 So for the Arduino task this becomes:
uint32_t memHighLevelMark = uxTaskGetStackHighWaterMark( loopTaskHandle );
If your lib uses tasks and defines a task handle, you can use this handle to measure that specific task.
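Put together, a minimal sketch (plain ESP32 Arduino core, no extra assumptions) that periodically logs the remaining stack of the Arduino loop task; note that the ESP32 port reports the value in bytes:

#include <Arduino.h>

void setup() {
    Serial.begin(115200);
}

void loop() {
    /* minimum free stack seen so far for the calling task (NULL = this task) */
    uint32_t memHighLevelMark = uxTaskGetStackHighWaterMark( NULL );
    Serial.printf("loop task stack high water mark: %u\n", memHighLevelMark);
    delay(1000);
}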
Sometimes things just start to move along.
The developer of SSLClient just indicated that a fix is incoming; he has found the issue: https://github.com/OPEnSLab-OSU/SSLClient/issues/30#issuecomment-803735657
Aside from this I think I might have found an alternative to the SSLClient mentioned above: https://github.com/govorox/SSLClient which uses mbedTLS instead of BearSSL
Unhandled debug exception
This will usually be followed by a message like:
Debug exception reason: Stack canary watchpoint triggered (task_name)
This error indicates that the application has written past the end of the stack of the task_name task. Note that not every stack overflow is guaranteed to trigger this error. It is possible that the task writes to stack beyond the stack canary location, in which case the watchpoint will not be triggered.
So it seems you have run out of memory.
You should make sure that a callback due to an incoming MQTT message can never occur at the same time a publish is underway. Do not use an ISR for the callback and, as @marcel7234 states, try to do everything from the main loop. If you are using tasks, an option is to protect parts of your code with a mutex lock. Nice recent video from Digi-Key about this subject: https://www.youtube.com/watch?v=I55auRpbiTs
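As a concrete sketch of the mutex option (assuming an ESP32 Arduino setup with a global PubSubClient named mqttClient; this locking is application-level, the library does not do it for you):

#include <Arduino.h>
#include <PubSubClient.h>

extern PubSubClient mqttClient;   /* assumed to be configured elsewhere */
SemaphoreHandle_t mqttMutex;      /* guards every PubSubClient call */

void setup() {
    mqttMutex = xSemaphoreCreateMutex();
}

/* any task that publishes takes the lock first... */
void publishFromTask(const char* topic, const char* payload) {
    if (xSemaphoreTake(mqttMutex, pdMS_TO_TICKS(100)) == pdTRUE) {
        mqttClient.publish(topic, payload);
        xSemaphoreGive(mqttMutex);
    }
}

/* ...and the main loop holds the same lock around loop(), where callbacks fire */
void loop() {
    if (xSemaphoreTake(mqttMutex, pdMS_TO_TICKS(100)) == pdTRUE) {
        mqttClient.loop();        /* incoming messages and the callback run in here */
        xSemaphoreGive(mqttMutex);
    }
}

This way a publish and a callback can never touch the shared buffer at the same time, which is exactly the overlap that corrupts it.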
Sounds great @marcel7234!!! I will try to test this tonight (if my wife allows)!
If we get this stuff stable we should do an online beer session; my treat! I have been struggling with these random errors for over 2 weeks now.
@arjenhiemstra happy to review a PR with those improvements in given it sounds like it’s been thoroughly tested - far more so than I would be able to do.
A big shoutout to @marcel7234 and @arjenhiemstra's efforts on this topic though! I think we have all been banging our heads against this, but I surely see improvement in this library.
Thank you, guys!
Well, the operating system behind your Arduino sketches is FreeRTOS. Hence you are using it.
Have a look at the documentation of uxTaskGetStackHighWaterMark():
The stack used by a task will grow and shrink as the task executes and interrupts are processed. uxTaskGetStackHighWaterMark() returns the minimum amount of remaining stack space that was available to the task since the task started executing - that is the amount of stack that remained unused when the task stack was at its greatest (deepest) value. This is what is referred to as the stack 'high water mark'.
So you get the full picture of your complete task: the minimum free stack size achieved so far. I guess it is sufficient to focus on your own task for now. Very likely that’s where the problem is.
Perfectly aligned, @arjenhiemstra ! 😃
timing @marcel7234 😉 lol!
That’s the canary again. You should figure out the reason for your stack overflow, that’s highest priority. Functions like uxTaskGetStackHighWaterMark() can be handy to identify a growing stack.
That is to be expected, because the kernel panic indicates you have run out of memory. Stack memory, to be precise. That most probably means the task you are running has no more stack memory left. You can find the current task's stack high water mark with this statement within the task you want to measure:
uint32_t memHighLevelMark = uxTaskGetStackHighWaterMark( NULL );
(and then log it or print it to serial, for example)
Be aware that the Arduino code (loop()) is also running in a task. This task has 8 KB of stack memory by default; if you are using non-dynamically-allocated memory for SSL, a big MQTT buffer, etc., you can run out of stack quite fast.
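To illustrate (a sketch with example sizes, not a recommendation for your exact numbers): keep large objects in global or static storage so they don't live on the 8 KB loop task stack, and let PubSubClient allocate its buffer on the heap via setBufferSize():

#include <WiFi.h>
#include <PubSubClient.h>

WiFiClient net;                      /* globals live in static storage, not on the loop task stack */
PubSubClient mqttClient(net);

void setup() {
    mqttClient.setBufferSize(2048);  /* heap-allocated by the library, off the task stack */
}

void loop() {
    /* avoid: uint8_t payload[4096]; -- a stack-local that big eats half the 8 KB loop stack */
    static uint8_t payload[2048];    /* static keeps it off the stack */
    (void)payload;
    mqttClient.loop();
}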
Currently at 5000+ messages; going strong. I’m hoping the issue I am having is a “Read Buffer Overflow” as stated here.
My focus is currently on that anyway, since Marcel's code seems to respond more stably than the original, as does the fix that @arjenhiemstra made (which already improved the connection A LOT). I really think I'm seeing a huge improvement (at least for my specific case).
Thumbs up and will keep you posted!
@genotix I think it is even more important to note the crash reason as well.
Also, a bit more context may be needed to interpret the crash. For example, if you're synchronizing data among tasks, you may need locking or semaphores to make sure the data is still there while it is being sent.
I am using SSLClient for a TLS 1.2 connection to AWS IoT Core, so this could be related to that; the question is how to figure it out… My connection currently goes through a SIM800L GPRS modem, but I could switch to WiFi.
Just as an addition: I'm transmitting every 1 to 3 seconds and the payload doesn't differ unless the system had to do any kind of reconnect.
My setBufferSize() varies, but it shouldn't go far over 2 KB. (I have a partly Base64 payload, so to be efficient I calculate the max payload size from the post-Base64-processing size; see the sketch below.)
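For reference, the sizing math is just the Base64 expansion formula, 4*ceil(n/3), plus whatever envelope surrounds the blob. A sketch (the overhead constant is my own assumption, not a measured value; MQTT_MAX_HEADER_SIZE comes from PubSubClient.h):

#include <PubSubClient.h>

/* Base64 encodes every 3 input bytes as 4 output bytes, rounded up */
size_t base64Size(size_t rawBytes) {
    return 4 * ((rawBytes + 2) / 3);
}

void sizeMqttBuffer(PubSubClient& client, size_t rawPayloadBytes) {
    const size_t envelopeOverhead = 256;  /* assumed: JSON wrapper, topic, etc. */
    size_t needed = base64Size(rawPayloadBytes) + envelopeOverhead + MQTT_MAX_HEADER_SIZE;
    client.setBufferSize((uint16_t)needed); /* returns false if the allocation fails */
}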
I received a crash after about 2500 messages:
Decoded stack results
Very nice analysis and great work @marcel7234!! I’ll try to test the code as well!
@genotix great this helped you as well!
With this check the device doesn't corrupt the memory anymore, but it doesn't fix the underlying issue (it can still cause timeouts). With the help of TD-er and a lot of testing, the loop code was improved even further. Have a look at: https://github.com/arjenhiemstra/PubSubClientStatic
Tested at a rate of about 50,000 MQTT messages per hour, and it is rock-solid stable now!