pubsubclient: Potential memory corruption
I had random crashes that I could pin to PubSubClient. Further testing made it possible to narrow them down to a specific part of the code. Incoming and outgoing messages use the same buffer. If a publish is in progress while a callback is triggered, an incorrect value can be read from the buffer for the topic length, resulting in an out-of-bounds memmove which causes heap corruption.
uint16_t tl = (this->buffer[llen+1]<<8)+this->buffer[llen+2]; /* topic length in bytes */
memmove(this->buffer+llen+2,this->buffer+llen+3,tl); /* move topic inside buffer 1 byte to front */
Shouldn’t we at least have a simple check here that tl < bufferSize? That should never happen, but it might in the described situation, e.g.:
uint16_t tl = (this->buffer[llen+1]<<8)+this->buffer[llen+2]; /* topic length in bytes */
if(!(tl < bufferSize)) return false;
memmove(this->buffer+llen+2,this->buffer+llen+3,tl); /* move topic inside buffer 1 byte to front */
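One refinement worth considering (just a sketch, using the same member names as above): checking tl alone still lets the move run past the end of the buffer once the llen offset is added, so bounding the whole accessed range is safer:

uint16_t tl = (this->buffer[llen+1]<<8)+this->buffer[llen+2]; /* topic length in bytes */
if ((uint32_t)llen + 3 + tl > this->bufferSize) return false; /* the move would read past the buffer end */
memmove(this->buffer+llen+2,this->buffer+llen+3,tl); /* move topic inside buffer 1 byte to front */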
About this issue
- State: open
- Created 3 years ago
- Reactions: 3
- Comments: 49 (2 by maintainers)
Commits related to this issue
- Version 2.2-beta HW Rev 2: - Split program into multiple tasks, this makes it easier to spot issues - Stale tasks at least should result in a log entry - Include reset reason in human readable format ... — committed to arjenhiemstra/ithowifi by arjenhiemstra 3 years ago
- [MQTT] PubSubClient checks when callback + publish happen simultaneous After discussion with @arjenhiemstra on issue: https://github.com/knolleary/pubsubclient/issues/832 — committed to TD-er/ESPEasy by TD-er 3 years ago
- Resolve issue #832 (Potential memory corruption) The memmove command is executed without any sanity checking on used values causing random reboots when an MQTT package gets processed with an impossib... — committed to arjenhiemstra/pubsubclient by arjenhiemstra 3 years ago
I have been fighting with this issue as well for my ESP32 projects (random crashes and hanging code). Finally I boiled it down to getting extra 0xff bytes in the incoming TCP stream from my WiFiClient from time to time. The problem is on the TCP layer or even below, not with PubSubClient. I successfully reproduced this just with a TCP connection w/o any functionality above, but I have no idea yet where the extra bytes are coming from.
As a consequence, PubSubClient runs into trouble when decoding such packets, especially when it comes to packet lengths. I identified four points where that happens:
Case 1 is already handled by disconnecting the MQTT connection. I guess this is the best way to deal with a TCP stream that is out of sync; after reconnecting, we are back to normal. I modified the PubSubClient code to handle the other 3 cases similarly by disconnecting. And voilà: lots of disconnects, but no more random crashes or hanging code so far! (Tested by hammering the client, just like Arjen did, and by running it in production code for half a year.)
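The change is the same in each spot: when a decoded length cannot possibly be valid, drop the connection instead of trusting it. A minimal sketch of that pattern (illustrative helper name, assuming PubSubClient's private _client and _state members; not the literal code from my PR):

/* Sketch: treat an impossible length as a desynced TCP stream and disconnect */
bool PubSubClient::lengthIsSane(uint32_t claimed, uint32_t maxAllowed) {
    if (claimed > maxAllowed) {
        _client->stop();               /* drop the TCP connection */
        _state = MQTT_CONNECTION_LOST; /* reconnect logic resyncs the stream */
        return false;                  /* caller aborts processing this packet */
    }
    return true;
}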
These issues do not happen with correct packets. However, I think this should be addressed anyhow. Bad guys might even exploit this by intentionally sending malformed packets…
I prepared PR #843 including some debugging code. Anybody willing to crosscheck, maybe @genotix or @TD-er ?
@arjenhiemstra : Funny! Same ideas, same timing. I like that!
The Arduino ESP32 framework uses FreeRTOS, see: https://github.com/espressif/arduino-esp32/blob/8dc70e0add06e75da5a255ba94fe86df98ae5ef1/cores/esp32/main.cpp#L32 where the main loop task is created. If you want to measure other tasks, you can replace the NULL in this statement:
uint32_t memHighLevelMark = uxTaskGetStackHighWaterMark( NULL );
with the task’s handle. For the Arduino task this handle is defined here: https://github.com/espressif/arduino-esp32/blob/8dc70e0add06e75da5a255ba94fe86df98ae5ef1/cores/esp32/main.cpp#L10 So for the Arduino task this becomes:
uint32_t memHighLevelMark = uxTaskGetStackHighWaterMark( loopTaskHandle );
If your lib uses tasks and defines a task handle, you can use this handle to measure that specific task.
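Put together, a minimal sketch (plain ESP32 Arduino core, no extra assumptions) that periodically logs the remaining stack of the Arduino loop task; note that the ESP32 port reports the value in bytes:

#include <Arduino.h>

void setup() {
    Serial.begin(115200);
}

void loop() {
    /* minimum free stack seen so far for the calling task (NULL = this task) */
    uint32_t memHighLevelMark = uxTaskGetStackHighWaterMark( NULL );
    Serial.printf("loop task stack high water mark: %u\n", memHighLevelMark);
    delay(1000);
}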
Sometimes things just start to move along.
The developer of SSLClient just indicated that a fix is incoming; he has found the issue: https://github.com/OPEnSLab-OSU/SSLClient/issues/30#issuecomment-803735657
Aside from this I think I might have found an alternative to the SSLClient mentioned above: https://github.com/govorox/SSLClient which uses mbedTLS instead of BearSSL
Unhandled debug exception
This will usually be followed by a message like:
Debug exception reason: Stack canary watchpoint triggered (task_name)
This error indicates that the application has written past the end of the stack of the task_name task. Note that not every stack overflow is guaranteed to trigger this error. It is possible that the task writes to stack beyond the stack canary location, in which case the watchpoint will not be triggered.
So it seems you have run out of memory.
You should make sure that a callback due to an incoming MQTT message can never occur at the same time a publish is underway. Do not use an ISR for the callback and, as @marcel7234 states, try to do everything from the main loop. If you are using tasks, an option is to protect parts of your code with a mutex lock. Nice recent video from Digi-Key about this subject: https://www.youtube.com/watch?v=I55auRpbiTs
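As a concrete sketch of the mutex option (assuming an ESP32 Arduino setup with a global PubSubClient named mqttClient; this locking is application-level, the library does not do it for you):

#include <Arduino.h>
#include <PubSubClient.h>

extern PubSubClient mqttClient;   /* assumed to be configured elsewhere */
SemaphoreHandle_t mqttMutex;      /* guards every PubSubClient call */

void setup() {
    mqttMutex = xSemaphoreCreateMutex();
}

/* any task that publishes takes the lock first... */
void publishFromTask(const char* topic, const char* payload) {
    if (xSemaphoreTake(mqttMutex, pdMS_TO_TICKS(100)) == pdTRUE) {
        mqttClient.publish(topic, payload);
        xSemaphoreGive(mqttMutex);
    }
}

/* ...and the main loop holds the same lock around loop(), where callbacks fire */
void loop() {
    if (xSemaphoreTake(mqttMutex, pdMS_TO_TICKS(100)) == pdTRUE) {
        mqttClient.loop();        /* incoming messages and the callback run in here */
        xSemaphoreGive(mqttMutex);
    }
}

This way a publish and a callback can never touch the shared buffer at the same time, which is exactly the overlap that corrupts it.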
Sounds great @marcel7234!!! I will try to test this tonight (if my wife allows)!
If we get this stuff stable we should do an online beer session; my treat! I have been struggling with these random errors for over 2 weeks now.
@arjenhiemstra happy to review a PR with those improvements in given it sounds like it’s been thoroughly tested - far more so than I would be able to do.
A big shoutout to @marcel7234 and @arjenhiemstra's efforts on this topic though! I think we have all been banging our heads against this, but I surely see improvement in this library.
Thank you, guys!
Well, the operating system behind your Arduino sketches is FreeRTOS. Hence you are using it.
Have a look at the documentation of uxTaskGetStackHighWaterMark():
The stack used by a task will grow and shrink as the task executes and interrupts are processed. uxTaskGetStackHighWaterMark() returns the minimum amount of remaining stack space that was available to the task since the task started executing - that is the amount of stack that remained unused when the task stack was at its greatest (deepest) value. This is what is referred to as the stack 'high water mark'.
So you get the full picture of your complete task: the minimum free stack size achieved so far. I guess it is sufficient to focus on your own task for now. Very likely that’s where the problem is.
Perfectly aligned, @arjenhiemstra ! 😃
timing @marcel7234 😉 lol!
That’s the canary again. You should figure out the reason for your stack overflow, that’s highest priority. Functions like uxTaskGetStackHighWaterMark() can be handy to identify a growing stack.
That is to be expected, because the kernel panic indicates you have run out of memory. Stack memory, to be precise. That most probably means the task you are running has no more stack memory left. You can find the current task's stack high water mark with this statement within the task you want to measure:
uint32_t memHighLevelMark = uxTaskGetStackHighWaterMark( NULL );
(and then log it or print it to serial, for example)
Be aware that the Arduino code (loop()) is also running in a task. This task has 8 KB of stack memory by default; if you are using non-dynamically-allocated memory for SSL, a big MQTT buffer, etc., you can run out of stack quite fast.
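To illustrate (a sketch with example sizes, not a recommendation for your exact numbers): keep large objects in global or static storage so they don't live on the 8 KB loop task stack, and let PubSubClient allocate its buffer on the heap via setBufferSize():

#include <WiFi.h>
#include <PubSubClient.h>

WiFiClient net;                      /* globals live in static storage, not on the loop task stack */
PubSubClient mqttClient(net);

void setup() {
    mqttClient.setBufferSize(2048);  /* heap-allocated by the library, off the task stack */
}

void loop() {
    /* avoid: uint8_t payload[4096]; -- a stack-local that big eats half the 8 KB loop stack */
    static uint8_t payload[2048];    /* static keeps it off the stack */
    (void)payload;
    mqttClient.loop();
}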
Currently at 5000+ messages; going strong. I’m hoping the issue I am having is a “Read Buffer Overflow” as stated here.
My focus is currently on that anyway, since Marcel's code seems to respond more stably than the original, as does the fix that @arjenhiemstra made (which already improved the connection A LOT). I really think I'm seeing a huge improvement (at least for my specific case).
Thumbs up and will keep you posted!
@genotix I think it is even more important to note the crash reason as well.
Also, a bit more context may be needed to interpret the crash. For example, if you're synchronizing data among tasks, you may need locking or semaphores to make sure the data is still there while it is being sent.
I am using SSLClient for a TLS 1.2 connection to AWS IoT Core, so this could be related to that; the question is how to figure it out… My connection currently goes through a SIM800L GPRS modem, but I could switch to WiFi.
Just as an addition: I'm transmitting every 1 to 3 seconds and the payload doesn't differ unless the system had to do any kind of reconnect.
My setBufferSize() varies, but it shouldn't go far over 2 KB. (I have a partly Base64 payload, so to be efficient I calculate the max payload size from the post-Base64-processing size; see the sketch below.)
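For reference, the sizing math is just the Base64 expansion formula, 4*ceil(n/3), plus whatever envelope surrounds the blob. A sketch (the overhead constant is my own assumption, not a measured value; MQTT_MAX_HEADER_SIZE comes from PubSubClient.h):

#include <PubSubClient.h>

/* Base64 encodes every 3 input bytes as 4 output bytes, rounded up */
size_t base64Size(size_t rawBytes) {
    return 4 * ((rawBytes + 2) / 3);
}

void sizeMqttBuffer(PubSubClient& client, size_t rawPayloadBytes) {
    const size_t envelopeOverhead = 256;  /* assumed: JSON wrapper, topic, etc. */
    size_t needed = base64Size(rawPayloadBytes) + envelopeOverhead + MQTT_MAX_HEADER_SIZE;
    client.setBufferSize((uint16_t)needed); /* returns false if the allocation fails */
}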
I received a crash after about 2500 messages:
Decoded stack results
Very nice analysis and great work @marcel7234!! I’ll try to test the code as well!
@genotix great this helped you as well!
With this check the device doesn't corrupt the memory anymore, but it doesn't fix the underlying issue (it can still cause timeouts). With the help of TD-er and a lot of testing, the loop code was improved even further. Have a look at: https://github.com/arjenhiemstra/PubSubClientStatic
Tested at a rate of about 50,000 MQTT messages per hour, and it is rock-solid stable now!