async-mqtt-client: Serious and fatal problem with entire library

This library has a fundamental problem that makes me wonder how it ever worked at all. There is even a comment in the code which led me to the surprising discovery that the code is simply not “fit for purpose”. Of course I have concrete proof and an explanation of that statement which I present below. It also explains why ever since I have started using it, I have noticed “random” periods of disconnection / reconnection loops, so it actually doesn’t work. As soon as I looked “under the hood” I could immediately see why.

This comment alerted me:

// Using a sendbuffer to fix bug setwill on SSL not working

The irony of this is that there is nothing wrong with either setwill or SSL or even the two combined. Certainly there is nothing wrong with LwIP that caused it. The problem is with your code and it is only coincidentally that the setwill issue showed it up. The same problem still exists in 90% of the rest of your library, and it is easy to explain.

The problem

ESPAsyncTCP is a thin wrapper arounf LwIP. LwIP client calls take char* parameters for “strings” e.g. clientID, topic etc. Thus of course, so does ESPAsyncTCP which effectively just “forwards” them to LwIP.

LwIP has many optimisations to improve throughput and performance, the main ones being buffering, queuing and coalescing of data. This is why there are two calls to send data rather than the more usual send found in synchronous aka “blocking” code.

The add function simply tells LwIP to buffer the data. If it doesn’t get any more within a given timeframe, it may then send the data. If gets more adds in rapid succession, it lumps them all together until…

The send function effectively tells LwIP “no more data from me for a while”. At this point LwIP still may not actually put it “on the wire” - it has some complex algorithms - but the main point is that your code has no control over when LwIP finally decides to do that. It could be up to several seconds after you call send and long after your function has exited…invalidating all your temporary variables as it does so.

Any code (not just this egregious example) that sends a temporary stack pointer (aka “local variable” address) data to a delayed function a) is asking for trouble b) has no concept of “asynchronous” c) breaks every rule of programming from chapter 1 of any book on C d) will run only by pure luck, if at all.

ALL the data you send to LwIP via ESPAsyncTCP are temporary stack pointers.

Example

ESPAsyncTCP add prototype takes a char*

size_t add(const char* data, size_t size, uint8_t apiflags=0);//add for sending

It then passes that data * to LwIP

Code such as the next snippet (which is 90% of your library ‘style’) is going to fail. Not “might” or “could” but “absolutely and definitely will” . Your setwill / SSL already proves that, as do the frequent “random” disconnects.

scope / lifetime: stack variables

// @ global scope
AsyncClient*  client=new AsyncClient();
void willCrash(){
  char[] shortlifetime="disaster waiting to happen";
  client->add(shortlifetime);
 client->send(); // at an undefined time posibly seconds into the future
} // char array "shortlifetime just disappeared out-of-scope
// yet you just passed it ot a lower level function = disaster

The ONLY reason this works at all on ESP8266 is that by happy accident and accident alone, the stack hasn’t (yet) been trashed by the time LwIP does its actual send - which could be several seconds after your function has exited and released/destroyed and thus invalidated every pointer you just passed down the chain.

Whether, several seconds later the data bears any resemblance at all to the former values is a function of the number and size of LwIP buffers/queue length, the MCU clock, the latency of the network, the underlying ‘system’ code, interrupts etc etc etc -none of which you have any control over. Change any one of them and your code will crash at some point - as it currently does, regularly.

One of those “unlucky” “random” events just so happens to occur when setwill was used. Whatever you think you did to fix just slightly tweaked that list of uncontrollables above and accidentally hid it - till the next time

Your library is basicially a whole collection of code written excusively in the style of the above snippet. It needs a ground-up rewrite to hold those pointers until the final ACK is received …or make all variables paased to ESPAsyncTCP -> LwIP into globals.

And just in case you still don’t beleive me, take it from the LwIP documentation: image

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 8
  • Comments: 21

Commits related to this issue

Most upvoted comments

I welcome all contributors here to try my alternative library which I have just finished writing which fixes all of these problems - at least on my test system 😃

WITH THE WARNING that is is very very “alpha” and likely to contain “obvious” early bugs. Also, it does not yet handle TLS, nor have I tested it on ESP32 - I’m working on those right now and they will be added soon. I see no reason why it wouldn’t / shouldn’t drop straight in to ESP32 as long as no-one tries to do any fancy RTOS multiple task access and only uses it one one “thread” (I can’t think why anyone would want to do otherwise 😃

ALSO it correctly handles binary payloads (up to the TCP_MSS size (ish)) which this lib does not ( you’d need to understand the difference between a char* and a uint8_t* and between a “string” and a “lump of arbitrary data” to do that. In my opinion, this author does not). Given that ALL MQTT payloads are length-specified binary blocks (!!!) and not strings or “strings” or Strings or sequences of null-terminated printable characters…minor changes are need to the API to correct those exisiting design errors.

Notice how the examples never print the contents of the payload? I have included a hex dumper for that purpose and am currently adding api calls to take bare char* for anyone brave enough to want to use them and who also understand scope / lifetime and C pointers, plus overloads for std:::string and also String for die-hard Arduino fans.

My lib correctly implements session control and QoS2 retransmits after failure, which again, this lib does not.

Once its been alpha’d, I will add TCP large packet reconstruction “behind the scenes” - this lib expects the user to manage packet reassembly when payload > TCP_MSS!! To me, that says a lot about this lib, and none of it is good…I dispute the “lightweight” argument - the current lib has over 1800 lines of code, and although mine has the extras still to be added as I mentioned earlier, I don’t expect it to get much bigger than the 622 lines it currently is… The difference in those two figures alone speaks volumes.

Any help testing this alternative will be greatly welcomed: https://github.com/philbowles/asyncmqtt

I have been working successfully with a slightly modified code of PangolinMQTT for a couple of months now.

It also compiles

  • with the -wall compiler option used by Ardiuno IDE if switched on:

werror 002

  • and with PlatformIO as well, that uses the -wall option by default

Also some improvements for Keep-Alive handling are added.

To make it available for all I just created a fork https://github.com/obrain17/PangolinMQTT and uploaded my changes

Closing this issue. It’s being worked on.

I must admit, the code in this lib is very readable. There are other libs where you can only guess the purpose of variables.

I’ll switch gladly. Then I can ditch my own queueing code.

I don’t agree with all of the issues you mention though (when talking about esp8266 only).

And yes, using strings (whatever type of) as payload seems to confuse a lot of people.