nodemcu-firmware: Error Handling in Lua

#3075 introduced the new luaL_pcallx() to facilitate improved panic handling. I want to use this issue to discuss error handling in more general terms, and to explain why the focus is on panic handling.

Caught Errors. The Lua architecture allows the application to establish an error handler via pcall() and xpcall(), thus creating a protected environment for executing a code hierarchy. In such cases the error is returned to the application, and the application code itself determines how to process or log the error.
Uncaught Errors. Any Lua code executed outside such a protected environment results in a Lua panic, which in standard Lua typically terminates the application. In the case of NodeMCU, this triggers a processor restart.
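
As a minimal sketch of the caught case, using only standard Lua calls: the function `risky` is made up for illustration, and the example assumes `debug.traceback` is available in the build.

```lua
-- Hypothetical function that always raises an error.
local function risky()
  error("sensor read failed")
end

-- Caught: pcall() runs risky() in protected mode and returns a status
-- plus the error message instead of propagating the error.
local ok, err = pcall(risky)
print(ok, err)

-- xpcall() also runs a message handler at the point of the error, so the
-- handler can attach a traceback before the stack unwinds.
local ok2, trace = xpcall(risky, debug.traceback)
print(ok2, trace)

-- Uncaught: calling risky() directly from an unprotected context would
-- instead end up in the Lua panic handler.
```

Here the application decides what to do with `err` and `trace`; nothing reaches the panic handler.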

One of the main issues that catches new developers is that whilst the interactive thread establishes a default protected environment to catch and print errors interactively, any Lua callback runs as a separate execution thread and is therefore not protected; any error there will panic and reboot the processor. The issue that I want to consider here is whether this is the correct behaviour for NodeMCU and, if not, how we can improve it in a way which doesn’t break existing applications.
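
One workaround available to applications today is to wrap each callback body in pcall() before registering it. The sketch below is illustrative only: `protect` and `buggy` are hypothetical names, not part of the firmware.

```lua
-- Hypothetical helper: wrap a callback so that an error inside it is
-- caught and reported rather than propagating to the panic handler.
local function protect(cb)
  return function(...)
    local ok, err = pcall(cb, ...)
    if not ok then
      print("callback error: " .. tostring(err))
    end
    return ok
  end
end

-- A callback with a bug: indexing a nil value raises an error.
local function buggy()
  local t = nil
  return t.field
end

local safe = protect(buggy)
print(safe())  -- reports the error, then prints false

-- On NodeMCU the wrapped function is what would be registered, e.g.
-- tmr.create():alarm(500, tmr.ALARM_SINGLE, protect(buggy))
```

This keeps the device running, but it relies on every callback being wrapped by hand, which is the gap the rest of this issue is about.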

We have three broad mechanisms for reporting errors in the firmware:

1. Print direct to the UART. We adopt this approach during Lua startup because this is the simplest and most robust mechanism, and it avoids dependencies on services that might not yet be running.
2. Print the error using the standard print() function. Prior to SDK 3.0 releases, we always did (1), but this meant that errors generated during node.output() were reported only on the UART. print() now sends output to STDOUT, which is in turn emptied to the UART or to the node.output() reader using the Pipe module, and this enables error reporting in a terminal session to occur through the session.
3. Use a separate panic process manager to process uncaught errors. The problem with (2) during panics is that a simple “print error and restart” will queue the error message to STDOUT and then restart the processor before the printing task has executed. At a minimum, the restart operation should wait until the STDOUT pipe is empty (or a maximum time has elapsed) before restarting the ESP.
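
The "wait until empty or timed out" rule can be sketched in plain Lua. This is only a model of the logic: `pending`, `tick` and `restart` are hypothetical stand-ins for "bytes still queued in the STDOUT pipe", "let the print task run once" and "restart the ESP", and real firmware would do this in C by posting tasks rather than looping.

```lua
-- Sketch only: drain the pipe for at most max_ticks steps, then restart.
local function drain_then_restart(pending, tick, max_ticks, restart)
  local waited = 0
  while pending() > 0 and waited < max_ticks do
    tick()
    waited = waited + 1
  end
  restart()
  return waited
end

-- Simulated pipe holding three queued messages.
local queued = 3
local restarted = false
local waited = drain_then_restart(
  function() return queued end,        -- pending
  function() queued = queued - 1 end,  -- tick
  10,                                  -- max_ticks
  function() restarted = true end)     -- restart
print(waited, restarted)  -- waited is 3, restarted is true
```

The upper bound matters: if the pipe never drains, the restart still happens after `max_ticks` steps.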

In case (3) the default panic process should still print the error, but at least wait until the STDOUT pipe is empty or a maximum time elapsed before restarting the ESP. We should also provide a node.setonerror() call to enable the application to override this default (say to add some form of error logging).
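
A sketch of how an application might use the proposed override, assuming the handler receives the error message as its single argument (an assumption about the proposed call, not a confirmed signature). `log` and `on_error` are hypothetical names.

```lua
-- Hypothetical in-memory error log; a real application might write to
-- flash or send the message over the network instead.
local log = {}

local function on_error(msg)
  log[#log + 1] = tostring(msg)
  print("uncaught error: " .. tostring(msg))
  -- Restart only when running on the device.
  if node and node.restart then node.restart() end
end

-- Register the handler only where the proposed call exists.
if node and node.setonerror then
  node.setonerror(on_error)
end
```

Off-device the registration is skipped, so the handler can still be exercised directly, for example with `on_error("test")`.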

Scope of Work

So I suggest that a good next PR would be to tidy up such error handling:

Review Lua code for compliance with the above (1)-(3) separation.
Modify node.restart() functionality to wait until STDOUT is empty or N sec have elapsed before restarting the ESP. Add an extra delay boolean parameter to enable this mode and document this. Q: Should waiting until errors are flushed be the default?
Add node.setonerror() call and document.
Modify all module CBs to use luaL_pcallx()
~~Modify debug.debug() function to read from STDIN pipe, and output to print() (Maybe a separate PR.)~~

About this issue

State: closed
Created 4 years ago
Comments: 26 (7 by maintainers)

Most upvoted comments

@nwf wrote:

@TerryE: You may as well land the whole bunch in a single PR and we can go from there. I think we are currently obligated to not merge anything to dev so that we can cut master soon.

I will formulate the PR on this basis. Also, since I am making changes to lauxlib and the modules to add the luaL_pcallx() calls, I will also add the luaL_unref2() API as discussed in #1028 and tidy up tmr.c as a worked example.

TerryE on May 13, 2020

OK, that works.

> a={}  collectgarbage();  print(node.heap())
40752
> tmr:create():alarm(500,0,f)
> E:M 32784
E:M 32784
out of memory

 ets Jan  8 2013,rst cause:2, boot mode:(3,6)
...

It’s a feature of Lua that error handlers are not called on “out of memory” conditions, but at least node.setonerror() is still honoured. So if you do the same with it set to print, you have lost the call stack, but you can still try to diagnose what triggered the exhaustion at the interactive prompt.

PS: the above diagnostics were with Lua 5.3, but it’s the same with Lua 5.1.

TerryE on May 14, 2020

@nwf One note whilst I think of it. We don’t want to convert all lua_call() instances, just those (the majority) that are executed via an SDK callback, that is, those where the execution path uses lua_getstate() to acquire the Lua state. The remainder (for example in sjson) are invoked from the Lua application and may already be protected by a pcall(), so we don’t want to turn these into a panic.
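
The application-side case can be illustrated in plain Lua. In this sketch `each` is a hypothetical stand-in for a module routine that is invoked from Lua code (rather than from the SDK) and calls back into the application:

```lua
-- Hypothetical library function that calls back into application code.
local function each(items, fn)
  for i = 1, #items do
    fn(items[i])
  end
end

-- The application has already wrapped the call in pcall(), so an error
-- raised inside the callback comes back here as an ordinary result.
local ok, err = pcall(each, {1, 2, 3}, function(v)
  if v == 2 then error("bad item " .. v) end
end)
print(ok, err)
```

If the inner call were routed to the panic handler instead, the application's own pcall() would never see the error.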

TerryE on May 14, 2020