nodemcu-firmware: Error Handling in Lua
#3075 introduced the new `luaL_pcallx()` to facilitate improved panic handling. I want to use this issue to discuss error handling in more general terms and to explain why the focus is on panic handling.
- Caught Errors. The Lua architecture allows the application to establish an error handler via `pcall()` and `xpcall()`, thus creating a protected environment for executing a code hierarchy. In such cases the error is returned to the application, and the application code itself determines how to process or log the error.
- Uncaught Errors. Any Lua code executed outside such a protected environment results in a Lua panic, which in standard Lua typically terminates the application. In the case of NodeMCU, this triggers a processor restart.
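The caught case can be illustrated with standard Lua alone. This is a minimal sketch; `risky` and `handler` are names made up for the example.

```lua
-- A function that raises an error for some inputs.
local function risky(n)
  if n < 0 then
    error("negative input: " .. n)
  end
  return n * 2
end

-- pcall returns true plus the results, or false plus the error value.
local ok, result = pcall(risky, 21)
print(ok, result)

local ok2, err = pcall(risky, -1)
print(ok2, err)

-- xpcall also runs a message handler at the point of the error, so the
-- handler can capture a traceback before the stack unwinds.
local function handler(msg)
  return debug.traceback(msg, 2)
end

local ok3, trace = xpcall(function() return risky(-1) end, handler)
print(ok3, trace)
```

In both failing calls the error comes back as a return value, and it is up to the calling code to decide what to do with it.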
One of the main issues that catches new developers out is that, whilst the interactive thread establishes a default protected environment to catch and print errors interactively, any Lua callback runs as a separate execution thread and is therefore not protected; any error here will panic and reboot the processor. The issue that I want to consider here is whether this is the correct behaviour for NodeMCU and, if not, how we can improve it in a way that doesn’t break existing applications.
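Until the firmware changes, an application can guard a callback itself by wrapping it in `pcall()`. The sketch below is plain Lua; `protect` and the `on_error` argument are hypothetical helpers for illustration, not firmware APIs.

```lua
-- protect(cb, on_error) returns a new function that runs cb under pcall.
-- Both names are hypothetical helpers for illustration only.
local function protect(cb, on_error)
  return function(...)
    local ok, err = pcall(cb, ...)
    if not ok then
      on_error(err)
    end
  end
end

local seen = {}
local safe_cb = protect(
  function(value) error("bad value: " .. tostring(value)) end,
  function(err) seen[#seen + 1] = err end
)

-- Calling the wrapped callback records the error instead of propagating it.
safe_cb(7)
print(#seen, seen[1])
```

On a device, the wrapped function would be passed wherever the application registers a callback, so that an error in the callback body reaches `on_error` rather than causing a panic.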
We have three broad mechanisms for reporting errors in the firmware:
- Print direct to the UART. We adopt this approach during Lua startup because it is the simplest and most robust mechanism, and it avoids dependencies on services that might not yet be running.
- Print the error using the standard `print()` function. Prior to SDK 3.0 releases, we always did (1), but this meant that errors generated during `node.output()` were reported only on the UART. `print()` now sends output to `STDOUT`, which is in turn emptied to the UART or the `node.output()` reader using the Pipe module, and this enables error reporting in a terminal session to occur through the session.
- Use a separate panic process manager to process uncaught errors. The problem with (2) during panics is that a simple “print error and restart” will queue the error message to STDOUT and then restart the processor before the printing task has executed. At a minimum, the restart operation should wait until the STDOUT pipe is empty (or a maximum time has elapsed) before restarting the ESP.
In case (3) the default panic process should still print the error, but at least wait until the STDOUT pipe is empty or a maximum time has elapsed before restarting the ESP. We should also provide a `node.setonerror()` call to enable the application to override this default (say, to add some form of error logging).
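The "flush, then restart" rule can be sketched as a small simulation in plain Lua. Everything here is an assumption made for illustration: the `pipe` table stands in for the STDOUT pipe, `drain_step` for one run of the print task, `max_polls` for the maximum wait, and `restart` for the ESP restart. None of these are firmware APIs.

```lua
-- Simulated panic handler: queue the message, then poll until the pipe is
-- empty or max_polls is reached, and only then restart.
local function panic_restart(pipe, msg, max_polls, drain_step, restart)
  pipe[#pipe + 1] = msg          -- queue the error text for output
  local polls = 0
  while #pipe > 0 and polls < max_polls do
    drain_step(pipe)             -- stand-in for letting the print task run
    polls = polls + 1
  end
  restart(#pipe == 0)            -- report whether the pipe was fully flushed
  return polls
end

local printed, flushed = {}, nil
local function drain_step(pipe)
  -- Emit the oldest queued entry, as the print task would.
  printed[#printed + 1] = table.remove(pipe, 1)
end
local function restart(was_flushed)
  flushed = was_flushed
end

local pipe = { "earlier output" }
local polls = panic_restart(pipe, "PANIC: demo error", 10, drain_step, restart)
print(polls, flushed, printed[2])
```

On real hardware the wait would presumably be driven by a timer or task rather than a loop, but the exit condition is the same: restart once the pipe is empty or the limit is reached, whichever comes first.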
Scope of Work
So I suggest that a good next PR would be to tidy up such error handling:
- Review Lua code for compliance with the above (1)-(3) separation.
- Modify `node.restart()` functionality to wait until STDOUT is empty or N sec have elapsed before restarting the ESP. Add an extra `delay` boolean parameter to enable this mode and document this. Q: Should this wait until errors are flushed be the default?
- Add a `node.setonerror()` call and document it.
- Modify all module CBs to use `luaL_pcallx()`.
- Modify the `debug.debug()` function to read from the STDIN pipe and output to `print()`. (Maybe a separate PR.)
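As a rough picture of how an application might use the proposed `node.setonerror()`, the sketch below registers a logging handler. The call is only a proposal in this issue, so its signature (a single function receiving the error message) is an assumption, and `on_uncaught` and `error_log` are hypothetical names. The registration is guarded so the snippet also runs under plain Lua, where no `node` table exists.

```lua
-- Hypothetical application-level handler for uncaught errors.
local error_log = {}

local function on_uncaught(msg)
  error_log[#error_log + 1] = tostring(msg)  -- keep a record for later inspection
  print("uncaught error: " .. tostring(msg))
end

-- Register only if the proposed API is present; plain Lua has no `node` table.
-- The one-argument signature is an assumption, not a documented interface.
if type(node) == "table" and node.setonerror then
  node.setonerror(on_uncaught)
end

-- The handler can be exercised directly without the firmware.
on_uncaught("demo failure")
print(#error_log)
```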
About this issue
- State: closed
- Created 4 years ago
- Comments: 26 (7 by maintainers)
@nwf wrote:
I will formulate the PR on this basis. Also, since I am making changes to lauxlib and the modules to add the `luaL_pcallx()` calls and l, I will also add the `luaL_unref2()` API as discussed here in #1028 and tidy up `tmr.c` as a worked example.
OK, that works.
It’s a feature of Lua that error handlers are not called on “out of memory” conditions, but at least `node.setonerror()` is still honoured, so if you do the same with it set to `print` then you have lost the call stack, but you can still try to diagnose what triggered the exhaustion at the interactive prompt.
PS: the above diagnostics were with Lua 5.3, but it’s the same with Lua 5.1.
@nwf One note whilst I think of it. We don’t want to convert all `lua_call()` instances, just those (the majority) that are executed via an SDK callback, that is, those where the execution path uses `lua_getstate()` to acquire the Lua state. The remainder (for example in `sjson`) are invoked from the Lua application and may already be protected by a `pcall()`, so we don’t want to turn these into a panic.
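The distinction between the two call paths can be modelled in plain Lua. This is only an analogy for what the firmware does in C: `library_call` stands in for a module routine invoked from application code, and `dispatch_event` for a callback dispatched from the SDK. Both names, and the `on_panic` argument, are hypothetical.

```lua
-- Stand-in for a library routine invoked from application code: it calls the
-- user function directly, so errors propagate to whatever pcall the caller set up.
local function library_call(fn, ...)
  return fn(...)
end

-- Stand-in for a callback dispatched from the SDK event loop: there is no
-- application frame above it, so it needs its own protected call.
local function dispatch_event(fn, on_panic, ...)
  local ok, err = pcall(fn, ...)
  if not ok then
    on_panic(err)
  end
  return ok
end

local function failing() error("boom") end

-- Application-invoked path: the application's own pcall catches the error.
local caught, app_err = pcall(library_call, failing)

-- Event path: the dispatcher routes the error to the panic handler.
local panic_msg
local dispatched_ok = dispatch_event(failing, function(e) panic_msg = e end)

print(caught, app_err, dispatched_ok, panic_msg)
```

If `library_call` trapped the error itself and treated it as a panic, the application's own `pcall()` would never see it, which is why only the event-dispatched calls should be converted.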