esp-idf: Modbus unexpected error flow (IDFGH-3829)
Environment
- Development Kit: NA
- Kit version: NA
- Module or chip used: ESP32-WROOM-32D
- IDF version (run git describe --tags to find it): v4.1-rc
- Build System: CMake
- Compiler version: xtensa-esp32-elf-gcc (crosstool-NG esp-2020r2) 8.2.0
- Operating System: Windows
- (Windows only) environment type: MSYS2 mingw32
- Using an IDE?: Yes (eclipse)
- Power Supply: NA
Problem Description
After upgrading to tag v4.1-rc, Modbus communication sporadically encounters errors that modify the software flow. Although this is not confirmed yet, this new error seems to cause desynchronization between the exchanged Modbus frames. Indeed, we’ve encountered multiple “resource release failure” or “Take resource failure” messages that cause software failures in our application. So we tried to investigate by adding more logging, and we found weird behavior in the error management; please read below.
In function eMBMasterPoll, when handling EV_MASTER_EXECUTE: if eException is raised, xMBMasterPortEventPost modifies code execution strangely; cf. the log sample below and the instrumented function eMBMasterPoll in the attachment.
Expected Behavior
The expected log in case of an exception would be the following, based on the instrumented file (this “good” error management behavior has been observed elsewhere in our logs):
MB_PORT_COMMON: ERR 3
MB_PORT_COMMON: ERR 4
MB_PORT_COMMON: error type = 2
MB_PORT_COMMON: error step 1
MB_PORT_COMMON: error step 2
Actual Behavior
The actual log in case of an exception is the following, based on the instrumented file:
(13164) MB_PORT_COMMON: ERR 3
(13186) MB_PORT_COMMON: error type = 1 // Error type gets modified here: otherwise, we should have had type 2
(13198) MB_PORT_COMMON: error step 1
(13206) MB_PORT_COMMON: error step 2
(13385) MB_PORT_COMMON: eMBMasterPoll: Unprocessed event 4
ERR 4 never gets displayed at all, and the unprocessed event makes me think that we jump back somewhere near the function start when doing xMBMasterPortEventPost( EV_MASTER_ERROR_PROCESS ), without displaying ERR 4 or clearing event 4. After checking the code, I would say eMBMasterPoll gets interrupted by xMBMasterRTUTimerExpired between the displays of ERR 3 and ERR 4, and that eMBMasterCurErrorType gets modified to 1 by vMBMasterSetErrorType(EV_ERROR_RECEIVE_DATA) in xMBMasterRTUTimerExpired.
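To make the suspected interleaving concrete, here is a standalone model of the race (a sketch that compiles on its own, not the actual stack code; the enum ordering and function name follow the freemodbus master sources):

```c
#include <stdio.h>

/* Error types ordered as in the freemodbus master sources, so that
 * EV_ERROR_RECEIVE_DATA = 1 and EV_ERROR_EXECUTE_FUNCTION = 2, matching
 * the "error type" values in the logs above. */
typedef enum {
    EV_ERROR_RESPOND_TIMEOUT,   /* 0: slave respond timeout */
    EV_ERROR_RECEIVE_DATA,      /* 1: receive frame data error */
    EV_ERROR_EXECUTE_FUNCTION,  /* 2: execute function error */
} eMBMasterErrorEventType;

/* Shared error-type variable, written from both the poll task and the
 * timer context without any protection. */
static eMBMasterErrorEventType eMBMasterCurErrorType;

static void vMBMasterSetErrorType(eMBMasterErrorEventType eType)
{
    eMBMasterCurErrorType = eType;
}

int main(void)
{
    /* Poll task (eMBMasterPoll): exception raised, "ERR 3" logged, error
     * type set to EXECUTE_FUNCTION (2). */
    vMBMasterSetErrorType(EV_ERROR_EXECUTE_FUNCTION);

    /* Timer expiry (xMBMasterRTUTimerExpired) preempts the poll task
     * before it posts EV_MASTER_ERROR_PROCESS and logs "ERR 4": the
     * shared error type is overwritten with RECEIVE_DATA (1), and a
     * second EV_MASTER_ERROR_PROCESS event is posted, which later shows
     * up as "Unprocessed event 4". */
    vMBMasterSetErrorType(EV_ERROR_RECEIVE_DATA);

    /* The error-process handler then reads type 1 instead of type 2. */
    printf("error type = %d\n", eMBMasterCurErrorType);
    return 0;
}
```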
I don’t know whether this is expected or not, but I suspect a real-time scheduling issue… You will find our sdkconfig attached. In particular, do you have a requirement on the FreeRTOS timer daemon task priority for handling Modbus event groups? So far it has the lowest priority in our project.
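For reference, this is how we check the daemon priority at runtime; a minimal sketch, assuming INCLUDE_xTimerGetTimerDaemonTaskHandle is enabled in the FreeRTOS config (the function and tag names are ours):

```c
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/timers.h"
#include "esp_log.h"

static const char *TAG = "PRIO_CHECK";

/* Log the FreeRTOS timer daemon task priority so it can be compared with
 * the Modbus port task priorities. */
void vLogTimerDaemonPriority(void)
{
    TaskHandle_t xTimerDaemon = xTimerGetTimerDaemonTaskHandle();
    ESP_LOGI(TAG, "timer daemon task priority: %u",
             (unsigned) uxTaskPriorityGet(xTimerDaemon));
}
```

The priority itself comes from CONFIG_FREERTOS_TIMER_TASK_PRIORITY in sdkconfig, which defaults to 1 if I read the Kconfig correctly.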
Steps to reproduce
Our setup is the following:
- board 1 (master) sends periodic requests to board 2 (slave), along the lines of the sketch after this list
- board 1 and board 2 are under an OTA process and can reboot asynchronously (i.e. in the middle of a Modbus request/response, for example).
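For reference, the master side is essentially a periodic polling loop along these lines (a sketch using the esp-modbus controller API; the slave address, command, register range, and period are placeholders, not our production values):

```c
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_err.h"
#include "esp_log.h"
#include "mbcontroller.h"

static const char *TAG = "MB_MASTER_LOOP";

/* Periodic request loop: board 1 polls holding registers of board 2.
 * An OTA reboot on either side can land in the middle of one of these
 * request/response cycles. */
void vMasterPollLoop(void *pvArg)
{
    (void) pvArg;
    uint16_t usHolding[4] = { 0 };
    mb_param_request_t xReq = {
        .slave_addr = 2,        /* board 2 (slave) */
        .command    = 0x03,     /* Read Holding Registers */
        .reg_start  = 0,
        .reg_size   = 4,
    };
    for (;;) {
        esp_err_t xErr = mbc_master_send_request(&xReq, &usHolding[0]);
        if (xErr != ESP_OK) {
            ESP_LOGE(TAG, "request failed: %s", esp_err_to_name(xErr));
        }
        vTaskDelay(pdMS_TO_TICKS(1000));
    }
}
```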
With the previous version, we did not have any issue, but now communication seems to get messed up after multiple OTA cycles. The problem does not appear systematically, though, and unfortunately we can’t log the exchanged Modbus frames. So far we don’t have any idea how to reproduce this easily, but we always end up there after roughly 15-20 OTAs…
What I want to point out is the error message: we’ve never had it before, which is why I suspect something new that could desynchronize the exchange.
Thank you for your help.
Code to reproduce this issue
NA
Debug Logs
Cf. Actual Behavior
Other items if possible
SDKconfig attached as a JPG, since GitHub rejects the file otherwise…
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 32
Commits related to this issue
- bugfix: Espressif workaround proposal for Modbus latency decrease during OTAs Jira: HPC-20 - applied Espressif workaround proposed for Modbus latency reduction during OTA and other blocking process (... — committed to BadrBouaddi/esp-idf by simon-thiebaut 4 years ago
Hi @DaniusKalv and @alisitsyn,
OK, so actually I’ve not created the PR yet, because we still have an issue regarding the 2nd point:
“normal long term running: we have currently 14 devices running for more than a week without troubles. Still to monitor over time, but encouraging.”
The devices were not monitored properly. After a new diagnostic, it appears that we have roughly one communication loss per day with the new stack. This is not the case with the previous stack (we also compare against devices still running the old stack). After analysis, it appears there is one scenario in the new stack where the client is not notified of an error and can get stuck in the request-finish waiting API, which manifests as a “communication freeze”. I patched it last week, and then we restarted the tests. We have a meeting this afternoon to check the test results.
If they are OK, I will add this patch before creating the PR.
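For reference, the patch is essentially a finite timeout around the event-group wait, along these lines (the bit names, timeout value, and function name are illustrative, not the actual controller internals):

```c
#include <stdbool.h>
#include "freertos/FreeRTOS.h"
#include "freertos/event_groups.h"
#include "esp_log.h"

/* Hypothetical names: the real bits and event group live inside the
 * Modbus controller; this only shows the shape of the guard. */
#define MB_EVT_REQ_DONE     (1 << 0)
#define MB_EVT_REQ_ERR      (1 << 1)
#define MB_WAIT_TIMEOUT_MS  3000

static const char *TAG = "MB_GUARD";
static EventGroupHandle_t xReqEventGroup;

/* Wait for completion or error with a finite timeout, so that a missed
 * error notification cannot freeze the caller forever. */
static bool bWaitRequestFinish(void)
{
    EventBits_t uxBits = xEventGroupWaitBits(
            xReqEventGroup,
            MB_EVT_REQ_DONE | MB_EVT_REQ_ERR,
            pdTRUE,     /* clear bits on exit */
            pdFALSE,    /* any single bit completes the wait */
            pdMS_TO_TICKS(MB_WAIT_TIMEOUT_MS));
    if (uxBits == 0) {
        /* Neither bit arrived in time: report the freeze to the caller
         * instead of blocking indefinitely. */
        ESP_LOGE(TAG, "request finish wait timed out");
        return false;
    }
    return (uxBits & MB_EVT_REQ_DONE) != 0;
}
```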
Sorry again for the delay, I’ll keep you updated.
Thank you,
Best regards,
Hello @DaniusKalv and @alisitsyn,
Sorry for the delay, last week was a delivery week…
So, for the test results:
FYI, before the fix, OTA loops were failing after ~15 loops, and devices stopped running after ~3 days.
OK, I will do the PR, but probably tomorrow or on Wednesday.
Thank you.
Best regards,