esp-idf: Modbus unexpected error flow (IDFGH-3829)
Environment
- Development Kit: NA
- Kit version: NA
- Module or chip used: ESP32-WROOM-32D
- IDF version (run git describe --tags to find it): v4.1-rc
- Build System: CMake
- Compiler version: xtensa-esp32-elf-gcc (crosstool-NG esp-2020r2) 8.2.0
- Operating System: Windows
- (Windows only) environment type: MSYS2 mingw32
- Using an IDE?: Yes (eclipse)
- Power Supply: NA
Problem Description
After upgrading to tag v4.1-rc, Modbus communication sporadically encounters errors that modify the software flow. Although this is not confirmed yet, this new error seems to cause desynchronization between the exchanged Modbus frames. Indeed, we’ve encountered multiple “resource release failure” or “Take resource failure” messages that cause software failures in our application. So we tried to investigate by adding more logging, and we found weird behavior in the error management; please read below.
In function eMBMasterPoll, when handling EV_MASTER_EXECUTE: if eException is raised, xMBMasterPortEventPost modifies code execution strangely; cf. the log sample below and the instrumented function eMBMasterPoll in the attachment.
Expected Behavior
The expected log in case of an exception would be the following, based on the instrumented file (this “good” error management behavior has been observed elsewhere in our logs):
MB_PORT_COMMON: ERR 3
MB_PORT_COMMON: ERR 4
MB_PORT_COMMON: error type = 2
MB_PORT_COMMON: error step 1
MB_PORT_COMMON: error step 2
Actual Behavior
The actual log in case of an exception is the following, based on the instrumented file:
(13164) MB_PORT_COMMON: ERR 3
(13186) MB_PORT_COMMON: error type = 1 // Error type gets modified here: otherwise, we should have had type 2
(13198) MB_PORT_COMMON: error step 1
(13206) MB_PORT_COMMON: error step 2
(13385) MB_PORT_COMMON: eMBMasterPoll: Unprocessed event 4
ERR 4 never gets displayed at all, and the unprocessed event makes me think that we jump back somewhere near the function start when doing xMBMasterPortEventPost( EV_MASTER_ERROR_PROCESS ), without displaying ERR 4 or clearing event 4. After checking the code, I would say eMBMasterPoll gets interrupted by xMBMasterRTUTimerExpired between the displays of ERR 3 and ERR 4, and that eMBMasterCurErrorType gets modified to 1 by vMBMasterSetErrorType(EV_ERROR_RECEIVE_DATA) in xMBMasterRTUTimerExpired.
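To make the suspected interleaving concrete, here is a standalone model of the race (a sketch that compiles on its own, not the actual stack code; the enum ordering and function name follow the freemodbus master sources):

```c
#include <stdio.h>

/* Error types ordered as in the freemodbus master sources, so that
 * EV_ERROR_RECEIVE_DATA = 1 and EV_ERROR_EXECUTE_FUNCTION = 2, matching
 * the "error type" values in the logs above. */
typedef enum {
    EV_ERROR_RESPOND_TIMEOUT,   /* 0: slave respond timeout */
    EV_ERROR_RECEIVE_DATA,      /* 1: receive frame data error */
    EV_ERROR_EXECUTE_FUNCTION,  /* 2: execute function error */
} eMBMasterErrorEventType;

/* Shared error-type variable, written from both the poll task and the
 * timer context without any protection. */
static eMBMasterErrorEventType eMBMasterCurErrorType;

static void vMBMasterSetErrorType(eMBMasterErrorEventType eType)
{
    eMBMasterCurErrorType = eType;
}

int main(void)
{
    /* Poll task (eMBMasterPoll): exception raised, "ERR 3" logged, error
     * type set to EXECUTE_FUNCTION (2). */
    vMBMasterSetErrorType(EV_ERROR_EXECUTE_FUNCTION);

    /* Timer expiry (xMBMasterRTUTimerExpired) preempts the poll task
     * before it posts EV_MASTER_ERROR_PROCESS and logs "ERR 4": the
     * shared error type is overwritten with RECEIVE_DATA (1), and a
     * second EV_MASTER_ERROR_PROCESS event is posted, which later shows
     * up as "Unprocessed event 4". */
    vMBMasterSetErrorType(EV_ERROR_RECEIVE_DATA);

    /* The error-process handler then reads type 1 instead of type 2. */
    printf("error type = %d\n", eMBMasterCurErrorType);
    return 0;
}
```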
I don’t know whether this is expected or not, but I suspect a real-time scheduling issue… You will find our sdkconfig attached. In particular, do you have a requirement on the FreeRTOS timer daemon task priority for handling Modbus event groups? So far it has the lowest priority in our project.
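For reference, this is how we check the daemon priority at runtime; a minimal sketch, assuming INCLUDE_xTimerGetTimerDaemonTaskHandle is enabled in the FreeRTOS config (the function and tag names are ours):

```c
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/timers.h"
#include "esp_log.h"

static const char *TAG = "PRIO_CHECK";

/* Log the FreeRTOS timer daemon task priority so it can be compared with
 * the Modbus port task priorities. */
void vLogTimerDaemonPriority(void)
{
    TaskHandle_t xTimerDaemon = xTimerGetTimerDaemonTaskHandle();
    ESP_LOGI(TAG, "timer daemon task priority: %u",
             (unsigned) uxTaskPriorityGet(xTimerDaemon));
}
```

The priority itself comes from CONFIG_FREERTOS_TIMER_TASK_PRIORITY in sdkconfig, which defaults to 1 if I read the Kconfig correctly.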
Steps to reproduce
Our setup is the following:
- board 1 (master) sends periodic requests to board 2 (slave), along the lines of the sketch after this list
- board 1 and board 2 are under an OTA process and can reboot asynchronously (i.e. in the middle of a Modbus request/response, for example).
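For reference, the master side is essentially a periodic polling loop along these lines (a sketch using the esp-modbus controller API; the slave address, command, register range, and period are placeholders, not our production values):

```c
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_err.h"
#include "esp_log.h"
#include "mbcontroller.h"

static const char *TAG = "MB_MASTER_LOOP";

/* Periodic request loop: board 1 polls holding registers of board 2.
 * An OTA reboot on either side can land in the middle of one of these
 * request/response cycles. */
void vMasterPollLoop(void *pvArg)
{
    (void) pvArg;
    uint16_t usHolding[4] = { 0 };
    mb_param_request_t xReq = {
        .slave_addr = 2,        /* board 2 (slave) */
        .command    = 0x03,     /* Read Holding Registers */
        .reg_start  = 0,
        .reg_size   = 4,
    };
    for (;;) {
        esp_err_t xErr = mbc_master_send_request(&xReq, &usHolding[0]);
        if (xErr != ESP_OK) {
            ESP_LOGE(TAG, "request failed: %s", esp_err_to_name(xErr));
        }
        vTaskDelay(pdMS_TO_TICKS(1000));
    }
}
```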
With the previous version, we did not have any issue, but now communication seems to get messed up after multiple OTA cycles. The problem does not appear systematically, though, and unfortunately we can’t log the exchanged Modbus frames. So far we don’t have any idea how to reproduce this easily, but we always end up there after roughly 15-20 OTAs…
What I want to point out is the error message: we’ve never had it before, which is why I suspect something new that could desynchronize the exchange.
Thank you for your help.
Code to reproduce this issue
NA
Debug Logs
Cf. Actual Behavior
Other items if possible
SDKconfig attached as a JPG, since GitHub rejects the file otherwise…
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 32
Commits related to this issue
- bugfix: Espressif workaround proposal for Modbus latency decrease during OTAs Jira: HPC-20 - applied Espressif workaround proposed for Modbus latency reduction during OTA and other blocking process (... — committed to BadrBouaddi/esp-idf by simon-thiebaut 4 years ago
Hi @DaniusKalv and @alisitsyn,
OK, so actually I’ve not created the PR yet, because we still have an issue regarding the 2nd point:
“normal long term running: we have currently 14 devices running for more than a week without troubles. Still to monitor over time, but encouraging.”
The devices were not monitored properly. After a new diagnostic, it appears that we have roughly one communication loss per day with the new stack. This is not the case with the previous stack (we also compare against devices still running the old stack). After analysis, it appears there is one scenario in the new stack where the client is not notified of an error and can get stuck in the request-finish waiting API, which manifests as a “communication freeze”. I patched it last week, and then we restarted the tests. We have a meeting this afternoon to check the test results.
If they are OK, I will add this patch before creating the PR.
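For reference, the patch is essentially a finite timeout around the event-group wait, along these lines (the bit names, timeout value, and function name are illustrative, not the actual controller internals):

```c
#include <stdbool.h>
#include "freertos/FreeRTOS.h"
#include "freertos/event_groups.h"
#include "esp_log.h"

/* Hypothetical names: the real bits and event group live inside the
 * Modbus controller; this only shows the shape of the guard. */
#define MB_EVT_REQ_DONE     (1 << 0)
#define MB_EVT_REQ_ERR      (1 << 1)
#define MB_WAIT_TIMEOUT_MS  3000

static const char *TAG = "MB_GUARD";
static EventGroupHandle_t xReqEventGroup;

/* Wait for completion or error with a finite timeout, so that a missed
 * error notification cannot freeze the caller forever. */
static bool bWaitRequestFinish(void)
{
    EventBits_t uxBits = xEventGroupWaitBits(
            xReqEventGroup,
            MB_EVT_REQ_DONE | MB_EVT_REQ_ERR,
            pdTRUE,     /* clear bits on exit */
            pdFALSE,    /* any single bit completes the wait */
            pdMS_TO_TICKS(MB_WAIT_TIMEOUT_MS));
    if (uxBits == 0) {
        /* Neither bit arrived in time: report the freeze to the caller
         * instead of blocking indefinitely. */
        ESP_LOGE(TAG, "request finish wait timed out");
        return false;
    }
    return (uxBits & MB_EVT_REQ_DONE) != 0;
}
```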
Sorry again for the delay, I’ll keep you updated.
Thank you,
Best regards,
Hello @DaniusKalv and @alisitsyn,
Sorry for the delay, last week was a delivery week…
So, for the test results:
FYI, before the fix, OTA loops were failing after ~15 loops, and devices stopped running after ~3 days.
OK, I will do the PR, but probably tomorrow or on Wednesday.
Thank you.
Best regards,