esp-idf: [ESP-MESH] Wifi scan triggers infinite loop with event (IDFGH-8497)
Answers checklist.
- I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
- I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
- I have searched the issue tracker for a similar issue and not found a similar issue.
IDF version.
v4.4.2-381-gc97755c9ae5
Operating System used.
Windows
How did you build your project?
VS Code IDE
If you are using Windows, please specify command line type.
PowerShell
Development Kit.
ESP32S3-WROOM N8R2
Power Supply used.
USB
What is the expected behavior?
The idea is quite simple: when the mesh root loses the AP (the AP goes offline), we set the router of the ESP-MESH network to NULL in order to allow a routerless network. We then recover the AP with periodic scans.
Standard config
mesh_cfg_t cfg = MESH_INIT_CONFIG_DEFAULT();
// mesh ID
memcpy((uint8_t*)&cfg.mesh_id, config_net->mesh_id, MAC_ADDR_LEN);
// router
cfg.channel = config_net->channel;
cfg.allow_channel_switch = true;
cfg.router.ssid_len = strlen((const char*)config_net->ssid);
memcpy((uint8_t*)&cfg.router.ssid, config_net->ssid, cfg.router.ssid_len);
memcpy((uint8_t*)&cfg.router.password, config_net->password, strlen((const char*)config_net->password));
// mesh softAP
ESP_ERROR_CHECK(esp_mesh_set_ap_authmode(self->config_mesh.ap_auth_mode));
cfg.mesh_ap.max_connection = self->config_mesh.ap_max_connections;
cfg.mesh_ap.nonmesh_max_connection = self->config_mesh.ap_nonmesh_max_connections;
memcpy((uint8_t*)&cfg.mesh_ap.password, config_net->mesh_password, strlen((const char*)config_net->mesh_password));
// Actualize the settings
ESP_ERROR_CHECK(esp_mesh_set_config(&cfg));
// Store the router info for hybrid fixed root meshing
memcpy(&self->mesh_router, &cfg.router, sizeof(mesh_router_t));
Handling of the disconnection from the AP. This runs when the device has not reconnected after 30 seconds.
if (esp_mesh_get_self_organized()) {
    esp_mesh_set_self_organized(false, false);
    if (device_to_be_root()) {
        ESP_ERROR_CHECK_WITHOUT_ABORT(esp_mesh_set_type(MESH_ROOT));
        mesh_router_t no_router = { 0 };
        ESP_ERROR_CHECK_WITHOUT_ABORT(esp_mesh_set_router(&no_router));
        modem_start();
    } else {
        esp_mesh_fix_root(true);
        ESP_ERROR_CHECK_WITHOUT_ABORT(esp_mesh_set_type(MESH_IDLE));
    }
    // Force a scan immediately
    ESP_LOGW(TAG, "Checking wifi connectivity");
    // We can do the scan only when the modem is sustaining the mesh connectivity
    esp_mesh_disconnect();
    esp_wifi_scan_stop();
    // Remove any scan results that have been obtained (just in case)
    esp_mesh_flush_scan_result();
    wifi_scan_config_t scan_config = { 0 };
    scan_config.show_hidden = 1;
    scan_config.scan_type = WIFI_SCAN_TYPE_PASSIVE;
    esp_wifi_scan_start(&scan_config, false);
}
What is the actual behavior?
Whenever the router is changed after the AP disconnects, the mesh event loop starts emitting an endless stream of:
W (12:41:31.450) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:31.577) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:31.703) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:31.832) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:31.958) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:32.084) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:32.210) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:32.337) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:32.462) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:32.589) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:32.717) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:32.843) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:32.970) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:33.096) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:33.222) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:33.349) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:33.475) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:33.603) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:33.730) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:33.856) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:33.982) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
Please note that this is not related to setting the router to NULL.
I would have expected the parent feedback to stop since there is no router defined and the ROOT is fixed.
Steps to reproduce.
- Use hotspot and connect your device to it
- Switch off the hotspot
Debug Logs.
I (12:56:23.443) mqtt_app: sent publish returned msg_id=62677
I (12:56:23.723) mqtt_app: MQTT_EVENT_PUBLISHED, msg_id=62677
------------- AP SWITCHED OFF HERE
I (637633) wifi:state: run -> init (1a0)
I (637634) wifi:pm stop, total sleep time: 0 us / 70278290 us
W (637635) wifi:<ba-del>idx
I (637635) wifi:new:<11,0>, old:<11,2>, ap:<11,2>, sta:<11,0>, prof:11
W (12:56:57.673) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 1 WIFI_REASON_UNSPECIFIED
I (12:56:57.677) mesh_netif: Clearing interface <mesh_ap>
E (12:56:57.684) TRANSPORT_BASE: poll_read select error 113, errno = Software caused connection abort, fd = 62
W (12:56:57.691) mesh_main: Start checking for FIXED ROOT handover
E (12:56:57.702) MQTT_CLIENT: Poll read error: 119, aborting connection
I (12:56:57.709) mqtt_app: MQTT_EVENT_DISCONNECTED
W (12:57:01.258) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:04.828) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
E (645775) wifi:AP has neither DSSS parameter nor HT Information, drop it
I (12:57:07.714) mqtt_app: MQTT_EVENT_BEFORE_CONNECT
W (12:57:08.400) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:11.972) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:15.543) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:19.114) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:22.686) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
I (12:57:23.433) processor: send status... (60000ms)
W (12:57:26.275) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
------------- ROUTER SET TO NULL HERE
W (12:57:27.701) mesh_main: Triggering FIXED ROOT handover
W (12:57:27.917) mesh_main: Checking wifi connectivity
W (12:57:27.919) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 106 ERROR
I (12:57:27.929) mesh_main: <MESH_EVENT_SCAN_DONE>number:0
W (12:57:28.051) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:28.179) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:28.305) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:28.431) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:28.558) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:28.687) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:28.813) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:28.939) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:29.066) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:29.193) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:29.318) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:29.444) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:29.573) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:29.698) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:29.824) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:29.950) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:30.078) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:30.204) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
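In the logs above, the PARENT_DISCONNECTED events arrive roughly 3.57 s apart before the router is set to NULL and roughly 126 ms apart afterwards. As a diagnostic aid only, a minimal sketch of a detector that could be fed from the mesh event handler to flag this burst might look like the following. The type and function names (disc_storm_t, disc_storm_init, disc_storm_feed) are hypothetical and not part of ESP-IDF, and the thresholds are assumptions chosen to separate the two rates seen in this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an ESP-IDF API): tracks how many consecutive
 * events arrived closer together than min_gap_ms. */
typedef struct {
    uint32_t last_ms;     /* timestamp of the previous event */
    uint32_t fast_count;  /* consecutive gaps shorter than min_gap_ms */
    uint32_t min_gap_ms;  /* gaps below this are treated as "fast" */
    uint32_t trip_count;  /* fast gaps needed before reporting a storm */
    bool     has_last;    /* false until the first event has been seen */
} disc_storm_t;

void disc_storm_init(disc_storm_t *s, uint32_t min_gap_ms, uint32_t trip_count)
{
    assert(s != NULL);
    s->last_ms = 0;
    s->fast_count = 0;
    s->min_gap_ms = min_gap_ms;
    s->trip_count = trip_count;
    s->has_last = false;
}

/* Feed the timestamp (in ms) of one disconnect event.
 * Returns true while trip_count or more consecutive fast gaps were seen. */
bool disc_storm_feed(disc_storm_t *s, uint32_t now_ms)
{
    assert(s != NULL);
    /* Unsigned subtraction stays correct across a timer wrap-around. */
    if (s->has_last && (uint32_t)(now_ms - s->last_ms) < s->min_gap_ms) {
        s->fast_count++;
    } else {
        s->fast_count = 0;  /* a slow gap resets the run */
    }
    s->last_ms = now_ms;
    s->has_last = true;
    return s->fast_count >= s->trip_count;
}
```

With min_gap_ms set to 1000 and trip_count set to 5, the 3.57 s retry cadence never trips the detector, while the 126 ms cadence trips it on the fifth fast gap. Such a flag could be used to rate-limit logging or to capture state for a bug report; it does not address the underlying behaviour.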
More Information.
I will propagate this to Espressif sales because we need high priority in understanding how to solve it.
About this issue
- State: closed
- Created 2 years ago
- Reactions: 2
- Comments: 176 (119 by maintainers)
@mhdong attaching both files with timestamps, modem stuck
device_modem_0302_1.txt device_7440_0302_1.txt
edit: attaching logs also from our other small setup
230302_modem_device.txt 230302_non_modem_device.txt
@zhangyanjiaoesp I managed to reproduce the blocked root device again. This time I had a monitoring system attached to it.
Here I will post an analysis of what is happening; it should help you solve the issue.
this looks like problem number 1
this looks like problem number 2
So let's check the TCPIP task.
AND WE HAVE A DEADLOCK between the MTXON and TCPIP tasks!
Since I cannot look at the MTXON task, I cannot continue and solve the issue.
@zhangyanjiaoesp @mhdong With heap poisoning set to comprehensive it's basically impossible to do a node handover due to either infinite loops or crashes.
I guess this is the main difference, not the latest library.
@KonssnoK Please use the new wifi lib to test whether the stuck issue has been solved, thanks. (wifi firmware version: 789e1fa9) esp32s3_789e1fa9.zip
Hi @KonssnoK We are still debugging the problem and still haven't found the root cause. It looks like the tcpip thread is blocked, but it does not stop in one fixed callback function or timeout callback function. Sometimes it stops at lwip_netconn_do_gethostbyname, sometimes it stops at dns_tmr, sometimes it stops at igmp_tmr. It is probably a mesh API that broke the state of tcpip. We are now analyzing the internal mesh functions.
full_log_modem_device.txt
ok we triggered it also on an N8R8 with the test made by a different colleague, in a different country and different laptop.
It took 3 retries to trigger it + some retries at the beginning to understand what to do
@KonssnoK Please use the new wifi lib to test whether the stuck issue has been solved, thanks. wifi_lib_0227.zip (wifi firmware version: 8b575e5)
@KonssnoK I have checked the stuck logs on the modem device, when this issue happens, the modem device is root node.
And I have checked the mesh code: the MTXON task is a flow control task, and it's only used to send packets upward from a non-root node. When the node is root, the task will remain blocked in the receiving queue.
From the stuck log, I suspect the problem may be in the esp_mesh_send() function. I have added some debug logs; please use the following wifi lib to run a test. wifi_lib_0222.zip (wifi firmware version: 94a3ec8)
By the way, please try calling esp_mesh_send_block_time() to set the blocking time of esp_mesh_send(), and test whether this solves the stuck issue.
@KonssnoK I will continue to debug the MTXON task on my side.
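For reference, the esp_mesh_send_block_time() suggestion above could be tried with something like the sketch below. The wrapper name configure_mesh_send_timeout and the 5000 ms value are illustrative assumptions, not values given in this thread. As far as I recall, the ESP-IDF documentation asks for this call to be made before the mesh is started, so treat the call site as an assumption to verify. The #else branch is a host-only stand-in so the snippet compiles off-target.

```c
#include <stdint.h>

#ifdef ESP_PLATFORM
#include "esp_err.h"
#include "esp_mesh.h"
#else
/* Host-only stand-ins (hypothetical) so the sketch builds without ESP-IDF. */
typedef int esp_err_t;
#define ESP_OK 0
uint32_t stub_block_time_ms;  /* records the last value passed in */
esp_err_t esp_mesh_send_block_time(uint32_t time_ms)
{
    stub_block_time_ms = time_ms;
    return ESP_OK;
}
#endif

/* Illustrative value only: bound how long esp_mesh_send() may block. */
#define MESH_SEND_BLOCK_MS 5000u

/* Hypothetical wrapper: apply the send timeout and report the result. */
esp_err_t configure_mesh_send_timeout(void)
{
    return esp_mesh_send_block_time(MESH_SEND_BLOCK_MS);
}
```

With a finite blocking time, a stuck esp_mesh_send() call should return an error to the caller instead of blocking indefinitely, which would at least show whether the send path is where the root device hangs.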
@KonssnoK I will start my holiday on Thursday 1/19, and I will be looking into this issue until my last day.
Self-organizing networking conflicts with esp_wifi_scan_start(). While the STA is connecting, scans are not allowed! Please disable self-organized before the wifi scan!
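The ordering described above can be isolated as a minimal sketch: turn self-organized networking off, stop any scan in progress, then start the new scan. The snippet in the original report already disables self-organized first; this only makes the order explicit. The function name scan_for_router is hypothetical, and the #else branch contains host-only stand-ins that record the call order so it can be checked without hardware.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#ifdef ESP_PLATFORM
#include "esp_err.h"
#include "esp_mesh.h"
#include "esp_wifi.h"
#else
/* Host-only stand-ins (hypothetical): each call appends one letter. */
typedef int esp_err_t;
#define ESP_OK 0
typedef enum { WIFI_SCAN_TYPE_ACTIVE = 0, WIFI_SCAN_TYPE_PASSIVE } wifi_scan_type_t;
typedef struct { bool show_hidden; wifi_scan_type_t scan_type; } wifi_scan_config_t;
char   stub_calls[8];                /* zero-initialized, stays NUL-terminated */
size_t stub_ncalls;
bool   stub_self_organized = true;
void stub_record(char c)
{
    if (stub_ncalls < sizeof(stub_calls) - 1) stub_calls[stub_ncalls++] = c;
}
esp_err_t esp_mesh_set_self_organized(bool enable, bool select_parent)
{
    (void)select_parent;
    stub_self_organized = enable;
    stub_record('O');
    return ESP_OK;
}
esp_err_t esp_wifi_scan_stop(void) { stub_record('S'); return ESP_OK; }
esp_err_t esp_wifi_scan_start(const wifi_scan_config_t *cfg, bool block)
{
    (void)cfg; (void)block;
    /* 'X' marks a scan started while self-organized was still enabled. */
    stub_record(stub_self_organized ? 'X' : 'B');
    return ESP_OK;
}
#endif

/* Hypothetical helper: start a non-blocking passive scan in the safe order. */
esp_err_t scan_for_router(void)
{
    /* 1. Self-organized networking must be off before scanning. */
    esp_err_t err = esp_mesh_set_self_organized(false, false);
    if (err != ESP_OK) return err;

    /* 2. Stop any scan already in progress (result ignored, best effort). */
    esp_wifi_scan_stop();

    /* 3. Start the scan; results arrive later via the scan-done event. */
    wifi_scan_config_t scan_config;
    memset(&scan_config, 0, sizeof(scan_config));
    scan_config.show_hidden = true;
    scan_config.scan_type = WIFI_SCAN_TYPE_PASSIVE;
    return esp_wifi_scan_start(&scan_config, false);
}
```

Whether self-organized networking should be re-enabled after the scan completes depends on the application (in the report's fixed-root setup it stays off), so that step is deliberately left out.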