esp-idf: [ESP-MESH] Wifi scan triggers infinite loop with event (IDFGH-8497)

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v4.4.2-381-gc97755c9ae5

Operating System used.

Windows

How did you build your project?

VS Code IDE

If you are using Windows, please specify command line type.

PowerShell

Development Kit.

ESP32S3-WROOM N8R2

Power Supply used.

USB

What is the expected behavior?

the idea is quite simple: when the mesh root loses the AP (AP goes offline), we change the router of the esp mesh network to NULL, in order to allow a routerless network. We then recover the AP with periodic scans

Standard config

    mesh_cfg_t cfg = MESH_INIT_CONFIG_DEFAULT();
    // mesh ID
    memcpy((uint8_t*)&cfg.mesh_id, config_net->mesh_id, MAC_ADDR_LEN);
    // router
    cfg.channel = config_net->channel;
    cfg.allow_channel_switch = true;
    cfg.router.ssid_len = strlen((const char*)config_net->ssid);
    memcpy((uint8_t*)&cfg.router.ssid, config_net->ssid, cfg.router.ssid_len);
    memcpy((uint8_t*)&cfg.router.password, config_net->password, strlen((const char*)config_net->password));
    // mesh softAP
    ESP_ERROR_CHECK(esp_mesh_set_ap_authmode(self->config_mesh.ap_auth_mode));
    cfg.mesh_ap.max_connection = self->config_mesh.ap_max_connections;
    cfg.mesh_ap.nonmesh_max_connection = self->config_mesh.ap_nonmesh_max_connections;
    memcpy((uint8_t*)&cfg.mesh_ap.password, config_net->mesh_password, strlen((const char*)config_net->mesh_password));
    // Actualize the settings
    ESP_ERROR_CHECK(esp_mesh_set_config(&cfg));
    // Store the router info for hybrid fixed root meshing
    memcpy(&self->mesh_router, &cfg.router, sizeof(mesh_router_t));

Handling of the disconnection from the AP, run after 30 seconds the device does not reconnect.

    if (esp_mesh_get_self_organized()) {
        esp_mesh_set_self_organized(false, false);

        if (device_to_be_root()) {
            ESP_ERROR_CHECK_WITHOUT_ABORT(esp_mesh_set_type(MESH_ROOT));
            mesh_router_t no_router = { 0 };
            ESP_ERROR_CHECK_WITHOUT_ABORT(esp_mesh_set_router(&no_router));
            modem_start();
        } else {
            esp_mesh_fix_root(true);
            ESP_ERROR_CHECK_WITHOUT_ABORT(esp_mesh_set_type(MESH_IDLE));
        }

        // Force a scan immediately
        ESP_LOGW(TAG, "Checking wifi connectivity");
        // We can do the scan only when the modem is sustaining the mesh connectivity
        esp_mesh_disconnect();
        esp_wifi_scan_stop();
        // Remove any scan results that have been obtained (just in case)
        esp_mesh_flush_scan_result();
        wifi_scan_config_t scan_config = { 0 };
        scan_config.show_hidden = 1;
        scan_config.scan_type = WIFI_SCAN_TYPE_PASSIVE;
        esp_wifi_scan_start(&scan_config, false);
    }

What is the actual behavior?

Whenever the router is changed after disconnection of the AP the mesh event loop starts sending an infinite amount of

W (12:41:31.450) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:31.577) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:31.703) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:31.832) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:31.958) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:32.084) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:32.210) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:32.337) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:32.462) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:32.589) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:32.717) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:32.843) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:32.970) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:33.096) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:33.222) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:33.349) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:33.475) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:33.603) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:33.730) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:33.856) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:41:33.982) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND

Please note that this is not related to setting the router to NULL.

I would have expected the parent feedback to stop since there is no router defined and the ROOT is fixed.

Steps to reproduce.

  1. Use hotspot and connect your device to it
  2. Switch off the hotspot

Debug Logs.

I (12:56:23.443) mqtt_app: sent publish returned msg_id=62677
I (12:56:23.723) mqtt_app: MQTT_EVENT_PUBLISHED, msg_id=62677

------------- AP SWITCHED OFF HERE


I (637633) wifi:state: run -> init (1a0)
I (637634) wifi:pm stop, total sleep time: 0 us / 70278290 us

W (637635) wifi:<ba-del>idx
I (637635) wifi:new:<11,0>, old:<11,2>, ap:<11,2>, sta:<11,0>, prof:11
W (12:56:57.673) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 1 WIFI_REASON_UNSPECIFIED
I (12:56:57.677) mesh_netif: Clearing interface <mesh_ap>
E (12:56:57.684) TRANSPORT_BASE: poll_read select error 113, errno = Software caused connection abort, fd = 62
W (12:56:57.691) mesh_main: Start checking for FIXED ROOT handover
E (12:56:57.702) MQTT_CLIENT: Poll read error: 119, aborting connection
I (12:56:57.709) mqtt_app: MQTT_EVENT_DISCONNECTED

W (12:57:01.258) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:04.828) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
E (645775) wifi:AP has neither DSSS parameter nor HT Information, drop it
I (12:57:07.714) mqtt_app: MQTT_EVENT_BEFORE_CONNECT
W (12:57:08.400) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:11.972) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:15.543) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:19.114) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:22.686) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
I (12:57:23.433) processor: send status... (60000ms)
W (12:57:26.275) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND

------------- ROUTER SET TO NULL HERE


W (12:57:27.701) mesh_main: Triggering FIXED ROOT handover
W (12:57:27.917) mesh_main: Checking wifi connectivity
W (12:57:27.919) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 106 ERROR
I (12:57:27.929) mesh_main: <MESH_EVENT_SCAN_DONE>number:0
W (12:57:28.051) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:28.179) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:28.305) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:28.431) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:28.558) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:28.687) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:28.813) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:28.939) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:29.066) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:29.193) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:29.318) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:29.444) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:29.573) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:29.698) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:29.824) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:29.950) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:30.078) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND
W (12:57:30.204) mesh_main: <MESH_EVENT_PARENT_DISCONNECTED>reason: 201 WIFI_REASON_NO_AP_FOUND

More Information.

I will propagate this to EspressIF sales because we need high priority in understanding how to solve it 😃

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 176 (119 by maintainers)

Most upvoted comments

@mhdong attaching both files with timestamps, modem stuck

device_modem_0302_1.txt device_7440_0302_1.txt

edit: attaching logs also from our other small setup

230302_modem_device.txt 230302_non_modem_device.txt

@zhangyanjiaoesp i managed to reproduce again the blocked root device. This time i had a monitoring system attached to it.

Here i will post an analysis of what is happening, it should help you solve the issue.

  • As i already noticed, the MQTT task becomes blocked
  • the block is coming from semaphore 3FCCF2F0, which is being fed by task “tiT”, which is the TCPIP_THREAD image

this looks like problem number 1

  • MTXON task starts looping aggressively on a queue (3FCC5694) every millisecond
  • This queue apparently is also fed by the TCPIP task image

this looks like problem number 2

So let’s go to check the TCPIP task.

  • It’s blocked on receiving queue 3FCC4AEC indefinitely
  • This queue seems to be fed by the MTXON task! image image

AND WE HAVE A DEADLOCK between MTXON and TCPIP task!

Since i cannot look at MTXON task, i cannot continue and solve the issue.

@zhangyanjiaoesp @mhdong With heap poisoning set to comprehensive it’s basically impossible to do a node handover due to either infinite loops or crashes.

I guess this is the main difference, not the latest library.


W (475467) mesh: [mesh_schedule.c,3130] [WND-RX]max_wnd:2, 1200 ms timeout, seqno:0, xseqno:29, no_wnd_count:0, timeout_count:1
E (07:07:23.813) mesh_netif: Send with err code 16394 ESP_ERR_MESH_TIMEOUT
CORRUPT HEAP: Invalid data at 0x3fcb541c. Expected 0xfefefefe got 0x001cfefe
CORRUPT HEAP: Invalid data at 0x3fcb5420. Expected 0xfefefefe got 0xfefe0000

assert failed: multi_heap_malloc multi_heap_poisoning.c:241 (ret)


Backtrace: 0x40375ace:0x3fceae50 0x4037c0c9:0x3fceae70 0x40382d61:0x3fceae90 0x40381e94:0x3fceafb0 0x40375dc9:0x3fceafd0 0x40375e29:0x3fceaff0 0x40375e66:0x3fceb010 0x40382d71:0x3fceb030 0x40378bb1:0x3fceb050 0x4206fcf3:0x3fceb070 0x4206fd50:0x3fceb090 0x4204e34d:0x3fceb0b0 0x42009658:0x3fceb110 0x4200968d:0x3fceb150 0x420b4241:0x3fceb170 0x4202e42b:0x3fceb190 0x4202dec1:0x3fceb1b0 0x42026bb6:0x3fceb1d0 0x42026eb6:0x3fceb1f0 0x42027fd9:0x3fceb220 0x4202802d:0x3fceb260 0x42028055:0x3fceb290 0x42022a9a:0x3fceb2c0 0x420236e9:0x3fceb2f0 0x42020225:0x3fceb320 0x420202ef:0x3fceb350 0x42023bc7:0x3fceb370 0x42023c89:0x3fceb390 0x4201b1b2:0x3fceb3b0 0x4201b267:0x3fceb3d0 0x4037f3f6:0x3fceb400
0x40375ace: panic_abort at C:/src/esp-idf/components/esp_system/panic.c:408

0x4037c0c9: esp_system_abort at C:/src/esp-idf/components/esp_system/esp_system.c:137

0x40382d61: __assert_func at C:/src/esp-idf/components/newlib/assert.c:85

0x40381e94: multi_heap_malloc at C:/src/esp-idf/components/heap/multi_heap_poisoning.c:241 (discriminator 1)

0x40375dc9: heap_caps_malloc_base at C:/src/esp-idf/components/heap/heap_caps.c:175

0x40375e29: heap_caps_malloc at C:/src/esp-idf/components/heap/heap_caps.c:195

0x40375e66: heap_caps_malloc_default at C:/src/esp-idf/components/heap/heap_caps.c:220

0x40382d71: malloc at C:/src/esp-idf/components/newlib/heap.c:24

0x40378bb1: wifi_malloc at C:/src/esp-idf/components/esp_wifi/esp32s3/esp_adapter.c:73

0x4206fcf3: mesh_malloc at ??:?

0x4206fd50: esp_mesh_create_context at ??:?

0x4204e34d: esp_mesh_send at ??:?

0x42009658: mesh_netif_transmit_from_node_sta at C:/src/esp-idf/examples/mesh/ip_internal_network/main/mesh_netif.c:186

0x4200968d: mesh_netif_transmit_from_node_sta_wrap at C:/src/esp-idf/examples/mesh/ip_internal_network/main/mesh_netif.c:195

0x420b4241: esp_netif_transmit_wrap at C:/src/esp-idf/components/esp_netif/lwip/esp_netif_lwip.c:898

0x4202e42b: low_level_output at C:/src/esp-idf/components/lwip/port/esp32/netif/wlanif.c:137

0x4202dec1: ethernet_output at C:/src/esp-idf/components/lwip/lwip/src/netif/ethernet.c:312 (discriminator 2)

0x42026bb6: etharp_output_to_arp_index at C:/src/esp-idf/components/lwip/lwip/src/core/ipv4/etharp.c:785

0x42026eb6: etharp_output at C:/src/esp-idf/components/lwip/lwip/src/core/ipv4/etharp.c:884

0x42027fd9: ip4_output_if_opt_src at C:/src/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:1108

0x4202802d: ip4_output_if_opt at C:/src/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:909

0x42028055: ip4_output_if at C:/src/esp-idf/components/lwip/lwip/src/core/ipv4/ip4.c:886

0x42022a9a: tcp_output_segment at C:/src/esp-idf/components/lwip/lwip/src/core/tcp_out.c:1607 (discriminator 4)

0x420236e9: tcp_output at C:/src/esp-idf/components/lwip/lwip/src/core/tcp_out.c:1360

0x42020225: tcp_slowtmr at C:/src/esp-idf/components/lwip/lwip/src/core/tcp.c:1431

0x420202ef: tcp_tmr at C:/src/esp-idf/components/lwip/lwip/src/core/tcp.c:242

0x42023bc7: tcpip_tcp_timer at C:/src/esp-idf/components/lwip/lwip/src/core/timeouts.c:161

0x42023c89: sys_check_timeouts at C:/src/esp-idf/components/lwip/lwip/src/core/timeouts.c:411

0x4201b1b2: tcpip_timeouts_mbox_fetch at C:/src/esp-idf/components/lwip/lwip/src/api/tcpip.c:115

0x4201b267: tcpip_thread at C:/src/esp-idf/components/lwip/lwip/src/api/tcpip.c:148

0x4037f3f6: vPortTaskWrapper at C:/src/esp-idf/components/freertos/port/xtensa/port.c:142

@KonssnoK Please use the new wifi lib to test whether the stuck issue has been solved, thanks. (wifi firmware version: 789e1fa9) esp32s3_789e1fa9.zip

Hi @KonssnoK We are still debugging the problem and still haven’t found the root cause. It looks like a tcpip thread blocked, but it doesn’t locate fixed callback function or time out callback function. Sometimes it stops at lwip_netconn_do_gethostbyname, sometimes it stops at dns_tmr, sometimes it stops at igmp_tmr. It should be an api of mesh that broke the state of tcpip. Now we are analyzing the internal function about mesh.

full_log_modem_device.txt

ok we triggered it also on an N8R8 with the test made by a different colleague, in a different country and different laptop.

It took 3 retries to trigger it + some retries at the beginning to understand what to do

@KonssnoK Please use the new wifi lib to test whether the stuck issue has been solved, thanks. wifi_lib_0227.zip (wifi firmware version: 8b575e5)

@KonssnoK I have checked the stuck logs on the modem device, when this issue happens, the modem device is root node.

And I have checked the mesh code, the MTXON task is a flow control task, it’s only used to send packets upward from a non-root node. When the node is root, the task will remain blocked in the receiving queue.

From the stuck log, I suspect the problem may be in the esp_mesh_send() function. And I have added some debug log, please use the following wifi lib to have a test. wifi_lib_0222.zip (wifi firmware version: 94a3ec8)

By the way, please try to call esp_mesh_send_block_time() to set the blocking time of esp_mesh_send(), and test whether this can solve the stuck issue or not.

image

@KonssnoK
I will continue debug the MTXON task on my side.

@KonssnoK I will start my holiday from Thursday 1/19, and I will looking at this issue until the last day.

Self-organizing network conflicts with esp_wifi_scan_start(). When STA is connecting, scan are not allowed! Please disable self organized before wifi scan.