esp-idf: WiFi connected, have IP, but IP stack broken (IDFGH-9)

I’m experiencing an issue where WiFi gets into a state where it thinks it’s connected, and has an IP, but all communications to and from the ESP fail, including pings.

This seems to happen after some random number of disconnect / reconnect cycles. I will get a SYSTEM_EVENT_STA_CONNECTED followed by SYSTEM_EVENT_STA_GOT_IP but the ESP can’t talk or be talked to.

Here is a log from a recent example:

wifi_disconnect
I (88367) wifi: state: run -> init (0)
I (88367) wifi: pm stop, total sleep time: 14144338 us / 24365739 us

I (88367) wifi: n:11 0, o:11 0, ap:255 255, sta:11 0, prof:1
S (88377)  x_wifi.c:224  (0x0000) SYSTEM_EVENT_STA_DISCONNECTED ssid "xx", bssid 80:2a:a8:03:0d:5b, reason WIFI_REASON_ASSOC_LEAVE.
S (88397)  x_wifi.c:331  (0x0000) WiFi connecting to "xx".
E (88517)  x_mqtt.c:251  (0xffffffff) MQTTPublish
I (88517) wifi: n:11 0, o:11 0, ap:255 255, sta:11 0, prof:1
I (89277) wifi: state: init -> auth (b0)
I (89287) wifi: state: auth -> assoc (0)
I (89287) wifi: state: assoc -> run (10)
I (89317) wifi: connected with xx, channel 11
I (89317) wifi: pm start, type: 1

S (89317)  x_wifi.c:181  (0x0000) SYSTEM_EVENT_STA_CONNECTED ssid "xx", bssid 80:2a:a8:03:0d:5b, channel 11, authmode WIFI_AUTH_WPA2_PSK.
I (90037)  x_mqtt.c:194  (0x0000) WiFi not connected, waiting for it.
I (90257) event: sta ip: 192.168.10.141, mask: 255.255.255.0, gw: 192.168.10.1
S (90257)  x_wifi.c:200  (0x0000) SYSTEM_EVENT_STA_GOT_IP IP 192.168.10.141, GW 192.168.10.1, Netmask 255.255.255.0.
I (90367)  x_mqtt.c:112  (0x0000) Connecting to MQTT at xx, use_tls = 1
I (93067)  x_mqtt.c:103  (0x0000) MQTT subscribed to xx
I (93237)  x_mqtt.c:103  (0x0000) MQTT subscribed to xx
I (93397)  x_mqtt.c:103  (0x0000) MQTT subscribed to xx
S (93497)  x_mqtt.c:170  (0x0000) xx connected to MQTT.
W (94337) wifi: alloc eb len=24 type=3 fail, heap:1540424

W (94337) wifi: m f null

W (95057) wifi: alloc eb len=24 type=3 fail, heap:1538656

W (95067) wifi: m f null

----------------- failures started here ---------------

E (115097) HTTP_REQ: Error write header
E (115097) HTTP_REQ: Error send request
I (115107) x_http_ap:104  (0x0108) API ERROR Status Code 0
E (115117) x_http_ap:167  (0x0108) err
E (115187) HTTP_REQ: Error write header
E (115187) HTTP_REQ: Error send request

I suspect that the following lines have something to do with it:

W (94337) wifi: alloc eb len=24 type=3 fail, heap:1540424
W (94337) wifi: m f null
W (95057) wifi: alloc eb len=24 type=3 fail, heap:1538656
W (95067) wifi: m f null

Once it gets in this state, it seems to persist until I force a WiFi disconnect and reconnect. During this time, attempting to ping the ESP from another computer on the network times out.

Environment

  • Development Kit: Custom
  • Core (if using chip or module): ESP32-Wrover
  • IDF version: 52f9a5ca
  • Power Supply: external 3.3V

Thanks, Jason

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 33 (15 by maintainers)

Most upvoted comments

Hi @burkulesomesh43 & @Ritesh236,

My guess is that this isn’t a bug in ESP-IDF, and is probably related to your application code.

I am not a contributor, but this might help:

  • CONFIG_WIFI_LWIP_ALLOCATION_FROM_SPIRAM_FIRST is not compulsory, but it is sensible if you have PSRAM installed. LWIP will then use heap_caps_malloc(sz, MALLOC_CAP_SPIRAM) where it can, reducing internal memory usage.
  • The default ESP-IDF malloc() implementation uses heap_caps_malloc(sz, MALLOC_CAP_8BIT). This often results in internal memory being allocated.
  • It is possible that you are exhausting internal memory with your application’s malloc() usage. I prefer to use MALLOC_CAP_SPIRAM whenever I can in application code. This is usually enough to solve the WiFi management buffer alloc failures.
  • You may want to look at CONFIG_SPIRAM_MALLOC_ALWAYSINTERNAL, which sets a size threshold: allocations smaller than the threshold prefer internal memory, and larger ones prefer external memory / PSRAM.
  • Also, check out CONFIG_SPIRAM_MALLOC_RESERVE_INTERNAL to reserve an internal memory area for only DMA / MALLOC_CAP_INTERNAL allocations.
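The allocation advice in the list above could look roughly like this in application code. This is only a minimal sketch, not code from this thread: `app_buf_alloc` and `app_buf_free` are hypothetical helper names, and the `#else` branch contains host-side stand-ins (an assumption, so the snippet also compiles without ESP-IDF). On the ESP32 the real `heap_caps_malloc()` and `heap_caps_free()` from `esp_heap_caps.h` are used.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#ifdef ESP_PLATFORM
#include "esp_heap_caps.h"
#else
/* Host-side stand-ins (assumption) so the sketch builds without ESP-IDF. */
#define MALLOC_CAP_8BIT   (1u << 2)
#define MALLOC_CAP_SPIRAM (1u << 10)
static void *heap_caps_malloc(size_t size, uint32_t caps) { (void)caps; return malloc(size); }
static void heap_caps_free(void *ptr) { free(ptr); }
#endif

/* Hypothetical helper: try PSRAM first for application buffers, so that
 * scarce internal (and DMA-capable) RAM stays available for WiFi. */
static void *app_buf_alloc(size_t size)
{
    void *p = heap_caps_malloc(size, MALLOC_CAP_SPIRAM);
    if (p == NULL) {
        /* PSRAM missing or full: fall back to any byte-addressable heap. */
        p = heap_caps_malloc(size, MALLOC_CAP_8BIT);
    }
    return p;
}

/* Hypothetical helper: release a buffer obtained from app_buf_alloc(). */
static void app_buf_free(void *p)
{
    heap_caps_free(p);
}
```

Buffers that a peripheral accesses via DMA should not go through a helper like this, since PSRAM is not DMA-capable on the ESP32.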

Read these pages, and it should become clear:

Although the above docs are for ‘latest’, a lot of the fundamentals apply to ESP-IDF v3.

liuzfesp did a good write-up on alloc eb fail: https://github.com/espressif/esp-idf/issues/1021#issuecomment-469502062

Finally, it may be better to post on the ESP32 forums, rather than this issue, as AFAIK it’s not a bug in IDF.

I hope you manage to fix your problem. Good luck.

Thanks @talss89 for your valuable feedback.

I have already checked a few of these configurations, but we will check them again and get back to you if we have any questions.

We will also open a topic on the ESP32 community forum with the details, to see if we can get further help there.

Thanks again for your valuable feedback.

Regards, Ritesh Prajapati

Alright, I think I am pretty close to a root cause on this. It appears to be related to memory allocation in the WiFi or TCP stack. I managed to get a new error message that seems to back this up:

W (11714) wifi: alloc eb len=24 type=3 fail
W (11724) wifi: m f null
W (11794) wifi: alloc eb len=36 type=3 fail
W (11794) wifi: mem fail

When my app starts it immediately tries to connect to a stored WiFi AP. Another task is blocking, waiting for WiFi to be connected, and as soon as it is, it tries to connect to MQTT over TLS. And yet another task is trying to send some requests to an HTTPS API if WiFi is connected.

I think what is actually happening is that DMA-capable RAM is being exhausted. If I put a long delay in my code before hitting the HTTPS API, everything is fine until it tries to connect; then I get the “null” failure. Here is a log with some memory logging included:

| 8BIT  | 32BIT | 32BIT - 8BIT | INTERNAL | SPIRAM |  DMA  | Heap Total | Used  | Free  | SOA Used | Free  |  HW   |
|=======|=======|==============|==========|========|=======|============|=======|=======|==========|=======|=======|
|1695740|1733568|         37828|     60748| 1672820|  22920|     4375996|2678708|1697288|         1|  32767|  32606|
I (2471) wifi: n:11 0, o:1 0, ap:255 255, sta:11 0, prof:1
I (3231) wifi: state: init -> auth (b0)
I (3231) wifi: state: auth -> assoc (0)
I (3241) wifi: state: assoc -> run (10)
I (3261) wifi: connected with xx, channel 11
S (3271)  xx_wifi.c:181  (0x0000) SYSTEM_EVENT_STA_CONNECTED ssid "xx", bssid xx, channel 11, authmode WIFI_AUTH_WPA2_PSK.
I (5231) event: sta ip: 192.168.10.141, mask: 255.255.255.0, gw: 192.168.10.1
S (5231)  xx_wifi.c:200  (0x0000) SYSTEM_EVENT_STA_GOT_IP IP 192.168.10.141, GW 192.168.10.1, Netmask 255.255.255.0.
I (5361)  xx_mqtt.c:115  (0x0000) Connecting to MQTT at xx, use_tls = 1
I (6241) wifi: pm start, type:1
|1647372|1685200|         38020|     45892| 1639116|   8064|     4375996|2678708|1697288|         1|  32767|  32606|
I (9541)  xx_mqtt.c:108  (0x0000) MQTT subscribed to xx
I (9631)  xx_mqtt.c:108  (0x0000) MQTT subscribed to xx
I (9711)  xx_mqtt.c:108  (0x0000) MQTT subscribed to xx
S (9731)  xx_mqtt.c:167  (0x0000) xx xx xx connected to MQTT.
|1650656|1688484|         37828|     47784| 1640700|   9956|     4375996|2678708|1697288|         1|  32767|  32606|
|1650924|1688752|         37828|     48056| 1640696|  10228|     4375996|2678708|1697288|         1|  32767|  32606|
|1650924|1688752|         37828|     48056| 1640696|  10228|     4375996|2678708|1697288|         1|  32767|  32606|

-- RUNNING HTTPS API REQUEST

W (25721) wifi: alloc eb len=24 type=3 fail
W (25721) wifi: m f null
|1596988|1634816|         37828|     37828| 1596988|      0|     4375996|2678708|1697288|         1|  32767|  32606|
|1592124|1629952|         37828|     37828| 1592124|      0|     4375996|2678708|1697288|         1|  32767|  32606|
|1591720|1629548|         37828|     37828| 1591720|      0|     4375996|2678708|1697288|         1|  32767|  32606|

I’m now investigating whether I can free up more DMA-capable RAM, to see if that helps.
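For anyone who wants to reproduce this kind of memory logging, a rough sketch follows. It is not the code that produced the table above: `format_heap_row` and `dma_headroom_ok` are hypothetical names, the row only covers four of the columns, and the `#else` branch holds host-side stand-ins (an assumption) that return the figures from the first table row so the snippet can be built without ESP-IDF. On the ESP32 the real `heap_caps_get_free_size()` from `esp_heap_caps.h` is used.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#ifdef ESP_PLATFORM
#include "esp_heap_caps.h"
#else
/* Host-side stand-ins (assumption): fixed values from the first table row. */
#define MALLOC_CAP_8BIT     (1u << 2)
#define MALLOC_CAP_DMA      (1u << 3)
#define MALLOC_CAP_SPIRAM   (1u << 10)
#define MALLOC_CAP_INTERNAL (1u << 11)
static size_t heap_caps_get_free_size(uint32_t caps)
{
    switch (caps) {
    case MALLOC_CAP_DMA:      return 22920;
    case MALLOC_CAP_INTERNAL: return 60748;
    case MALLOC_CAP_SPIRAM:   return 1672820;
    default:                  return 1695740; /* MALLOC_CAP_8BIT */
    }
}
#endif

/* Hypothetical helper: format free heap per capability as one table row
 * in the order 8BIT, INTERNAL, SPIRAM, DMA. Returns snprintf()'s result. */
static int format_heap_row(char *buf, size_t len)
{
    return snprintf(buf, len, "|%7u|%7u|%7u|%7u|",
                    (unsigned)heap_caps_get_free_size(MALLOC_CAP_8BIT),
                    (unsigned)heap_caps_get_free_size(MALLOC_CAP_INTERNAL),
                    (unsigned)heap_caps_get_free_size(MALLOC_CAP_SPIRAM),
                    (unsigned)heap_caps_get_free_size(MALLOC_CAP_DMA));
}

/* Hypothetical gate: true if at least `needed` bytes of DMA-capable heap
 * are free. The right value for `needed` is application-specific. */
static bool dma_headroom_ok(size_t needed)
{
    return heap_caps_get_free_size(MALLOC_CAP_DMA) >= needed;
}
```

A task could call `dma_headroom_ok()` before opening another TLS connection and wait or back off when it returns false. Whether that avoids the `alloc eb` failures here is untested.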

One other note: It seems like when these TLS connections fail they don’t give back the RAM. Could be a memory leak, or could just be that this is an unrecoverable error and I should be rebooting anyway.

And finally, I’ll note that I’m now on the 3.0.2 release instead of the detached HEAD I was using before.

Thanks, Jason

Hi @talss89, no, I’m not using AWS IoT, but I am using a custom port of Apache Paho. They are similar, and both use mbedtls. For what it’s worth, I see this error both when connecting to MQTT and when trying to make HTTPS connections, and as you note, it does seem to happen as soon as I start to connect to a service, rather than as soon as WiFi connects.

I have long suspected this had something to do with mbedtls, but it’s only a guess.