esp-idf: WiFi connected, have IP, but IP stack broken (IDFGH-9)
I’m experiencing an issue where WiFi gets into a state where it thinks it’s connected, and has an IP, but all communications to and from the ESP fail, including pings.
This seems to happen after some random number of disconnect / reconnect cycles. I will get a SYSTEM_EVENT_STA_CONNECTED followed by SYSTEM_EVENT_STA_GOT_IP but the ESP can’t talk or be talked to.
Here is a log from a recent example:
wifi_disconnect
I (88367) wifi: state: run -> init (0)
I (88367) wifi: pm stop, total sleep time: 14144338 us / 24365739 us
I (88367) wifi: n:11 0, o:11 0, ap:255 255, sta:11 0, prof:1
S (88377) x_wifi.c:224 (0x0000) SYSTEM_EVENT_STA_DISCONNECTED ssid "xx", bssid 80:2a:a8:03:0d:5b, reason WIFI_REASON_ASSOC_LEAVE.
S (88397) x_wifi.c:331 (0x0000) WiFi connecting to "xx".
E (88517) x_mqtt.c:251 (0xffffffff) MQTTPublish
I (88517) wifi: n:11 0, o:11 0, ap:255 255, sta:11 0, prof:1
I (89277) wifi: state: init -> auth (b0)
I (89287) wifi: state: auth -> assoc (0)
I (89287) wifi: state: assoc -> run (10)
I (89317) wifi: connected with xx, channel 11
I (89317) wifi: pm start, type: 1
S (89317) x_wifi.c:181 (0x0000) SYSTEM_EVENT_STA_CONNECTED ssid "xx", bssid 80:2a:a8:03:0d:5b, channel 11, authmode WIFI_AUTH_WPA2_PSK.
I (90037) x_mqtt.c:194 (0x0000) WiFi not connected, waiting for it.
I (90257) event: sta ip: 192.168.10.141, mask: 255.255.255.0, gw: 192.168.10.1
S (90257) x_wifi.c:200 (0x0000) SYSTEM_EVENT_STA_GOT_IP IP 192.168.10.141, GW 192.168.10.1, Netmask 255.255.255.0.
I (90367) x_mqtt.c:112 (0x0000) Connecting to MQTT at xx, use_tls = 1
I (93067) x_mqtt.c:103 (0x0000) MQTT subscribed to xx
I (93237) x_mqtt.c:103 (0x0000) MQTT subscribed to xx
I (93397) x_mqtt.c:103 (0x0000) MQTT subscribed to xx
S (93497) x_mqtt.c:170 (0x0000) xx connected to MQTT.
W (94337) wifi: alloc eb len=24 type=3 fail, heap:1540424
W (94337) wifi: m f null
W (95057) wifi: alloc eb len=24 type=3 fail, heap:1538656
W (95067) wifi: m f null
----------------- failures started here ---------------
E (115097) HTTP_REQ: Error write header
E (115097) HTTP_REQ: Error send request
I (115107) x_http_ap:104 (0x0108) API ERROR Status Code 0
E (115117) x_http_ap:167 (0x0108) err
E (115187) HTTP_REQ: Error write header
E (115187) HTTP_REQ: Error send request
I suspect that the following lines have something to do with it:
W (94337) wifi: alloc eb len=24 type=3 fail, heap:1540424
W (94337) wifi: m f null
W (95057) wifi: alloc eb len=24 type=3 fail, heap:1538656
W (95067) wifi: m f null
Once it gets in this state, it seems to persist until I force a WiFi disconnect and reconnect. During this time, attempting to ping the ESP from another computer on the network times out.
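Since a forced disconnect and reconnect is the only thing that clears the state, that workaround could be automated. Below is a minimal sketch in C, assuming the application can tell when a request has failed. The names link_watchdog_t, link_watchdog_init and link_watchdog_report are hypothetical and not part of ESP-IDF, and the threshold is an arbitrary value the application would choose. On the target, the caller would react to a true return by calling esp_wifi_disconnect() followed by esp_wifi_connect().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not an ESP-IDF API: counts consecutive network
 * failures and signals when a forced WiFi reconnect looks warranted. */
typedef struct {
    uint32_t consecutive_failures; /* failures seen since the last success */
    uint32_t threshold;            /* assumed, application-chosen limit */
} link_watchdog_t;

static void link_watchdog_init(link_watchdog_t *wd, uint32_t threshold)
{
    wd->consecutive_failures = 0;
    wd->threshold = threshold;
}

/* Report the outcome of one network operation (HTTP request, MQTT publish).
 * Returns true when the caller should force a reconnect, for example with
 * esp_wifi_disconnect() followed by esp_wifi_connect(). */
static bool link_watchdog_report(link_watchdog_t *wd, bool success)
{
    if (success) {
        wd->consecutive_failures = 0;
        return false;
    }
    wd->consecutive_failures++;
    if (wd->consecutive_failures >= wd->threshold) {
        /* Start counting afresh once the reconnect has been requested. */
        wd->consecutive_failures = 0;
        return true;
    }
    return false;
}
```

With a threshold of 3, the third failure in a row requests a reconnect, and any success in between resets the count. This only hides the symptom; it does not address whatever puts the stack into this state.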
Environment
- Development Kit: Custom
- Core (if using chip or module): ESP32-Wrover
- IDF version: 52f9a5ca
- Power Supply: external 3.3V
Thanks, Jason
About this issue
- State: closed
- Created 6 years ago
- Comments: 33 (15 by maintainers)
Commits related to this issue
- Increase _network_event_task priority (#2184) Fixes https://github.com/espressif/arduino-esp32/issues/1595 — committed to 0xFEEDC0DE64/esp-idf by copercini 6 years ago
Hi @burkulesomesh43 & @Ritesh236,
My guess is that this isn’t a bug in ESP-IDF, and is probably related to your application code.
I am not a contributor, but this might help:
- CONFIG_WIFI_LWIP_ALLOCATION_FROM_SPIRAM_FIRST is not compulsory, but it is sensible if you have PSRAM installed. LWIP will then use heap_caps_malloc(sz, MALLOC_CAP_SPIRAM) where it can, reducing internal memory usage.
- A plain malloc() is equivalent to heap_caps_malloc(sz, MALLOC_CAP_8BIT). This often results in internal memory being allocated.
- I use MALLOC_CAP_SPIRAM whenever I can in application code. This is usually enough to solve the WiFi management buffer alloc failures.
- Use CONFIG_SPIRAM_MALLOC_ALWAYSINTERNAL to set a threshold: allocations smaller than the threshold prefer internal memory, larger ones prefer external / PSRAM.
- Use CONFIG_SPIRAM_MALLOC_RESERVE_INTERNAL to reserve an internal memory area for only DMA / MALLOC_CAP_INTERNAL allocations.
Read these pages, and it should become clear:
Although the above docs are for ‘latest’, a lot of the fundamentals apply to ESP-IDF v3.
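As an illustration of preferring MALLOC_CAP_SPIRAM in application code, here is a minimal sketch in C. The helper name app_malloc_prefer_spiram is hypothetical; heap_caps_malloc, MALLOC_CAP_SPIRAM and MALLOC_CAP_8BIT are the ESP-IDF names. The #else branch contains host stand-ins (an assumption, with placeholder values) so the sketch also compiles off-target, where every request simply goes to malloc().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#ifdef ESP_PLATFORM
#include "esp_heap_caps.h"
#else
/* Host stand-ins so the sketch builds without ESP-IDF. The capability
 * values are placeholders and the allocator just forwards to malloc(). */
#define MALLOC_CAP_8BIT   (1u << 2)
#define MALLOC_CAP_SPIRAM (1u << 10)
static void *heap_caps_malloc(size_t size, uint32_t caps)
{
    (void)caps;
    return malloc(size);
}
#endif

/* Hypothetical application helper: try external PSRAM first so that
 * internal (and DMA-capable) memory is left for the WiFi and TCP stacks,
 * then fall back to the default heap. Not suitable for buffers that must
 * be DMA-capable. The result is released with free(). */
static void *app_malloc_prefer_spiram(size_t size)
{
    void *p = heap_caps_malloc(size, MALLOC_CAP_SPIRAM | MALLOC_CAP_8BIT);
    if (p != NULL) {
        return p;
    }
    return malloc(size); /* no PSRAM available, or PSRAM is full */
}
```

Application buffers that do not need DMA (JSON payloads, caches, large scratch buffers) are the usual candidates for a helper like this.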
liuzfesp did a good write-up on alloc eb fail: https://github.com/espressif/esp-idf/issues/1021#issuecomment-469502062
Finally, it may be better to post on the ESP32 forums, rather than this issue, as AFAIK it’s not a bug in IDF.
I hope you manage to fix your problem. Good luck.
https://github.com/espressif/esp-idf/issues/3592#issuecomment-517926444
Thanks @talss89 for your valuable feedback.
I already checked a few of these configurations earlier, but we will check them again and get back to you if we have any questions.
We will also open a topic on the ESP32 forum with the details to get further help.
Thanks again for your valuable feedback
Regards, Ritesh Prajapati
Alright, I think I am pretty close to a root cause on this. It appears to be related to memory allocation in the WiFi or TCP stack. I managed to get a new error message that seems to back this up:
When my app starts it immediately tries to connect to a stored WiFi AP. Another task is blocking, waiting for WiFi to be connected, and as soon as it is, it tries to connect to MQTT over TLS. And yet another task is trying to send some requests to an HTTPS API if WiFi is connected.
I think what is actually happening is that DMA capable RAM is being exhausted. If I put a long delay in my code before hitting the HTTPS API everything is fine until it tries to connect, then I get the “null” failure. Here is a log with some memory logging included:
I’m investigating now if I can find a way to free up more DMA capable RAM to see if that helps.
One other note: It seems like when these TLS connections fail they don’t give back the RAM. Could be a memory leak, or could just be that this is an unrecoverable error and I should be rebooting anyway.
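One way to check both suspicions (DMA-capable RAM running out, and failed TLS connections not returning memory) is to snapshot the DMA-capable heap before and after a connection attempt. A minimal sketch in C follows. heap_caps_get_free_size, heap_caps_get_largest_free_block and MALLOC_CAP_DMA are ESP-IDF names; dma_heap_snapshot_t, dma_heap_snapshot and dma_heap_lost are hypothetical helpers. The #else branch holds host stand-ins with made-up numbers (an assumption) so the sketch builds off-target.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#ifdef ESP_PLATFORM
#include "esp_heap_caps.h"
#else
/* Host stand-ins so the sketch builds without ESP-IDF.
 * The capability value and the returned sizes are made up. */
#define MALLOC_CAP_DMA (1u << 3)
static size_t heap_caps_get_free_size(uint32_t caps)
{
    (void)caps;
    return 40000;
}
static size_t heap_caps_get_largest_free_block(uint32_t caps)
{
    (void)caps;
    return 16000;
}
#endif

/* Hypothetical record of the DMA-capable heap at one point in time. */
typedef struct {
    size_t free_bytes;    /* total free DMA-capable memory */
    size_t largest_block; /* largest single free block; shows fragmentation */
} dma_heap_snapshot_t;

/* Take a snapshot and print it with a label such as "before TLS". */
static dma_heap_snapshot_t dma_heap_snapshot(const char *label)
{
    dma_heap_snapshot_t s;
    s.free_bytes = heap_caps_get_free_size(MALLOC_CAP_DMA);
    s.largest_block = heap_caps_get_largest_free_block(MALLOC_CAP_DMA);
    printf("[%s] DMA-capable free: %lu bytes, largest block: %lu bytes\n",
           label, (unsigned long)s.free_bytes, (unsigned long)s.largest_block);
    return s;
}

/* Bytes that were free in 'before' but are no longer free in 'after';
 * zero if the memory was returned (or more became free). */
static size_t dma_heap_lost(dma_heap_snapshot_t before, dma_heap_snapshot_t after)
{
    return before.free_bytes > after.free_bytes
               ? before.free_bytes - after.free_bytes
               : 0;
}
```

Taking one snapshot before a connection attempt and another after it has failed and been cleaned up, then passing both to dma_heap_lost, would show whether the failed attempt kept any DMA-capable memory.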
And finally, I’ll note that I’m now on the 3.0.2 release instead of the detached HEAD I was using before.
Thanks, Jason
Hi @talss89, no, I’m not using AWS IoT, but I am using a custom port of Apache Paho. The two are similar, and both use mbedtls. For what it’s worth, I see this error both when connecting to MQTT and when making HTTPS connections, and as you note, it does seem to happen as soon as I start to connect to a service, rather than as soon as WiFi connects.
I have long suspected this had something to do with mbedtls, but it’s only a guess.