esp-idf: httpclient has unexpected and unhadled error that causes the call to not return during esp_https_ota (IDFGH-4543)

Environment

  • Development Kit: [none]
  • Module or chip used: [ESP32-WROOM-32D]
  • IDF version (run git describe --tags to find it): v4.2-47-g2532ddd9f
  • Build System: [idf.py]
  • Compiler version (run xtensa-esp32-elf-gcc --version to find it): xtensa-esp32-elf-gcc (crosstool-NG esp-2020r3) 8.4.0
  • Operating System: [Linux]
  • Using an IDE?: [No]
  • Power Supply: [external 3.3V]

Problem Description

Sometimes when I call esp_err_t ret = esp_https_ota(&config); the function never returns. It prints a few log messages referencing EAGAIN. There is no more OTA relevant information in the log.

Expected Behavior

  1. esp_https_ota should not block for 10+ hours
  2. As a user of esp_https_ota i expect the library to handle any errors or return a suitable esp_err_t
  3. It looks like EAGAIN is related to the non-blocking socket API, but esp_https_ota is blocking (as noted here). Thus it seems EAGAIN should not be a possible errno even in the internal implementation of esp_https_ota

Actual Behavior

esp_http_client_config_t config = {
        .url = image_location,
        .cert_pem = (char *)server_cert_pem_start,
        .event_handler = _http_event_handler,
	.timeout_ms = 20000,
};
esp_err_t ret = esp_https_ota(&config);
ESP_LOGI(TAG, "sometimes the above call never returns");

When esp_https_ota does not return, I get log output that looks like this:

D (08:01:36.759) esp_https_ota: Written image length 1104558
D (08:01:36.766) esp_https_ota: Written image length 1104847
D (08:01:36.772) esp_https_ota: Written image length 1105136
D (08:01:36.779) esp_https_ota: Written image length 1105425
D (08:01:36.785) esp_https_ota: Written image length 1105714
E (08:01:56.782) TRANS_SSL: esp_tls_conn_read error, errno=No more processes
W (08:01:56.783) HTTP_CLIENT: esp_transport_read returned:-26880 and errno:11 
D (08:01:56.792) esp_https_ota: Written image length 1105920

No more OTA logs are printed. The MQTT client on the system keeps working.

The issue only occurs on the office Wi-Fi network. When using an alternative wifi network, that is used for development only, the issue went away.

Steps to reproduce

I have not found a good way to reproduce this issue. As it is dependent on the wifi network used, I expect that this is tricky to reproduce. When testing in my environment I ran the code above on startup, patched esp_https_ota to do esp_restart before esp_https_ota_finish. This results in a system that downloads the ota image, then reboots. Eventually, it will hang in esp_https_ota as described above.

Other items

It may be that the issue is a lack of error handling in esp_https_ota. Alternatively, there is an error in the HTTP client that should not occur.

Please let me know if there is any additional information I can provide.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 33 (13 by maintainers)

Most upvoted comments

@ESP-YJM If you could check the logs then on line number 10947, you can see below logs where OTA fails

D (137979) SSL TLS: RX left 16408 bytes�[0m �[0;31mE (137983) TRANS_SSL: esp_tls_conn_read error, errno=No more processes�[0m D (137991) HTTP_CLIENT: need_read=7, byte_to_read=7, rlen=-26880, ridx=121�[0m �[0;33mW (137997) HTTP_CLIENT: esp_transport_read returned:-26880 and errno:11 �[0m D (138005) HTTP_CLIENT: is_data_remain=1, is_chunked=0, content_length=1974083�[0m lwip_select(63, 0x3fff4c58, 0x0, 0x3fff4c60, tvsec=3 tvusec=0) tcp_slowtmr: processing active pcb tcp_slowtmr: processing active pcb tcp_slowtmr: polling application tcp_slowtmr: processing active pcb tcp_slowtmr: processing active pcb tcp_slowtmr: polling application tcp_slowtmr: processing active pcb tcp_keepalive: sending KEEPALIVE probe to 20.73.224.89 tcp_keepalive: tcp_ticks 222 pcb->tmr 211 pcb->keep_cnt_sent 0 tcp_keepalive: seqno 1887117953 ackno 1290708190 err 0. lwip_selscan: fd=62 ready for reading

@softdel1003 I don’t think the OTA fail, you can check line 11076. OTA still works.

@arntdr This has been merged internally, it will likely appear on github with next codebase sync event (automated process but it should happen in next few days). For release, this will be part of v4.3 release, expected (roughly) towards end of Feb timeline.

@AxelLin keep_idle time should be ideally equal to connection timeout, that’s why it was added this way. However, I’ll consider this point in internal discussion

keep_idle time is nothing to do with connection timeout. keep_idle will be checked when pcb is in ESTABLISHED or CLOSE_WAIT state.