esp-mqtt: esp_mqtt_client_enqueue() can block the caller in unreliable connection cases (IDFGH-7853)

Hi all,

Finding

esp_mqtt_client_enqueue() can block the caller for the full MQTT network timeout.

Expectations

esp_mqtt_client_enqueue() never blocks under any circumstances (or at most for a very short time, in the range of milliseconds).
Any blocking behavior of esp_mqtt_client_enqueue() is documented.

Long Version

From the documentation ("... could be used as a non blocking version of esp_mqtt_client_publish()"), I expected that esp_mqtt_client_enqueue() never blocks (or at most very briefly, in the range of milliseconds). I wanted to use esp_mqtt_client_enqueue() to avoid creating an additional MQTT task and save resources, since ESP-MQTT already creates a task of its own. However, the function can block for as long as the connection timeout, and the caller task is supervised by a watchdog with a much shorter timeout.

The happy path works great. I came across the bug during smoke tests, because I have to expect both an unreliable internet connection and an unreliable MQTT server. If the MQTT server becomes unavailable (e.g. by unplugging the server's LAN cable to simulate an ungraceful shutdown), the network communication hangs until the timeout expires (10 seconds by default). If I call esp_mqtt_client_enqueue() during this time, the function blocks the caller until the network connection timeout has elapsed.

The same happens if the MQTT server is unavailable from the start of the ESP32 program (e.g. when pointing the client to ws://8.8.8.8:80). If esp_mqtt_client_enqueue() is then called regardless of the connection state of ESP-MQTT, the caller is blocked until the network connection timeout triggers.

I noticed the problem because my device rebooted after the watchdog triggered. The caller of esp_mqtt_client_enqueue() is a task with the default watchdog timeout of 5 seconds. It also handles the GUI, which now becomes unresponsive from time to time. I know I could create a separate FreeRTOS task as a workaround, but it is ridiculous to need an additional task when ESP-MQTT already uses a dedicated task…

Details

I’m sorry that I cannot provide an example that reproduces the problem, as it would need a setup with multiple hosts and manually unplugging the LAN connection on the server.

The dependency on network timeouts arises from the use of MQTT_API_LOCK() / MQTT_API_UNLOCK(): the same lock is also used in esp_mqtt_task(), which can acquire the lock and then call blocking functions while still holding it.
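To make the consequence concrete, here is a rough timing model in plain C. It is only a sketch under stated assumptions: the task is assumed to hold the lock for the entire network timeout, and the caller is assumed to have fed its watchdog right before calling esp_mqtt_client_enqueue(). The function names enqueue_block_ms() and watchdog_would_trip() are hypothetical and not part of ESP-MQTT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: time the caller waits for the shared lock.
 * network_timeout_ms: how long the task holds the lock while it is stuck
 *                     in a blocking network call (assumed: the full timeout).
 * held_for_ms:        how long the lock has already been held when the
 *                     caller tries to enqueue. */
static uint32_t enqueue_block_ms(uint32_t network_timeout_ms, uint32_t held_for_ms)
{
    if (held_for_ms >= network_timeout_ms) {
        return 0u; /* lock already released, no waiting */
    }
    return network_timeout_ms - held_for_ms;
}

/* Hypothetical model: the watchdog trips if the caller is blocked for at
 * least the watchdog timeout (assumes it was fed just before the call). */
static bool watchdog_would_trip(uint32_t block_ms, uint32_t wdt_timeout_ms)
{
    assert(wdt_timeout_ms > 0u);
    return block_ms >= wdt_timeout_ms;
}
```

With the values from this report (10 second network timeout, 5 second watchdog), this model says that any enqueue attempt made during the first 5 seconds of a stalled network call would outlast the watchdog.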

From my understanding, the lock should be held only very briefly and never while calling blocking functions. If that cannot be avoided, the lock for esp_mqtt_task() should be separate from the lock used by esp_mqtt_client_enqueue(). Alternatively, a buffer could be used that never blocks the writer when putting messages in the queue.
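As an illustration of the last option, the following is a minimal sketch of a buffer whose writer never waits. It is not how ESP-MQTT is implemented: it is a generic single-producer, single-consumer ring buffer in C11, and all names (outbox_t, outbox_try_push(), outbox_try_pop()) and sizes are hypothetical. It assumes exactly one writer task and one reader task; several concurrent callers would need a different design.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define OUTBOX_SLOTS   8u   /* hypothetical capacity, must be a power of two */
#define OUTBOX_MSG_MAX 64u  /* hypothetical maximum payload size in bytes */

static_assert((OUTBOX_SLOTS & (OUTBOX_SLOTS - 1u)) == 0u,
              "OUTBOX_SLOTS must be a power of two");

typedef struct {
    size_t len;
    char   data[OUTBOX_MSG_MAX];
} outbox_msg_t;

typedef struct {
    outbox_msg_t  slots[OUTBOX_SLOTS];
    atomic_size_t head; /* next slot to write, advanced only by the producer */
    atomic_size_t tail; /* next slot to read, advanced only by the consumer */
} outbox_t;

static void outbox_init(outbox_t *q)
{
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Producer side: returns immediately. false means "full" or "too large",
 * so the caller decides what to do instead of being blocked. */
static bool outbox_try_push(outbox_t *q, const void *data, size_t len)
{
    if (len > OUTBOX_MSG_MAX) {
        return false;
    }
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head - tail == OUTBOX_SLOTS) {
        return false; /* queue is full */
    }
    outbox_msg_t *slot = &q->slots[head % OUTBOX_SLOTS];
    memcpy(slot->data, data, len);
    slot->len = len;
    /* Publish the slot only after its contents are written. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side (e.g. the network task): also returns immediately. */
static bool outbox_try_pop(outbox_t *q, outbox_msg_t *out)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (head == tail) {
        return false; /* queue is empty */
    }
    *out = q->slots[tail % OUTBOX_SLOTS];
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}
```

The trade-off in this sketch is that a full buffer is reported to the caller as a failed push rather than as a wait, so the caller task keeps running and can feed its watchdog.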

Question

Can you confirm this bug? Do you need any additional information?

Thanks in advance

About this issue

State: closed
Created 2 years ago
Comments: 19

Commits related to this issue

mqtt: Set state on stoping; Add error code to Subscribed event * Update submodule: git log --oneline ae53d799da294f03ef65c33e88fa33648e638134..fde00340f19b9f5ae81fff02ccfa9926f0e33687 Detailed descr... — committed to espressif/esp-idf by euripedesrocha 2 years ago
client: Add support for user events Also supporting configurable queue size for the internal event loop. Closes https://github.com/espressif/esp-mqtt/issues/230 — committed to egnor/esp-mqtt by david-cermak 2 years ago
client: Add support for user events Also supporting configurable queue size for the internal event loop. Closes https://github.com/espressif/esp-mqtt/issues/230 — committed to EmbeddedSystemClass/esp-mqtt-1 by david-cermak 2 years ago

Most upvoted comments

@michaelgaehwiler Thanks for the feedback. Yes, I think the user-event would be provided before implementing separate locking.

david-cermak on Jul 29, 2022