aws-iot-device-sdk-python-v2: MQTT infinite reconnect loop after server terminates connection
Describe the bug
Seeing an issue when testing corner cases for connection/network loss. If the server terminates the MQTT connection, the reconnect logic seems to fail forever on AWS_IO_SOCKET_CLOSED. Calling disconnect/connect directly seems to still reuse the same closed socket, the program needs to be restarted to recover.
Expected Behavior
If a MQTT server terminates the connection, client establishes a new connection.
Current Behavior
on_connection_interrupted callback generates expected error
AWS_ERROR_MQTT_UNEXPECTED_HANGUP: The connection was closed unexpectedly.
DEBUG log level shows reconnection attempt, but fails with “zero read, socket is closed”
[DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [task-scheduler] - id=0x7f1c84020b50: Scheduling mqtt_reconnect task for immediate execution [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [channel] - id=0x7f1c84001010: destroying channel. [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [channel-bootstrap] - id=0x1e4d6e0: releasing bootstrap reference [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [task-scheduler] - id=0x7f1c84020b50: Running mqtt_reconnect task with <Running> status [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [dns] - id=0x1b58db0: Host resolution requested for 10.36.165.254 [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [dns] - id=0x1b58db0: No cached entries found for 10.36.165.254 starting new resolver thread. [DEBUG] [2023-01-27T15:21:47Z] [00007f1c83fff700] [dns] - static: resolving host 10.36.165.254 [DEBUG] [2023-01-27T15:21:47Z] [00007f1c83fff700] [dns] - static: resolved record: 10.36.165.254 [DEBUG] [2023-01-27T15:21:47Z] [00007f1c83fff700] [dns] - static, resolving host 10.36.165.254 successful, returned 1 addresses [DEBUG] [2023-01-27T15:21:47Z] [00007f1c83fff700] [dns] - static: new address resolved 10.36.165.254 for host 10.36.165.254 caching [DEBUG] [2023-01-27T15:21:47Z] [00007f1c83fff700] [dns] - static, invoking resolution callback for host 10.36.165.254 with 1 addresses [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [task-scheduler] - id=0x7f1c78001580: Scheduling attempt_connection task for immediate execution [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [task-scheduler] - id=0x7f1c78001580: Running attempt_connection task with <Running> status [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [socket] - id=0x7f1c84000c90 fd=7: initializing with domain 0 and type 0 [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [socket] - id=0x7f1c84000c90 fd=7: setting socket options to: keep-alive 0, keep idle 0, keep-alive interval 0, keep-alive probe count 0. [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [socket] - id=0x7f1c84000c90 fd=7: beginning connect. [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [socket] - id=0x7f1c84000c90 fd=7: connecting to endpoint 10.36.165.254:1883. [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [task-scheduler] - id=0x7f1c840260b0: Scheduling (null) task for future execution at time 9144044252363669 [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [task-scheduler] - id=0x7f1c84018450: Scheduling epoll_event_loop_unsubscribe_cleanup task for immediate execution [INFO] [2023-01-27T15:21:47Z] [00007f1c899c5700] [socket] - id=0x7f1c84000c90 fd=7: connection success [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [socket] - id=0x7f1c84000c90 fd=7: assigning to event loop 0x295a6b0 [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [channel-bootstrap] - id=0x1e4d6e0: client connection on socket 0x7f1c84000c90 completed with error 0. [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [channel] - id=0x7f1c84001010: Beginning creation and setup of new channel. [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [task-scheduler] - id=0x7f1c84000f30: Scheduling on_channel_setup_complete task for immediate execution [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [task-scheduler] - id=0x7f1c84018450: Running epoll_event_loop_unsubscribe_cleanup task with <Running> status [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [task-scheduler] - id=0x7f1c84000f30: Running on_channel_setup_complete task with <Running> status [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [channel] - id=0x7f1c84001010: setup complete, notifying caller. [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [channel] - id=0x7f1c84001010: message pool 0x7f1c840013a0 found in event-loop local storage: using it. [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [channel-bootstrap] - id=0x1e4d6e0: channel 0x7f1c84001010 setup succeeded: bootstrapping. [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [socket-handler] - id=0x7f1c84011be0: Socket handler created with max_read_size of 16384 [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [task-scheduler] - id=0x7f1c84011d40: Scheduling tls_timeout task for future execution at time 9144049253347190 [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [task-scheduler] - id=0x7f1c84000e30: Scheduling socket_written_task task for immediate execution [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [task-scheduler] - id=0x7f1c84000e30: Running socket_written_task task with <Running> status [INFO] [2023-01-27T15:21:47Z] [00007f1c899c5700] [socket] - id=0x7f1c84000c90 fd=7: zero read, socket is closed [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [task-scheduler] - id=0x7f1c84001210: Scheduling channel_shutdown task for immediate execution [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [channel] - id=0x7f1c84001010: Channel shutdown is already pending, not scheduling another. [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [task-scheduler] - id=0x7f1c84001210: Running channel_shutdown task with <Running> status [DEBUG] [2023-01-27T15:21:47Z] [00007f1c899c5700] [channel] - id=0x7f1c84001010: beginning shutdown process
Reproduction Steps
-
Client establishes a connection with a mosquitto server:
mqtt_connection = mqtt_connection_builder.mtls_from_path( endpoint=host, port=port, cert_filepath=certificate, pri_key_filepath=privateKey, ca_filepath=rootCA, on_connection_interrupted=on_connection_interrupted, on_connection_resumed=on_connection_resumed, client_id=MQTTclientID, clean_session=False, keep_alive_secs=30, http_proxy_options=proxy_options) connect_future = mqtt_connection.connect() connection_result = connect_future.result() -
Restart the mosquitto server (terminates connection abruptly from server end)
Possible Solution
No response
Additional Information/Context
No response
SDK version used
awscrt==0.16.4, awsiotsdk==1.12.2
Environment details (OS name and version, etc.)
Ubuntu 20.04
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 17 (8 by maintainers)
Looks like it’s slightly different than what I originally thought. There’s a fundamental mistake in the reconnect time calculations. Oncall is looking at it now and rough guess is it should be fairly quick to address, but then we’ll need to do a release of the CRT and SDKs.