aws-iot-device-sdk-python-v2: After Linux suspend->resume aws data are not send to cloud. Socket error after ~20min

Confirm by changing [ ] to [x] below to ensure that it’s a bug:

Known Issue

  • [x I’m using ATS data type endpoint: the endpoint should look like <prefix>-ats.iot.<region>.amazonaws.com

Describe the bug Hi guys. We are running aws app on imx7ulp from NXP with LTE modem. Once we will suspend Linux and resume again the LTE interface will disappear for a while. But it will appear again with connection to the internet. But aws will be not able to sent any messages after internet is up again. It looks like no callbacks like: on_connection_interrupted and on_connection_resumed are called.

With eth0 it is happing too but not as often as with LTE.

Based on observation it looks like AWS_IO_SOCKET_TIMEOUT is thrown after ~20min. could be this Socket timeout edited to lower value? Also based on erros it looks like there are some issues to create socket, but at the end there is some created and used.

Before suspend 18 messages were sent. And after resume 257 message was sent. (256-18)*5s(sleep between msgs)/60 = ~20min

Interface and internet is up in 50s after resume.

SDK version number 1.3.0

Platform/OS/Device Linux kernel 4.19 - imx7ulp Arm® Cortex®-A7

To Reproduce (observed behavior) `

provLoc = "/prov"
crtLoc = "/certs"
AWS_CONNECT_DELAY_SECS = 10
AWS_KEEP_ALIVE_SECS = 3*60*60  # 3hours
client_id = "abc"
endpoint = "....-ats.iot.us-east-2.amazonaws.com"
aws_topic = "/d/"+client_id

# setup AWS connection
io.init_logging(io.LogLevel.Warn, 'stderr')
event_loop_group = io.EventLoopGroup(1)
host_resolver = io.DefaultHostResolver(event_loop_group)
client_bootstrap = io.ClientBootstrap(event_loop_group, host_resolver)

aws_mqtt = mqtt_connection_builder.mtls_from_path(
    endpoint=endpoint,
    cert_filepath="%s/device.crt" % (provLoc),
    pri_key_filepath="%s/device.key" % (provLoc),
    client_bootstrap=client_bootstrap,
    ca_filepath="%s/AmazonRootCA1.pem" % (crtLoc),
    on_connection_interrupted=aws_on_connection_interrupted,
    on_connection_resumed=aws_on_connection_resumed,
    client_id=client_id,
    clean_session=False,
    keep_alive_secs=AWS_KEEP_ALIVE_SECS)

# connect loop
connected = False
while not connected:
    try:
        print("[aws] Connecting to %s as %s ..." % (endpoint, client_id))
        connect_future = aws_mqtt.connect()
        connect_future.result()
        connected = True
    except:
        print("[aws] Failed connection: %s" % (sys.exc_info()[0]))
    sleep(AWS_CONNECT_DELAY_SECS)

print("[aws] Connected!")

# subscribe to AWS topic
print("[aws] Subscribing to topic '%s'..." % (aws_topic))
subscribe_future, packet_id = aws_mqtt.subscribe(
    topic=aws_topic,
    qos=mqtt.QoS.AT_LEAST_ONCE,
    callback=aws_on_message_received)
subscribe_result = subscribe_future.result()
print("[aws] Subscribed with %s" % (str(subscribe_result['qos'])))

i = 0
while True:
    LAST_PUBLISH_RESULT, _ = aws_mqtt.publish(topic=aws_topic,
                                              payload=str(i),
                                              qos=mqtt.QoS.AT_LEAST_ONCE)
    sleep(5)
    if not LAST_PUBLISH_RESULT.done():
        LAST_PUBLISH_RESULT.cancel()
        print("Cancel aws send request. Timeout.")
    else:
        print("Done...")
    i += 1

`

Expected behavior On laptop: When i disable interface and enable again it will send all messages which were not sent successfully.

Logs/output If applicable, add logs or error output. `

[aws] Connecting to …iot.us-east-2.amazonaws.com as abc … [WARN ] [2020-07-14T19:19:21Z] [b51fe460] [socket] - id=0xb4809ba8 fd=11: setsockopt() for NO_SIGNAL failed with errno 92. If you are having SIGPIPE signals thrown, you may want to install a signal trap in your application layer. [ERROR] [2020-07-14T19:19:21Z] [b51fe460] [socket] - id=0xb4809ba8 fd=11: connect failed with error code 101. [ERROR] [2020-07-14T19:19:21Z] [b51fe460] [channel-bootstrap] - id=0x1b22ec0: failed to create socket with error 1049 [WARN ] [2020-07-14T19:19:21Z] [b51fe460] [socket] - id=0xb4809ba8 fd=11: setsockopt() for NO_SIGNAL failed with errno 92. If you are having SIGPIPE signals thrown, you may want to install a signal trap in your application layer. [aws] Connected! [aws] Subscribing to topic ‘aws-data’… [aws] Subscribed with QoS.AT_LEAST_ONCE [aws] Received message from topic ‘aws-data’: b’0’ Done… [aws] Received message from topic ‘aws-data’: b’1’ Done… … [aws] Received message from topic ‘aws-data’: b’18’ s[ 1004.892006] PM: suspend entry (deep) [ 1004.903644] PM: Syncing filesystems … done. [ 1004.942160] Freezing user space processes … (elapsed 0.005 seconds) done. [ 1004.955471] OOM killer disabled. [ 1004.958729] Freezing remaining freezable tasks … (elapsed 0.002 seconds) done. [ 1004.968407] Suspending console(s) (use no_console_suspend to debug) [ 1005.140539] fsl-lpuart 402e0000.serial: ttyLP1: Unable to drain transmitter [ 1005.141117] PM: suspend devices took 0.170 seconds [ 1005.245416] PM: resume devices took 0.090 seconds [ 1005.262232] OOM killer enabled. [ 1005.265421] Restarting tasks … done. [ 1005.360987] PM: suspend exit [ERROR] [2020-07-14T19:19:48Z] [b473f460] [dns] - static: getaddrinfo failed with error_code -3 [ERROR] [2020-07-14T19:19:49Z] [b473f460] [dns] - static: getaddrinfo failed with error_code -3 Done… … [ERROR] [2020-07-14T19:19:51Z] [b473f460] [dns] - static: getaddrinfo failed with error_code -3 [ERROR] [2020-07-14T19:19:52Z] [b473f460] [dns] - static: getaddrinfo failed with error_code -3 [ERROR] [2020-07-14T19:19:53Z] [b473f460] [dns] - static: getaddrinfo failed with error_code -3 [ERROR] [2020-07-14T19:19:54Z] [b473f460] [dns] - static: getaddrinfo failed with error_code -3 Cancel aws send request. Timeout. Cancel aws send request. Timeout. Cancel aws send request. Timeout. … [ERROR] [2020-07-14T19:39:29Z] [b51fe460] [socket] - id=0xb4809ba8 fd=11: socket timed out. [aws] Connection interrupted. error: AwsCrtError(name=‘AWS_IO_SOCKET_TIMEOUT’, message=‘socket operation timed out.’, code=1048) [WARN ] [2020-07-14T19:39:30Z] [b51fe460] [socket] - id=0xb4809ba8 fd=11: setsockopt() for NO_SIGNAL failed with errno 92. If you are having SIGPIPE signals thrown, you may want to install a signal trap in your application layer. [ERROR] [2020-07-14T19:39:30Z] [b51fe460] [socket] - id=0xb4809ba8 fd=11: connect failed with error code 101. [ERROR] [2020-07-14T19:39:30Z] [b51fe460] [channel-bootstrap] - id=0x1b22ec0: failed to create socket with error 1049 [WARN ] [2020-07-14T19:39:30Z] [b51fe460] [socket] - id=0xb4809ba8 fd=11: setsockopt() for NO_SIGNAL failed with errno 92. If you are having SIGPIPE signals thrown, you may want to install a signal trap in your application layer. [aws] Connection resumed. return_code: ConnectReturnCode.ACCEPTED session_present: True Done… [aws] Received message from topic ‘aws-data’: b’257’ Done…

`

Additional context: On laptop with Kubuntu 20.04 i have same output during connect (aws sdk 1.3.1): WARN ] [2020-07-14T20:17:30Z] [00007fca11ace700] [socket] - id=0x7fca04004ca0 fd=6: setsockopt() for NO_SIGNAL failed with errno 92. If you are having SIGPIPE signals thrown, you may want to install a signal trap in your application layer. [ERROR] [2020-07-14T20:17:30Z] [00007fca11ace700] [socket] - id=0x7fca04004ca0 fd=6: connect failed with error code 101. [ERROR] [2020-07-14T20:17:30Z] [00007fca11ace700] [channel-bootstrap] - id=0x23e51b0: failed to create socket with error 1049 [WARN ] [2020-07-14T20:17:30Z] [00007fca11ace700] [socket] - id=0x7fca04004ca0 fd=6: setsockopt() for NO_SIGNAL failed with errno 92. If you are having SIGPIPE signals thrown, you may want to install a signal trap in your application layer. [aws] Connected!

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 18 (6 by maintainers)

Most upvoted comments

I’m commenting based on an issue I raised here for the AWS C++ IoT SDK.

With regard to @graebm’s comment:

MQTT vs TCP keep-alive

If you’re worried about data usage, there’s almost no different between using MQTT and TCP keepalive. The MQTT ping packet is 2 bytes, and the response from the server is also 2 bytes. Each of these is wrapped in TCP/IPV4/Ethernet framing which add another 64+ bytes.

A TCP keep-alive packet will not have the MQTT data so it will be 2 bytes smaller: 64 bytes instead of 66.

A couple of points to make around data consumption.

  1. The above behaviour is not what I observe. I observe the following behaviour:
  • For TCP Keep-alive: a keep-alive packet is sent (68 bytes according to wireshark) and a response is received (68 byes). Total of 136 bytes.
  • For MQTT Ping: A PINGREQ is sent (99 bytes, this will have TLS header and TLS footer and maybe some TLS padding as well, see here), then a PINGRESP is received (also 99 bytes) and then a TCP ACK is sent (68 bytes). Total of 266 bytes.
  1. This is compounded by the problem that it appears the MQTT sdk keep-alive blindly sends a ping every x seconds. While the TCP keep-alive sends a keep-alive packet after x seconds, but only when there has been no traffic for x seconds. Which means if we are sending small payloads frequently the MQTT keep-alive may send a PINGREQ just after the MQTT payload is sent, but the connection does not need a “keep alive” packet at that moment, so wasted data.

In a nutshell we are currently using TCP keep-alive over a cellular connection and it has significantly reduced our data consumption, around 30 to 40% reduction in data consumption, and the same keep-alive.

Hi @rccarper , i was able to try your suggestion. Still same behavior. Based on datetimestamp socket error came after 22min:

2020-07-17 16:10:29.537898> [aws] Connected!
2020-07-17 16:10:29.538471> [aws] Subscribing to topic '/d/8e45f6f05ab9ee9662101772906c4beedf082401'...
2020-07-17 16:10:29.754099> [aws] Subscribed with QoS.AT_LEAST_ONCE
2020-07-17 16:10:29.942506> [aws] Received message from topic '/d/8e45f6f05ab9ee9662101772906c4beedf082401': b'0'
2020-07-17 16:10:34.860585> Done...0
2020-07-17 16:10:35.049977> [aws] Received message from topic '/d/8e45f6f05ab9ee9662101772906c4beedf082401': b'1'
2020-07-17 16:10:39.980838> Done...1
2020-07-17 16:10:40.176548> [aws] Received message from topic '/d/8e45f6f05ab9ee9662101772906c4beedf082401': b'2'
2020-07-17 16:10:58.475283> Done...2
[ERROR] [2020-07-17T16:10:44Z] [b46bf460] [dns] - static: getaddrinfo failed with error_code -3
[ 1384.874952] PM: suspend entry (deep)
connect: Network is unreachable
2020-07-17 16:10:45.139601> network not up
[ 1384.878613] PM: Syncing filesystems ... done.
[ 1385.028325] Freezing user space processes ... (elapsed 0.006 seconds) done.
[ 1385.042059] OOM killer disabled.
[ 1385.045331] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[ 1385.055034] Suspending console(s) (use no_console_suspend to debug)
[ 1385.171712] fsl-lpuart 402e0000.serial: ttyLP1: Unable to drain transmitter
[ 1385.172305] PM: suspend devices took 0.120 seconds
[ 1385.276644] PM: resume devices took 0.090 seconds
[ 1385.293582] OOM killer enabled.
[ 1385.296788] Restarting tasks ... done.
[ 1385.345961] PM: suspend exit
[ERROR] [2020-07-17T16:10:48Z] [b46bf460] [dns] - static: getaddrinfo failed with error_code -3
[ERROR] [2020-07-17T16:10:50Z] [b46bf460] [dns] - static: getaddrinfo failed with error_code -3
[ERROR] [2020-07-17T16:10:51Z] [b46bf460] [dns] - static: getaddrinfo failed with error_code -3
[ERROR] [2020-07-17T16:10:52Z] [b46bf460] [dns] - static: getaddrinfo failed with error_code -3
connect: Network is unreachable
2020-07-17 16:10:58.312625> network up
2020-07-17 16:11:03.633953> Aws send request. Timeout.3
2020-07-17 16:11:08.834384> Aws send request. Timeout.3
2020-07-17 16:11:13.995835> Aws send request. Timeout.3
2020-07-17 16:11:19.116620> Aws send request. Timeout.3
2020-07-17 16:11:24.236439> Aws send request. Timeout.3
...
2020-07-17 16:32:39.822748> Aws send request. Timeout.3
2020-07-17 16:32:44.941931> Aws send request. Timeout.3
2020-07-17 16:32:49.472105> [aws] Connection interrupted. error: AwsCrtError(name='AWS_IO_SOCKET_TIMEOUT', message='socket operation timed out.', code=1048)
[ERROR] [2020-07-17T16:32:49Z] [b51fe460] [socket] - id=0xb480a1c0 fd=11: socket timed out.
[WARN ] [2020-07-17T16:32:49Z] [b51fe460] [socket] - id=0xb480a1c0 fd=11: setsockopt() for NO_SIGNAL failed with errno 92. If you are having SIGPIPE signals thrown, you may want to install a signal trap in your application layer.
[ERROR] [2020-07-17T16:32:49Z] [b51fe460] [socket] - id=0xb480a1c0 fd=11: connect failed with error code 101.
[ERROR] [2020-07-17T16:32:49Z] [b51fe460] [channel-bootstrap] - id=0xe6a700: failed to create socket with error 1049
[WARN ] [2020-07-17T16:32:49Z] [b51fe460] [socket] - id=0xb480a1c0 fd=11: setsockopt() for NO_SIGNAL failed with errno 92. If you are having SIGPIPE signals thrown, you may want to install a signal trap in your application layer.
2020-07-17 16:32:50.061718> Done...3
2020-07-17 16:32:50.434261> [aws] Connection resumed. return_code: ConnectReturnCode.ACCEPTED session_present: True
2020-07-17 16:32:50.648095> [aws] Received message from topic '/d/8e45f6f05ab9ee9662101772906c4beedf082401': b'4'
2020-07-17 16:32:55.182068> Done...4
2020-07-17 16:32:55.374028> [aws] Received message from topic '/d/8e45f6f05ab9ee9662101772906c4beedf082401': b'5'
2020-07-17 16:33:00.302221> Done...5