iotedge: Modules never reconnect after updating from 1.0.x to 1.1.x
Expected Behavior
Modules reconnect to edgeHub
after the new 1.1 container has started.
Current Behavior
All our modules fail and never reconnect to the edgeHub
. The silent failure does not exit the container, so we never self heal.
Steps to Reproduce
Update security daemon
- SSH to host
- Follow instructions on how to update security daemon
- When installing new versions, use command
sudo apt-get install iotedge=1.1.1* libiothsm-std=1.1.1*
Manually create host directory for edgeHub
mkdir -p /var/lib/edgeHub && \
sudo chown 1000 /var/lib/edgeHub && \
sudo chmod /var/lib/edgeHub
Deploy new edge container versions
We have an edge deployment schema in git, and we just simply sed
our versions and apply the schema. Here’s a redacted version:
{
"content": {
"modulesContent": {
"$edgeAgent": {
"properties.desired": {
"schemaVersion": "1.1",
"runtime": {
"type": "docker",
"settings": {
"minDockerVersion": "v1.25",
"loggingOptions": "", // Defined on a module basis
"registryCredentials": {
"aregistry": {
"username": "aRegistry",
"password": "<A_REGISTRY_PASSWORD>",
"address": "aregistry.azurecr.io"
}
}
}
},
"systemModules": {
"edgeAgent": {
"type": "docker",
"settings": {
"image": "mcr.microsoft.com/azureiotedge-agent:1.1",
"createOptions": "{\"HostConfig\":{\"Binds\":[\"/var/lib/edgeAgent/:/tmp/edgeAgent\"],\"LogConfig\":{\"Type\":\"json-file\",\"Config\":{\"max-size\":\"10m\",\"max-file\":\"3\"}}}}"
}
},
"edgeHub": {
"type": "docker",
"status": "running",
"restartPolicy": "always",
"startupOrder": 1,
"settings": {
"image": "mcr.microsoft.com/azureiotedge-hub:1.1",
"createOptions": "{\"HostConfig\":{\"Binds\":[\"/var/lib/edgeHub/:/tmp/edgeHub\"],\"PortBindings\":{\"5671/tcp\":[{\"HostPort\":\"5671\"}],\"8883/tcp\":[{\"HostPort\":\"8883\"}],\"443/tcp\":[{\"HostPort\":\"443\"}]},\"LogConfig\":{\"Type\":\"json-file\",\"Config\":{\"max-size\":\"10m\",\"max-file\":\"3\"}}}}"
}
}
},
"modules": {
// 5 of our own modules
}
}
},
"$edgeHub": {
"properties.desired": {
"schemaVersion": "1.1",
"routes": {
"upstream": {
"route": "FROM /messages/* INTO $upstream",
"priority": 0,
"timeToLiveSecs": 31536000
}
},
"storeAndForwardConfiguration": {
"timeToLiveSecs": 30
}
}
}
}
}
}
This is then deployed using from bitbucket using:
az iot edge deployment create --hub-name $HUB_NAME --deployment-id edge-deployment-$BITBUCKET_BUILD_NUMBER --content deployment-manifest.json --priority 0 --target-condition "tags.environment = 'dev' AND tags.type = 'type'" --subscription $AZURE_SUBSCRIPTION_ID
Context (Environment)
Output of iotedge check
Click here
$ sudo iotedge check
Configuration checks
--------------------
√ config.yaml is well-formed - OK
√ config.yaml has well-formed connection string - OK
√ container engine is installed and functional - OK
√ config.yaml has correct hostname - OK
√ config.yaml has correct URIs for daemon mgmt endpoint - OK
√ latest security daemon - OK
√ host time is close to real time - OK
√ container time is close to host time - OK
‼ DNS server - Warning
Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
You can ignore this warning if you are setting DNS server per module in the Edge deployment.
‼ production readiness: certificates - Warning
The Edge device is using self-signed automatically-generated development certificates.
They will expire in 23 days (at 2021-05-23 12:22:11 UTC) causing module-to-module and downstream device communication to fail on an active deployment.
After the certs have expired, restarting the IoT Edge daemon will trigger it to generate new development certs.
Please consider using production certificates instead. See https://aka.ms/iotedge-prod-checklist-certs for best practices.
√ production readiness: container engine - OK
‼ production readiness: logs policy - Warning
Container engine is not configured to rotate module logs which may cause it run out of disk space.
Please see https://aka.ms/iotedge-prod-checklist-logs for best practices.
You can ignore this warning if you are setting log policy per module in the Edge deployment.
√ production readiness: Edge Agent's storage directory is persisted on the host filesystem - OK
√ production readiness: Edge Hub's storage directory is persisted on the host filesystem - OK
Connectivity checks
-------------------
√ host can connect to and perform TLS handshake with IoT Hub AMQP port - OK
√ host can connect to and perform TLS handshake with IoT Hub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with IoT Hub MQTT port - OK
√ container on the default network can connect to IoT Hub AMQP port - OK
√ container on the default network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the default network can connect to IoT Hub MQTT port - OK
√ container on the IoT Edge module network can connect to IoT Hub AMQP port - OK
√ container on the IoT Edge module network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to IoT Hub MQTT port - OK
20 check(s) succeeded.
3 check(s) raised warnings. Re-run with --verbose for more details.
Device Information
- Host OS Ubuntu 18.04.5 LTS
- Architecture amd64
- Container OS Linux containers
Runtime Versions
- aziot-edged 1.1.1
- Edge Agent 1.1 rolling tag
- Edge Hub 1.1 rolling tag
- Docker/Moby
- Client 19.03.13+azure
- Server
- Engine 19.03.13+azure
- containerd 1.3.7+azure
- runc 1.0.0-rc92
- docker-init 0.18.0
Logs
our container logs
2021-04-28 09:25:53,268 [azure-iot-sdk-ReactorRunner] ERROR c.m.a.s.i.d.t.a.AmqpsIotHubConnection.onCBSSessionClosedUnexpectedly(720) - Amqp CBS session closed unexpectedly. Closing this connection...
com.microsoft.azure.sdk.iot.device.transport.amqps.exceptions.AmqpLinkDetachForcedException: Server busy, please retry operation
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsExceptionTranslator.convertFromAmqpException(AmqpsExceptionTranslator.java:38)
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection.onCBSSessionClosedUnexpectedly(AmqpsIotHubConnection.java:719)
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsCbsSessionHandler.onLinkClosedUnexpectedly(AmqpsCbsSessionHandler.java:143)
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsSenderLinkHandler.onLinkRemoteClose(AmqpsSenderLinkHandler.java:116)
at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:176)
at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:291)
at com.microsoft.azure.sdk.iot.device.transport.amqps.IotHubReactor.run(IotHubReactor.java:25)
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection$ReactorRunner.call(AmqpsIotHubConnection.java:1115)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:835)
2021-04-28 09:25:53,287 [azure-iot-sdk-ConnectionReconnectionTask:13738d06-ec74-4b6b-8ca5-ae604ed5a453] INFO s.i.l.c.i.e.c.ConnectionChangeCallback.execute(36) - EdgeModule connection status changed to 'DISCONNECTED_RETRYING'. Reason:
NO_NETWORK
2021-04-28 09:25:53,594 [azure-iot-sdk-ReactorRunner] ERROR c.m.a.s.i.d.t.a.AmqpsIotHubConnection.onTransportError(486) - Amqp transport error occurred, closing the amqps connection
com.microsoft.azure.sdk.iot.device.transport.amqps.exceptions.ProtonIOException: Connection refused
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsExceptionTranslator.convertFromAmqpException(AmqpsExceptionTranslator.java:84)
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection.onTransportError(AmqpsIotHubConnection.java:484)
at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:191)
at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:291)
at com.microsoft.azure.sdk.iot.device.transport.amqps.IotHubReactor.run(IotHubReactor.java:25)
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection$ReactorRunner.call(AmqpsIotHubConnection.java:1115)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:835)
2021-04-28 09:25:54,085 [azure-iot-sdk-ReactorRunner] ERROR c.m.a.s.i.d.t.a.AmqpsIotHubConnection.onTransportError(486) - Amqp transport error occurred, closing the amqps connection
com.microsoft.azure.sdk.iot.device.transport.amqps.exceptions.ProtonIOException: Connection refused
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsExceptionTranslator.convertFromAmqpException(AmqpsExceptionTranslator.java:84)
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection.onTransportError(AmqpsIotHubConnection.java:484)
at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:191)
at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:291)
at com.microsoft.azure.sdk.iot.device.transport.amqps.IotHubReactor.run(IotHubReactor.java:25)
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection$ReactorRunner.call(AmqpsIotHubConnection.java:1115)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:835)
2021-04-28 09:25:54,992 [azure-iot-sdk-ReactorRunner] ERROR c.m.a.s.i.d.t.a.AmqpsIotHubConnection.onTransportError(486) - Amqp transport error occurred, closing the amqps connection
com.microsoft.azure.sdk.iot.device.transport.amqps.exceptions.ProtonIOException: Connection refused
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsExceptionTranslator.convertFromAmqpException(AmqpsExceptionTranslator.java:84)
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection.onTransportError(AmqpsIotHubConnection.java:484)
at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:191)
at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:291)
at com.microsoft.azure.sdk.iot.device.transport.amqps.IotHubReactor.run(IotHubReactor.java:25)
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection$ReactorRunner.call(AmqpsIotHubConnection.java:1115)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:835)
2021-04-28 09:25:56,781 [azure-iot-sdk-ReactorRunner] ERROR c.m.a.s.i.d.t.a.AmqpsIotHubConnection.onTransportError(486) - Amqp transport error occurred, closing the amqps connection
com.microsoft.azure.sdk.iot.device.transport.amqps.exceptions.ProtonIOException: Connection refused
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsExceptionTranslator.convertFromAmqpException(AmqpsExceptionTranslator.java:84)
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection.onTransportError(AmqpsIotHubConnection.java:484)
at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:191)
at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:291)
at com.microsoft.azure.sdk.iot.device.transport.amqps.IotHubReactor.run(IotHubReactor.java:25)
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection$ReactorRunner.call(AmqpsIotHubConnection.java:1115)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:835)
2021-04-28 09:25:59,539 [azure-iot-sdk-ReactorRunner] ERROR c.m.a.s.i.d.t.a.AmqpsIotHubConnection.onTransportError(486) - Amqp transport error occurred, closing the amqps connection
com.microsoft.azure.sdk.iot.device.transport.amqps.exceptions.ProtonIOException: Connection refused
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsExceptionTranslator.convertFromAmqpException(AmqpsExceptionTranslator.java:84)
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection.onTransportError(AmqpsIotHubConnection.java:484)
at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:191)
at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:291)
at com.microsoft.azure.sdk.iot.device.transport.amqps.IotHubReactor.run(IotHubReactor.java:25)
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection$ReactorRunner.call(AmqpsIotHubConnection.java:1115)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:835)
2021-04-28 09:26:06,148 [azure-iot-sdk-ReactorRunner] ERROR c.m.a.s.i.d.t.a.AmqpsIotHubConnection.onTransportError(486) - Amqp transport error occurred, closing the amqps connection
com.microsoft.azure.sdk.iot.device.transport.amqps.exceptions.ProtonIOException: Connection refused
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsExceptionTranslator.convertFromAmqpException(AmqpsExceptionTranslator.java:84)
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection.onTransportError(AmqpsIotHubConnection.java:484)
at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:191)
at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:291)
at com.microsoft.azure.sdk.iot.device.transport.amqps.IotHubReactor.run(IotHubReactor.java:25)
at com.microsoft.azure.sdk.iot.device.transport.amqps.AmqpsIotHubConnection$ReactorRunner.call(AmqpsIotHubConnection.java:1115)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:835)
Note: I cannot reproduce the error and cannot supply these logs again without updating another device from 1.0.x to 1.1 That said, when this occurred both edgeAgent and edgeHub started without any issues, it was only my containers that did not reconnect.
aziot-edged logs
<Paste here between the triple backticks>
edge-agent logs
<Paste here between the triple backticks>
edge-hub logs
<Paste here between the triple backticks>
Additional Information
We connect our modules like this in Java:
ModuleClient.createFromEnvironment(IotHubClientProtocol.AMQPS);
Usually the runtime is resilient against disconnects, like pulling the ethernet cord, disabling interfaces. I’ve never seen our modules not reconnect before.
This would have self healed if the error produced made our containers exit.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 17 (8 by maintainers)
I updated the security daemon, rolled out an edge deployment with a new rolling tag 1.1 (previously rolling tag 1.0) for edge runtime containers. That made all of our custom modules disconnected from the edgeHub. To see if it would reconnect automatically, I waited 20 minutes before manually restarting one of our custom modules. A forced restart of one our custom modules and it connected to the edgeHub once again. Containers not forced to restart remained disconnected.
To see if an automatic reconnect could be forced by removing the edgeHub, I ran
sudo docker rm -f edgeHub
after those 20 minutes had passed. No change in connection status, but our custom modules produced additional logs. So nothing is removing our edgeHub container except for the new edge deployment where I specifically want them replaced with the new version.All our custom modules are running the
restartPolicy
of always. The issue here is that I have never seen our modules receive a disconnect and then never reconnect. This error does not cause our containers to crash, if they would have crashed we would have recovered from this due to the restart policy.I am 90% sure that when we roll out new edge deployments to 1.0.10, all containers get the same uptime (e.g. they all restart). I cannot verify this right now, but I could try it out later this evening, we don’t have any dev 1.0.x left.
I tried to issue a new edge deployment to one of our newly updated systems running 1.1, to see what the behavior was there. By issue new edgeployment I mean:
I deployed the exact same versions as before, I just created a new deployment with the same versions. Unfortunately, new deployments also breaks everything, I’ve never seen this issue pre 1.1.
Before deployment
After deployment
And unfortunately, I see the same disconnect here and we never recover, see logs below after new deployment received on target device:
edgeAgent
edgeHub
I have waited > 10 minutes for this to recover
One of our modules
Tell me if you need anything else from me
I performed an update according to the steps outlined in my first post, and here are the results.
sudo docker rm -f edgeHub
and all my containers disconnectsudo docker restart mycontainer
and the container connectssudo systemctl restart iotedge
So when receiving the new 1.1.1 edgeAgent container I do not get any SIGTERM for my containers, they just simply remain on as can be seen here by running
iotedge list
:edgeAgent after new edge deployment
edgeHub after new edge deployment
Our container after new edge deployment (same version, only new edge containers)
Our containers after sudo docker restart
One of our containers that did not reconnect gets a different error message after `sudo docker rm -f edgeHub` (EdgeModule connection status changed to 'DISCONNECTED_RETRYING'. Reason: NO_NETWORK)
@and-rewsmith here’s the rest of the info:
sudo docker rm -f edgeHub
, but our containers eventually reconnect (see below). I actually found a log snippet from an upgrade I performed but not from the edge runtime containers unfortunately. Note that I might not have waited long enough, snippet from upgrade scenario is only 2 minutes while I had to wait 10 for a reconnect after forcibly removing the container. The errors do look different though:Amqp transport error occurred, closing the amqps connection com.microsoft.azure.sdk.iot.device.transport.amqps.exceptions.ProtonIOException: null
Amqp CBS session closed unexpectedly. Closing this connection... com.microsoft.azure.sdk.iot.device.transport.amqps.exceptions.AmqpLinkDetachForcedException: Server busy, please retry operation
com.microsoft.azure.sdk.iot:iot-device-client:1.29.0
com.microsoft.azure.sdk.iot.provisioning:provisioning-device-client:1.4.0
com.microsoft.azure.sdk.iot.provisioning.security:x509-provider:1.1.3
/tmp/edgeHub
from theedgeHub
container to this directory on the host.Logs
One of our containers when upgrading from 1.0.x to 1.1.x
All logs here are from after running
sudo docker rm -f edgeHub
One of our containers
edgeAgent
edgeHub