iotedge: AuthenticationMode CloudAndScope fails to refresh service identity after ~1h

Hello,

I am a developer on the LoRaWan IoT Edge project. This issue is a reopening of #4999 as we are still facing it, I tried to port over important context and information.

Expected Behavior

Our module is impersonating devices and sending messages to the edgeHub on behalf of a DeviceClient but after a while it stops sending upstreams. DeviceClient still can send to edgeHub, but the messages there are dropped.

There seems to be a bug, setting “AuthenticationMode” environment variable to “CloudAndScope” works for a while sometimes just 1h and sometime for a few hours…. and then is failing, edgeHub is trying to recreate the connection but it fails (see error below) and after no more messages go upstream (the module still can send message to edgeHub, but they get dropped, see picture below), restarting edgeHub fixes for a while.

<6> 2021-10-11 12:55:45.382 +00:00 [INF] - Entering periodic task to reauthenticate connected clients
<6> 2021-10-11 12:55:45.382 +00:00 [INF] - Unable to authenticate client IotHubTestDevice with cached service identity IotHubTestDevice (Found: False). Resyncing service identity...
<6> 2021-10-11 12:55:45.383 +00:00 [INF] - Refreshing service identity for IotHubTestDevice
<4> 2021-10-11 12:55:45.888 +00:00 [WRN] - Error while refreshing the service identity: IotHubTestDevice OnBehalfOf: TestGateway
Microsoft.Azure.Devices.Edge.Hub.Core.DeviceInvalidStateException: Device not in scope: [Unauthorized: Message: Error getting device scope result from upstream, HttpStatusCode: Unauthorized, Content: {"Message":"ErrorCode:IotHubUnauthorizedAccess;Unauthorized","ExceptionMessage":"Tracking ID:24e00d57b069431baf48e6f58f8b1989-G:0-TimeStamp:10/11/2021 12:55:45"}].
 ---> Microsoft.Azure.Devices.Edge.Hub.CloudProxy.DeviceScopeApiException: Message: Error getting device scope result from upstream, HttpStatusCode: Unauthorized, Content: {"Message":"ErrorCode:IotHubUnauthorizedAccess;Unauthorized","ExceptionMessage":"Tracking ID:24e00d57b069431baf48e6f58f8b1989-G:0-TimeStamp:10/11/2021 12:55:45"}
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.NestedDeviceScopeApiClient.GetIdentityOnBehalfOfInternalAsync(Uri uri, String deviceId, Option`1 moduleId, String onBehalfOfDevice) in /home/vsts/work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/NestedDeviceScopeApiClient.cs:line 191
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter[T](Task`1 task, TimeSpan timeout) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 128
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.NestedDeviceScopeApiClient.GetIdentityOnBehalfOfWithRetry(Uri uri, String deviceId, Option`1 moduleId, String onBehalfOfDevice) in /home/vsts/work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/NestedDeviceScopeApiClient.cs:line 140
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ServiceProxy.GetServiceIdentity(String deviceId, String onBehalfOfDevice) in /home/vsts/work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ServiceProxy.cs:line 101
   --- End of inner exception stack trace ---
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ServiceProxy.GetServiceIdentity(String deviceId, String onBehalfOfDevice) in /home/vsts/work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ServiceProxy.cs:line 101
   at Microsoft.Azure.Devices.Edge.Hub.Core.DeviceScopeIdentitiesCache.RefreshServiceIdentityInternal(String refreshTarget, String onBehalfOfDevice, Boolean invokeServiceIdentitiesUpdated) in /home/vsts/work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.Core/DeviceScopeIdentitiesCache.cs:line 201
<6> 2021-10-11 12:55:45.891 +00:00 [INF] - Removed node: IotHubTestDevice

image

The problem occurs regularly in the LoRa IoT Edge Starterkit CI and triggered us to remove the IoT Edge queue usage from the CI. We have built a simple repro code that can retrigger the issue both on IoT Edge 1.1 and 1.2.

Steps to Reproduce

Provide a detailed set of steps to reproduce the bug.

  1. Build a new module with this code
  2. Replace the connection string with a correct value, do not forget to add the GatewayHostName
  3. Add the environment variable ‘AuthenticationMode’ set to ‘CloudAndScope’ on the egeHub
  4. Let the solution run for ~2 hours and see the above error being reported

Context (Environment)

Output of iotedge check


Configuration checks
--------------------
√ config.yaml is well-formed - OK
√ config.yaml has well-formed connection string - OK
√ container engine is installed and functional - OK
√ config.yaml has correct hostname - OK
√ config.yaml has correct URIs for daemon mgmt endpoint - OK
‼ latest security daemon - Warning
    Installed IoT Edge daemon has version 1.1.3 but 1.1.6 is the latest stable version available.
    Please see https://aka.ms/iotedge-update-runtime for update instructions.
√ host time is close to real time - OK
√ container time is close to host time - OK
‼ DNS server - Warning
    Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
    Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
    You can ignore this warning if you are setting DNS server per module in the Edge deployment.
‼ production readiness: certificates - Warning
    The Edge device is using self-signed automatically-generated development certificates.
    They will expire in 45 days (at 2021-11-27 13:07:01 UTC) causing module-to-module and downstream device communication to fail on an active deployment.
    After the certs have expired, restarting the IoT Edge daemon will trigger it to generate new development certs.
    Please consider using production certificates instead. See https://aka.ms/iotedge-prod-checklist-certs for best practices.
√ production readiness: container engine - OK
‼ production readiness: logs policy - Warning
    Container engine is not configured to rotate module logs which may cause it run out of disk space.
    Please see https://aka.ms/iotedge-prod-checklist-logs for best practices.
    You can ignore this warning if you are setting log policy per module in the Edge deployment.
‼ production readiness: Edge Agent's storage directory is persisted on the host filesystem - Warning
    The edgeAgent module is not configured to persist its /tmp/edgeAgent directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.
‼ production readiness: Edge Hub's storage directory is persisted on the host filesystem - Warning
    The edgeHub module is not configured to persist its /tmp/edgeHub directory on the host filesystem.
    Data might be lost if the module is deleted or updated.
    Please see https://aka.ms/iotedge-storage-host for best practices.

Connectivity checks
-------------------
√ host can connect to and perform TLS handshake with IoT Hub AMQP port - OK
√ host can connect to and perform TLS handshake with IoT Hub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with IoT Hub MQTT port - OK
√ container on the default network can connect to IoT Hub AMQP port - OK
√ container on the default network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the default network can connect to IoT Hub MQTT port - OK
√ container on the IoT Edge module network can connect to IoT Hub AMQP port - OK
√ container on the IoT Edge module network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to IoT Hub MQTT port - OK

17 check(s) succeeded.
6 check(s) raised warnings. Re-run with --verbose for more details.

Device Information

Host OS: Ubuntu 18.04 and Rasbian (happens on both) Architecture: amd64 and arm32: Container OS Linux containers Runtime Versions: aziot-edged 1.2.0 Edge Agent : 1.2 Edge Hub: 1.2 Docker/Moby : 20.10.6+azure (Also reproduced on iot edge 1.1)

Logs

IoT Edge Hub Logs:


<6> 2021-10-11 12:55:45.382 +00:00 [INF] - Entering periodic task to reauthenticate connected clients
<6> 2021-10-11 12:55:45.382 +00:00 [INF] - Unable to authenticate client IotHubTestDevice with cached service identity IotHubTestDevice (Found: False). Resyncing service identity...
<6> 2021-10-11 12:55:45.383 +00:00 [INF] - Refreshing service identity for IotHubTestDevice
<4> 2021-10-11 12:55:45.888 +00:00 [WRN] - Error while refreshing the service identity: IotHubTestDevice OnBehalfOf: TestGateway
Microsoft.Azure.Devices.Edge.Hub.Core.DeviceInvalidStateException: Device not in scope: [Unauthorized: Message: Error getting device scope result from upstream, HttpStatusCode: Unauthorized, Content: {"Message":"ErrorCode:IotHubUnauthorizedAccess;Unauthorized","ExceptionMessage":"Tracking ID:24e00d57b069431baf48e6f58f8b1989-G:0-TimeStamp:10/11/2021 12:55:45"}].
 ---> Microsoft.Azure.Devices.Edge.Hub.CloudProxy.DeviceScopeApiException: Message: Error getting device scope result from upstream, HttpStatusCode: Unauthorized, Content: {"Message":"ErrorCode:IotHubUnauthorizedAccess;Unauthorized","ExceptionMessage":"Tracking ID:24e00d57b069431baf48e6f58f8b1989-G:0-TimeStamp:10/11/2021 12:55:45"}
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.NestedDeviceScopeApiClient.GetIdentityOnBehalfOfInternalAsync(Uri uri, String deviceId, Option`1 moduleId, String onBehalfOfDevice) in /home/vsts/work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/NestedDeviceScopeApiClient.cs:line 191
   at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter[T](Task`1 task, TimeSpan timeout) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 128
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.NestedDeviceScopeApiClient.GetIdentityOnBehalfOfWithRetry(Uri uri, String deviceId, Option`1 moduleId, String onBehalfOfDevice) in /home/vsts/work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/NestedDeviceScopeApiClient.cs:line 140
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ServiceProxy.GetServiceIdentity(String deviceId, String onBehalfOfDevice) in /home/vsts/work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ServiceProxy.cs:line 101
   --- End of inner exception stack trace ---
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.ServiceProxy.GetServiceIdentity(String deviceId, String onBehalfOfDevice) in /home/vsts/work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/ServiceProxy.cs:line 101
   at Microsoft.Azure.Devices.Edge.Hub.Core.DeviceScopeIdentitiesCache.RefreshServiceIdentityInternal(String refreshTarget, String onBehalfOfDevice, Boolean invokeServiceIdentitiesUpdated) in /home/vsts/work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.Core/DeviceScopeIdentitiesCache.cs:line 201
<6> 2021-10-11 12:55:45.891 +00:00 [INF] - Removed node: IotHubTestDevice
edge-agent logs

<6> 2021-10-12 12:35:53.568 +00:00 [INF] - Scraped and Stored Metrics
<6> 2021-10-12 12:35:53.568 +00:00 [INF] - Successfully completed periodic operation Metrics Scrape
<6> 2021-10-12 13:35:50.391 +00:00 [INF] - Starting periodic operation refresh twin config...
<6> 2021-10-12 13:35:50.435 +00:00 [INF] - Obtained Edge agent twin from IoTHub with desired properties version 11 and reported properties version 54.
<6> 2021-10-12 13:35:50.441 +00:00 [INF] - Successfully completed periodic operation refresh twin config
<6> 2021-10-12 13:35:53.563 +00:00 [INF] - Starting periodic operation Metrics Scrape...
<6> 2021-10-12 13:35:53.563 +00:00 [INF] - Scraping Metrics
<6> 2021-10-12 13:35:53.563 +00:00 [INF] - Scraping endpoint http://edgeHub:9600/metrics
<6> 2021-10-12 13:35:53.567 +00:00 [INF] - Scraping endpoint http://edgeAgent:9600/metrics
<6> 2021-10-12 13:35:53.631 +00:00 [INF] - Storing Metrics
<6> 2021-10-12 13:35:53.680 +00:00 [INF] - Scraped and Stored Metrics
<6> 2021-10-12 13:35:53.680 +00:00 [INF] - Successfully completed periodic operation Metrics Scrape
<6> 2021-10-12 14:35:49.066 +00:00 [INF] - Starting compaction of stores
<6> 2021-10-12 14:35:49.066 +00:00 [INF] - Starting compaction of store deploymentConfig
<6> 2021-10-12 14:35:49.067 +00:00 [INF] - Starting compaction of store Metrics
<6> 2021-10-12 14:35:49.067 +00:00 [INF] - Starting compaction of store moduleState
<6> 2021-10-12 14:35:49.067 +00:00 [INF] - Starting compaction of store default
<6> 2021-10-12 14:35:50.442 +00:00 [INF] - Starting periodic operation refresh twin config...
<6> 2021-10-12 14:35:50.482 +00:00 [INF] - Obtained Edge agent twin from IoTHub with desired properties version 11 and reported properties version 54.
<6> 2021-10-12 14:35:50.491 +00:00 [INF] - Successfully completed periodic operation refresh twin config
<6> 2021-10-12 14:35:53.683 +00:00 [INF] - Starting periodic operation Metrics Scrape...
<6> 2021-10-12 14:35:53.683 +00:00 [INF] - Scraping Metrics
<6> 2021-10-12 14:35:53.683 +00:00 [INF] - Scraping endpoint http://edgeHub:9600/metrics
<6> 2021-10-12 14:35:53.686 +00:00 [INF] - Scraping endpoint http://edgeAgent:9600/metrics
<6> 2021-10-12 14:35:53.743 +00:00 [INF] - Storing Metrics
<6> 2021-10-12 14:35:53.769 +00:00 [INF] - Scraped and Stored Metrics
<6> 2021-10-12 14:35:53.769 +00:00 [INF] - Successfully completed periodic operation Metrics Scrape
<6> 2021-10-12 15:35:50.501 +00:00 [INF] - Starting periodic operation refresh twin config...
<6> 2021-10-12 15:35:50.574 +00:00 [INF] - Obtained Edge agent twin from IoTHub with desired properties version 11 and reported properties version 54.
<6> 2021-10-12 15:35:50.579 +00:00 [INF] - Successfully completed periodic operation refresh twin config
<6> 2021-10-12 15:35:53.770 +00:00 [INF] - Starting periodic operation Metrics Scrape...
<6> 2021-10-12 15:35:53.770 +00:00 [INF] - Scraping Metrics
<6> 2021-10-12 15:35:53.770 +00:00 [INF] - Scraping endpoint http://edgeHub:9600/metrics
<6> 2021-10-12 15:35:53.777 +00:00 [INF] - Scraping endpoint http://edgeAgent:9600/metrics
<6> 2021-10-12 15:35:53.852 +00:00 [INF] - Storing Metrics
<6> 2021-10-12 15:35:53.881 +00:00 [INF] - Scraped and Stored Metrics
<6> 2021-10-12 15:35:53.881 +00:00 [INF] - Successfully completed periodic operation Metrics Scrape
edge-hub logs

   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnectionProvider.ConnectInternal(IIdentity identity, Action`2 connectionStatusChangedHandler) in /home/vsts/work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnectionProvider.cs:line 239
<4> 2021-10-12 06:15:10.329 +00:00 [WRN] - Updating reported properties failed IotHubTestDevice
<4> 2021-10-12 06:15:11.259 +00:00 [WRN] - Error creating cloud connection for client IotHubTestDevice
System.InvalidOperationException: No auth chain for the client identity: IotHubTestDevice
   at Microsoft.Azure.Devices.Edge.Util.Option`1.Expect[TException](Func`1 exception) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/Option.cs:line 117
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnectionProvider.ConnectInternal(IIdentity identity, Action`2 connectionStatusChangedHandler) in /home/vsts/work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnectionProvider.cs:line 239
<6> 2021-10-12 06:15:11.261 +00:00 [INF] - Error getting cloud connection for device IotHubTestDevice
System.InvalidOperationException: No auth chain for the client identity: IotHubTestDevice
   at Microsoft.Azure.Devices.Edge.Util.Option`1.Expect[TException](Func`1 exception) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/Option.cs:line 117
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnectionProvider.ConnectInternal(IIdentity identity, Action`2 connectionStatusChangedHandler) in /home/vsts/work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnectionProvider.cs:line 239
<4> 2021-10-12 06:15:14.607 +00:00 [WRN] - Error creating cloud connection for client IotHubTestDevice
System.InvalidOperationException: No auth chain for the client identity: IotHubTestDevice
   at Microsoft.Azure.Devices.Edge.Util.Option`1.Expect[TException](Func`1 exception) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/Option.cs:line 117
   at Microsoft.Azure.Devices.Edge.Hub.CloudProxy.CloudConnectionProvider.ConnectInternal(IIdentity identity, Action`2 connectionStatusChangedHandler) in /home/vsts/work/1/s/edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.CloudProxy/CloudConnectionProvider.cs:line 239
<6> 2021-10-12 06:15:14.608 +00:00 [INF] - Error getting cloud connection for device IotHubTestDevice
System.InvalidOperationException: No auth chain for the client identity: IotHubTestDevice
   at Microsoft.Azure.Devices.Edge.Util.Option`1.Expect[TException](Func`1 exception) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/Option.cs:line 117

Additional Information

More information can be found in the original issue #4999.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 3
  • Comments: 15 (6 by maintainers)

Most upvoted comments

Hi @Mandur, sorry for the late answer.

I am reading the other thread and see this:

“Yes it is a leaf device (so a normal IoT Hub device) that is NOT a children of any IoT Edge (this is very important)”

Just to be sure: you create an edge device e.g. device_edge. And then create a leaf device called device_leaf. You create “device_leaf” the way that you don’t set a parent for it, especially not “device_edge”. Then you make “device_leaf” connected to “device_edge” which works for an hour and then you see the authentication errors and messages sent by “device_leaf” gets dropped. Is this the right summary?