iotedge: DeviceNotFoundException occuring at set intervals when using method invocations between modules

Expected Behavior

When module1 invokes module2 it should always succeed when module2 is running.

Current Behavior

Every set interval a DeviceNotFoundException is thrown when module1 is trying to do a method invocation at module2. This set interval depends on the SasTokenTimeToLive setting which is provided as a second argument for the ModuleClient.CreateFromEnvironmentAsync method.

Steps to Reproduce

Provide a detailed set of steps to reproduce the bug.

  1. Setup an IoT edge environment in Visual Studio that has two C# modules.
  2. Create an infinite while-loop that invokes a method at the other running module
  3. Setup your module client by using ModuleClient.CreateFromEnvironmentAsync(settings, new() { SasTokenTimeToLive = TimeSpan.FromSeconds(10) }
  4. Keep running for ~half an hour
  5. (For me at 10 minutes) Note how a DeviceNotFoundException is suddenly thrown while no changes were made.

Context (Environment)

The IoT edge modules were C# modules which ran .NET 6 and used Microsoft.Azure.Devices.Client version 1.41.3.

I’ve ran this test in a development environment using Visual Studio 2022 and the Azure Iot Edge extensions for visual studio 2022. In here the containers ran on Linux containers.

This issue also occurs in our production environment which are Windows machines running windows containers, otherwise using the same setup (C# modules and Microsoft.Azure.Devices.Client version 1.41.3)

Output of iotedge check

iotedge check Configuration checks -------------------- √ config.yaml is well-formed - OK √ config.yaml has well-formed connection string - OK √ container engine is installed and functional - OK √ Windows host version is supported - OK √ config.yaml has correct hostname - OK √ config.yaml has correct URIs for daemon mgmt endpoint - OK √ latest security daemon - OK √ host time is close to real time - OK √ container time is close to host time - OK √ DNS server - OK ‼ production readiness: certificates - Warning The Edge device is using self-signed automatically-generated development certificates. They will expire in 21 days (at 2022-12-28 15:11:07 UTC) causing module-to-module and downstream device communication to fail on an active deployment. After the certs have expired, restarting the IoT Edge daemon will trigger it to generate new development certs. Please consider using production certificates instead. See https://aka.ms/iotedge-prod-checklist-certs for best practices. √ production readiness: container engine - OK √ production readiness: logs policy - OK ‼ production readiness: Edge Agent's storage directory is persisted on the host filesystem - Warning The edgeAgent module is not configured to persist its C:\radEdge\edgeAgent directory on the host filesystem. Data might be lost if the module is deleted or updated. Please see https://aka.ms/iotedge-storage-host for best practices. ‼ production readiness: Edge Hub's storage directory is persisted on the host filesystem - Warning The edgeHub module is not configured to persist its C:\radEdge\edgeHub directory on the host filesystem. Data might be lost if the module is deleted or updated. Please see https://aka.ms/iotedge-storage-host for best practices.

Connectivity checks

√ host can connect to and perform TLS handshake with DPS endpoint - OK √ host can connect to and perform TLS handshake with IoT Hub AMQP port - OK √ host can connect to and perform TLS handshake with IoT Hub HTTPS / WebSockets port - OK √ host can connect to and perform TLS handshake with IoT Hub MQTT port - OK √ container on the IoT Edge module network can connect to IoT Hub AMQP port - OK √ container on the IoT Edge module network can connect to IoT Hub HTTPS / WebSockets port - OK √ container on the IoT Edge module network can connect to IoT Hub MQTT port - OK

19 check(s) succeeded. 3 check(s) raised warnings. Re-run with --verbose for more details.

Device Information

  • Host OS [e.g. Ubuntu 18.04, Windows Server IoT 2019]: Windows Server IoT 2019
  • Architecture [e.g. amd64, arm32, arm64]: Windows (amd64)
  • Container OS [e.g. Linux containers, Windows containers]: Both

Runtime Versions

  • aziot-edged [run iotedge version]: iotedge 1.1.15 (01685778d8320f126f2daf2f86a78a3d9d4b2f2c)
  • Edge Agent [image tag (e.g. 1.0.0)]: 1.1.15
  • Edge Hub [image tag (e.g. 1.0.0)]: 1.1.15
  • Docker/Moby [run docker version]: 19.03.12+azure

Note: when using Windows containers on Windows, run docker -H npipe:////./pipe/iotedge_moby_engine version instead

Logs

I was unable to extract any logs using the iotedge support-bundle command.

Additional Information

None as of yet.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 17 (8 by maintainers)

Most upvoted comments

@brotherneeru thanks again for all your assistance. I was able to verify that using the AMQP resolves this issue.

You might want to update the Iot Edge docs troubleshoot docs with this difference in behavior to help future people who run into the same problem

In addition, enabling debug logs in edgeHub or even capturing the sdk logs in your module could help us get to the bottom of the problem

I am unable to do this right now, because I am missing a key member of our team today. I will get back to you for this next monday.

Hopefully though, you will find the example above clear enough to reproduce the issue.

Yes, it should. I would presume if there was some issue in edgehub id see it occuring with all of the other modules you use… I am attempting to reproduce this issue as well. If you could share the module code you used in your reproduction using the simulator that would be great

See the attachment below. You can run this solution and you will see after 10 minutes that you will get a DeviceNotFoundException. At this moment module 2 will log something including the delay it took for this to occur.

Note how the ModuleClient instances are configured with a very small SasTokenTimeToLive. We actually use the default settings in our production environment. If we do this with this exact setup, the same will happen again, but the intervals are longer (approx. 60 minutes).

edit: I will get back to you with additional logs from the edgeHub and sdk logs.

ReproducingDeviceNotFoundException.zip

@sandervanteinde And just to make sure, does iotedge logs edgeHub as well as iotedge logs edgeAgent fail to generate their logs?

It does not, I can access those logs. Would you like to see those?

I could also try to fetch the support bundle from one of our other devices, maybe it works there (it also faces the same issue)