iotedge: Unable to recover from `Previously failed in deployment xx in prior attempt. Not running command`

Expected Behavior

EdgeAgent should recover and apply the new deployment regardless of how many attempts have failed.

Current Behavior

EdgeAgent fails x times, then refuses to reattempt. See the logs from EdgeAgent:

<6> 2023-03-27 07:26:37.583 +00:00 [INF] - Plan execution ended for deployment 18
<6> 2023-03-27 07:26:45.493 +00:00 [INF] - Plan execution started for deployment 18
<6> 2023-03-27 07:26:45.493 +00:00 [INF] - Executing command: "Prepare module proxy"
<6> 2023-03-27 07:26:57.150 +00:00 [INF] - Executing command: "Prepare module status-dashboard"
<3> 2023-03-27 07:27:06.344 +00:00 [ERR] - Step previously failed in deployment 18 on prior attempt. Not running command Prepare module api. Skipping remaining commands in deployment.
<6> 2023-03-27 07:27:06.344 +00:00 [INF] - Plan execution ended for deployment 18

Steps to Reproduce

  1. On a slow internet connection, some of the modules in the new deployment fail to download, probably due to timeouts.
  2. After an unknown number of failed attempts, EdgeAgent refuses to retry downloading the affected container.
  3. EdgeAgent must be restarted before it will retry downloading the new deployment.
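
The restart workaround in step 3 could be automated along the following lines. This is only a minimal sketch, not an official fix: the function names `find_skipped_steps` and `restart_edge_agent_if_stuck` are hypothetical, the regular expression is derived solely from the single log line quoted above, and the log text is assumed to have been captured separately (for example from `iotedge logs edgeAgent`). The restart uses the `iotedge restart` CLI command and will likely need elevated privileges on the device.

```python
import re
import subprocess

# Pattern derived from the one EdgeAgent error line quoted in this issue.
# Assumption: other runtime versions may word this message differently.
SKIP_PATTERN = re.compile(
    r"Step previously failed in deployment (\d+) on prior attempt\. "
    r"Not running command (.+?)\. Skipping remaining commands"
)


def find_skipped_steps(log_text):
    """Return (deployment_id, command) pairs for every skipped step in the log."""
    return [(int(m.group(1)), m.group(2)) for m in SKIP_PATTERN.finditer(log_text)]


def restart_edge_agent_if_stuck(log_text, dry_run=True):
    """Restart edgeAgent when the log shows it is skipping a failed step.

    Returns True if a skipped step was found. With dry_run=True (the default)
    nothing is restarted, so the detection can be checked safely first.
    """
    skipped = find_skipped_steps(log_text)
    if skipped and not dry_run:
        # Hypothetical automation of the manual workaround from step 3.
        subprocess.run(["iotedge", "restart", "edgeAgent"], check=True)
    return bool(skipped)
```

Running this periodically with `dry_run=False` would restart EdgeAgent each time it gets stuck, which only hides the underlying retry limit rather than fixing it.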

Context (Environment)

  • Low bandwidth, high ping environment using satellite internet

  • ModuleUpdateMode = WaitForAllPulls

Device Information

  • Host OS: Ubuntu 20.04
  • Architecture: amd64
  • Container OS: Linux

Runtime Versions

  • aziot-edged [run iotedge version]: 1.1.15
  • Edge Agent [image tag (e.g. 1.0.0)]: 1.4.6
  • Edge Hub [image tag (e.g. 1.0.0)]: 1.4.6
  • Docker/Moby [run docker version]: 20.10.17+azure-1

About this issue

  • State: closed
  • Created a year ago
  • Comments: 16 (9 by maintainers)

Most upvoted comments

We have created action items to be reviewed based on our priorities for this issue. So closing this issue now.

I don't think it is possible to retry forever as per design. I understand your concern. We are backward compatible with 1.1+ but full support is limited. So I would recommend moving to 1.4 LTS

Thank you, I trust that you will bring this bug to whoever can implement a better solution for it 👍🏼

We should advise updating the IoT Edge runtime to 1.4 to match the containers, right?

Yes, or if they cannot get the runtime to update, they can still change the Edge Agent images, right?