moby: Restarting Windows node results in HNS failed with error: Element not found

What’s the problem/feature request?

If a Windows node has a running Service connected to an overlay network and that node is restarted completely, the Service fails to restart with "HNS failed with error: Element not found". On further investigation, it appears that the overlay network cannot be re-created on the Windows node after the restart. The Engine logs from the node are as follows:

 7725 8/18/2017 9:09:33 PM HNSNetwork Request ={"Name":"sn1ugevc1q854fwzpykm60kwf","Type":"overlay","Subnets":[{"AddressPrefix":"10.0.3.0/24","GatewayAddress":"10.0.3.1","Policies":[{"Type":"VSID","VSID":4101}]}]}                                                                                            
 7726 8/18/2017 9:09:33 PM fatal task error [module=node/agent/taskmanager node.id=9v56djidhwy0ib848ees5f5re task.id=r54gbirkg7lkv0sviavtzhjyl service.id=ljitgqg02nj13kwuiqojqjtyp error=HNS failed with error : Element not found. ]                                                                           
 7727 8/18/2017 9:09:33 PM failed to deactivate service binding for container iis.1.w32r9obb9f8n30ju4d9qfyrt6 [module=node/agent node.id=9v56djidhwy0ib848ees5f5re error=No such container: iis.1.w32r9obb9f8n30ju4d9qfyrt6] 

The iis.1.w32r9obb9f8n30ju4d9qfyrt6 container was the Service container that was previously running prior to the restart.
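
For anyone checking whether a node is hitting the same failure, the log excerpt above can be scanned for the HNSNetwork request and the HNS error that follows it. This is only a minimal sketch, assuming the engine log has already been exported as text lines in the same shape as the three lines quoted above; `scan_engine_log` and both regular expressions are hypothetical helpers written for this illustration, not part of Docker or HNS.

```python
import json
import re

# Patterns derived from the log excerpt in this report (assumed format).
# Whitespace around "=" and ":" is matched loosely because it varies in the log.
HNS_REQUEST = re.compile(r"HNSNetwork Request\s*=\s*(?P<body>\{.*\})")
HNS_ERROR = re.compile(r"HNS failed with error\s*:\s*(?P<reason>[^.\]]+)")


def scan_engine_log(lines):
    """Return (requests, failures) found in an iterable of engine log lines."""
    requests, failures = [], []
    for line in lines:
        req = HNS_REQUEST.search(line)
        if req:
            try:
                network = json.loads(req.group("body"))
            except json.JSONDecodeError:
                network = None  # skip request bodies that are truncated in the log
            if network is not None:
                requests.append({
                    "name": network.get("Name"),
                    "type": network.get("Type"),
                    "subnets": [s.get("AddressPrefix")
                                for s in network.get("Subnets", [])],
                })
        err = HNS_ERROR.search(line)
        if err:
            failures.append(err.group("reason").strip())
    return requests, failures


if __name__ == "__main__":
    # The first two log lines from this report, copied verbatim.
    sample = [
        '7725 8/18/2017 9:09:33 PM HNSNetwork Request ={"Name":"sn1ugevc1q854fwzpykm60kwf","Type":"overlay","Subnets":[{"AddressPrefix":"10.0.3.0/24","GatewayAddress":"10.0.3.1","Policies":[{"Type":"VSID","VSID":4101}]}]}',
        '7726 8/18/2017 9:09:33 PM fatal task error [module=node/agent/taskmanager node.id=9v56djidhwy0ib848ees5f5re task.id=r54gbirkg7lkv0sviavtzhjyl service.id=ljitgqg02nj13kwuiqojqjtyp error=HNS failed with error : Element not found. ]',
    ]
    requests, failures = scan_engine_log(sample)
    print(requests)
    print(failures)
```

An overlay request immediately followed by an "Element not found" failure matches the pattern described in this report; it does not by itself identify the root cause.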

In one instance, the Windows VM was running in Azure, and the restart was not an in-guest OS restart but an Azure VM Restart (Deallocate). This and Azure VM Shutdown (Deallocate) are common ways to stop VM billing and save on cost.

Others have seen this after a proper restart as well.

Who is asking for this?

Multiple people within Docker, and at least one customer.

What is the (customer) Impact of this request or severity of the issue?

Customers cannot reliably restart Windows nodes and expect service continuity after the restart.

One customer’s effort to build a hybrid cluster is halted because of this.

What versions/components are affected?

  • UCP 2.2
  • Docker 17.06.1-ee-1
  • Windows Server 2016, 10.0.14393 Build 14393 (all latest patches applied)

Checklist

  • KB4015217 is installed. Verified using PowerShell: [System.Version]::Parse((Get-Item $env:SystemRoot\System32\HostNetSvc.dll).VersionInfo.ProductVersion)
  • Attempting to clean up the VMSwitch with Get-VMSwitch | Where-Object {$_.SwitchType -eq "External"} | Remove-VMSwitch, followed by Get-ContainerNetwork | Remove-ContainerNetwork, did not help.

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 32 (14 by maintainers)

Most upvoted comments

I’ve seen a few suggestions that a workaround is available. I’d like to try it as I am experiencing the problem in one of my Azure environments. Please share the workaround procedure.

We have a final fix now. We are planning to release the KB on 11/21.