amazon-ssm-agent: ssm agent failing to start after reboot

We are applying patches to our Windows instances using the Patch Manager feature of AWS Systems Manager. We have a patch baseline that is applied to a set of Windows instances (each of which is part of a patch group) by a maintenance window, which in turn executes a run command (AWS-RunPatchBaseline) against each of the instances. However, we are finding the following:

  • The instances in question seem to get the patches installed correctly. Executing wmic qfe list shows that the patches have been installed on the target machines.
  • The target instances are then rebooted after the patches are installed.
  • The run command remains in progress indefinitely.

On further investigation we found that the amazon-ssm-agent failed to start when the instances were rebooted. The event logs show that a timeout occurred:

Get-WinEvent -ProviderName 'Service Control Manager'

Output:

09/11/2020 14:25:56           7000 Error            The AmazonSSMAgent service failed to start due to the following error: …
09/11/2020 14:25:56           7009 Error            A timeout was reached (30000 milliseconds) while waiting for the AmazonSSMAgent service to connect.
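To narrow the log down to just these two entries, a filtered query along the following lines may help. This is only a sketch: it assumes the events are written to the System log and that the service name AmazonSSMAgent appears in the message text, as it does in the output above.

```powershell
# Query only Service Control Manager events 7000 (service failed to start)
# and 7009 (timeout waiting for the service to connect).
# Assumption: these events are recorded in the System log.
$events = Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Service Control Manager'
    Id           = 7000, 7009
} -ErrorAction SilentlyContinue   # Get-WinEvent raises an error when nothing matches

# Keep only the entries that mention the SSM agent service.
$events |
    Where-Object { $_.Message -like '*AmazonSSMAgent*' } |
    Select-Object TimeCreated, Id, LevelDisplayName, Message |
    Format-Table -AutoSize -Wrap
```

Filtering with -FilterHashtable happens in the event log service itself, so it is usually much faster than piping the whole provider output through Where-Object.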

Once we manually restarted the amazon-ssm-agent, the run command completed successfully. The issue is that we do not want to have to manually start the amazon-ssm-agent on each instance, especially as we have a lot of instances. This suggests that it is not an issue with Persistent Routes either, and I have just double-checked:

Instance IP 10.1.3.217

Persistent Routes:

Network Address          Netmask  Gateway Address  Metric
169.254.169.254  255.255.255.255         10.1.3.1      15
169.254.169.250  255.255.255.255         10.1.3.1      15
169.254.169.251  255.255.255.255         10.1.3.1      15
169.254.169.249  255.255.255.255         10.1.3.1      15
169.254.169.123  255.255.255.255         10.1.3.1      15
169.254.169.253  255.255.255.255         10.1.3.1      15

Any ideas on what is causing this, i.e. why is the amazon-ssm-agent not starting up successfully after an automatic reboot?

About this issue

  • State: open
  • Created 4 years ago
  • Reactions: 3
  • Comments: 18 (3 by maintainers)

Most upvoted comments

Someone appears to have a solution https://github.com/shirou/gopsutil/issues/570

@Thor-Bjorgvinsson @Praba-N gotta poke someone at AWS to look into it.

Also looks like it was potentially fixed in https://github.com/aws/amazon-ssm-agent/commit/12d1ec4ae31951314ff03c8b4c12866e7321ba30

but I don’t know crap about golang…

Also here is why AWS will likely not be able to solve this issue => https://github.com/golang/go/issues/23479

@noelmcgrath I went down this road with AWS SSM engineering through support, but ultimately doing 2 things helped solve this issue.

  1. Increase the Windows service timeout from the 30s default to 60s.
  2. Set the Amazon SSM Agent service to automatic delayed start. (You could probably get away with just this if you wanted.)
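
The two steps above could be scripted roughly as follows. This is a sketch under stated assumptions, not a confirmed AWS recommendation: it assumes the "service timeout" refers to the ServicesPipeTimeout registry value (in milliseconds), and it takes the service name AmazonSSMAgent from the event log entries in the question. It needs an elevated PowerShell session.

```powershell
# Step 1: raise the Service Control Manager start timeout from the
# 30000 ms default to 60000 ms.
# Assumption: ServicesPipeTimeout is the setting meant by "service timeout".
$controlKey = 'HKLM:\SYSTEM\CurrentControlSet\Control'
New-ItemProperty -Path $controlKey -Name 'ServicesPipeTimeout' `
    -PropertyType DWord -Value 60000 -Force | Out-Null

# Step 2: switch the agent service to Automatic (Delayed Start).
# The space after "start=" is required by sc.exe.
sc.exe config AmazonSSMAgent start= delayed-auto

# Show the value that was written. The new timeout is only picked up
# at boot, so a reboot is needed before it takes effect.
Get-ItemProperty -Path $controlKey -Name 'ServicesPipeTimeout'
```

Because ServicesPipeTimeout is system-wide, it lengthens the start timeout for every service on the instance, not only the SSM agent. With many instances, the same script could be pushed out through a run command instead of being applied by hand.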

Now why does this happen? I am not 100% sure, but I know it’s related to the fun CPU stuff Windows does on startup and the startup behavior of golang’s Windows service package. Why do I suspect golang? Well, Datadog’s Go-based agent was failing the exact same way: same timeout message and the same failure to start on EC2 instances.

I would make sure your agents are on the latest version, since they appear to have addressed some of the agent halting issues, possibly by updating the golang Windows service package to the latest?