amazon-ssm-agent: ssm agent failing to start after reboot
We are applying patches to our Windows instances using the Patch Manager capability of AWS Systems Manager. We have a patch baseline that is applied to a set of Windows instances (each of which is part of a patch group) by a maintenance window, which in turn executes a run command (AWS-RunPatchBaseline) against each of the instances. However, we are finding the following:
- The instances in question seem to get patches installed correctly. Executing wmic qfe list shows that the patches have been installed on the target machines.
- The target instances are then rebooted after the patches are installed.
- The run command remains in progress indefinitely.
After further investigation we found that the amazon-ssm-agent failed to start when the instances were rebooted. The event logs show that a timeout occurred:
Get-WinEvent -ProviderName 'Service Control Manager'
Output:
09/11/2020 14:25:56 7000 Error The AmazonSSMAgent service failed to start due to the following error: …
09/11/2020 14:25:56 7009 Error A timeout was reached (30000 milliseconds) while waiting for the AmazonSSMAgent service to connect.
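To narrow the log down to just these two events, a filtered query along the following lines may help. This is a minimal sketch, not something taken from the thread: the one-day look-back window and the `*SSM*Agent*` message wildcard are assumptions, chosen to match the service name shown in the output above.

```powershell
# Sketch: list Service Control Manager events 7000 and 7009 that mention the SSM agent.
# The look-back window and the message wildcard are illustrative assumptions.
$filter = @{
    LogName      = 'System'
    ProviderName = 'Service Control Manager'
    Id           = 7000, 7009
    StartTime    = (Get-Date).AddDays(-1)
}

Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue |
    Where-Object { $_.Message -like '*SSM*Agent*' } |
    Select-Object TimeCreated, Id, LevelDisplayName, Message |
    Format-List

# Show the current state and startup type of the agent service.
Get-Service -Name AmazonSSMAgent | Select-Object Name, Status, StartType
```

Running this right after a patch reboot shows whether the service stayed stopped and whether the timeout events line up with the boot time.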
Once we manually restarted the amazon-ssm-agent, the run command completed successfully. The issue is that we do not want to have to manually start the amazon-ssm-agent on each instance, especially as we have a lot of instances. Since a manual restart works, this suggests that it is not an issue with persistent routes either, and I have just double-checked:
Instance IP 10.1.3.217
Persistent Routes:
Network Address Netmask Gateway Address Metric
169.254.169.254 255.255.255.255 10.1.3.1 15
169.254.169.250 255.255.255.255 10.1.3.1 15
169.254.169.251 255.255.255.255 10.1.3.1 15
169.254.169.249 255.255.255.255 10.1.3.1 15
169.254.169.123 255.255.255.255 10.1.3.1 15
169.254.169.253 255.255.255.255 10.1.3.1 15
Any ideas on what is causing this, i.e. why is the amazon-ssm-agent not starting up successfully after an automatic reboot?
About this issue
- State: open
- Created 4 years ago
- Reactions: 3
- Comments: 18 (3 by maintainers)
Someone appears to have a solution https://github.com/shirou/gopsutil/issues/570
@Thor-Bjorgvinsson @Praba-N gotta poke someone at AWS to look into it.
Also looks like it was potentially fixed in https://github.com/aws/amazon-ssm-agent/commit/12d1ec4ae31951314ff03c8b4c12866e7321ba30
but I don’t know crap about golang…
Also here is why AWS will likely not be able to solve this issue => https://github.com/golang/go/issues/23479
@noelmcgrath I went down this road with AWS SSM engineering through support, but ultimately doing 2 things helped solve this issue.
Now why does this happen? I am not 100% sure, but I know it’s related to the fun CPU stuff Windows does on startup and golang’s Windows service package startup behavior. Why do I suspect golang? Well, Datadog’s Go-based agent was failing the exact same way: same timeout message and the same failure to start on EC2 instances.
I would make sure your agents are up to date with the latest version, since they appear to have addressed some of the agent halting issues, possibly by updating the golang win svc package to latest?
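For anyone hitting the same 30000 millisecond timeout, the sketch below shows generic Windows-side mitigations that are often tried for event 7009. These are assumptions and not the two steps referred to above, which are not listed in this thread: the 60000 ms value is illustrative, and whether delayed start or a longer timeout helps in a given environment is untested here. The 30000 ms in the event message corresponds to the default service start timeout, which the `ServicesPipeTimeout` registry value overrides after a reboot.

```powershell
# Sketch only: run from an elevated PowerShell session. Values are illustrative assumptions.
$serviceName = 'AmazonSSMAgent'

# 1. Start the agent after the initial boot-time load has settled (delayed automatic start).
sc.exe config $serviceName start= delayed-auto

# 2. Raise the Service Control Manager start timeout from its 30000 ms default.
#    Takes effect after the next reboot and applies to all services on the machine.
$controlKey = 'HKLM:\SYSTEM\CurrentControlSet\Control'
New-ItemProperty -Path $controlKey -Name 'ServicesPipeTimeout' `
    -PropertyType DWord -Value 60000 -Force | Out-Null

# 3. Fallback check (e.g. from a startup scheduled task): start the agent if it is not running.
if ((Get-Service -Name $serviceName).Status -ne 'Running') {
    Start-Service -Name $serviceName
}
```

Step 2 changes a machine-wide setting, so it is worth trying step 1 and an agent upgrade first and only raising the timeout if the event still appears.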