AutoSpotting: Investigate and document memory leaks and resource requirements for large installations
GitHub issue
Reported by @symonds on Gitter
Issue type
- Bug
- Documentation fix
Build number
nightly-468
Configuration
any
Environment
Large AWS accounts, with > 500 instances per region
Summary
On large AWS accounts, where API throttling is quite common, the Lambda function has been found to hang and time out.
Steps to reproduce
- Enable it on such a large account
- Trigger API throttling by performing many API calls
Expected results
AutoSpotting should gracefully handle this situation
Actual results
- The Lambda run timed out; it seemed to hang after logging "Processing page 1 of DescribeInstancesPages" (see the sketch after this list)
- The Lambda was running out of its allocated memory
- Spot instances were launched but not tagged properly, so they failed to be added to the group
- Increasing the memory allocation seemed to help; a 3 GB Lambda seemed to work fine.
- Runs that didn't time out only consumed ~200 MB of memory, so this seems related to the throttling.
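To illustrate what handling this gracefully could look like, here is a minimal sketch (not the actual AutoSpotting code), assuming the aws-sdk-go v1 API: the paginated DescribeInstances call gets a retryer that backs off on throttling and a context deadline well below the Lambda timeout, so a heavily throttled run fails with a clear error instead of hanging until it is killed. The region and time budget below are illustrative.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/client"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// describeAllInstances pages through DescribeInstances with a time budget and a
// throttling-aware retryer, so a throttled run errors out instead of hanging.
func describeAllInstances(region string, budget time.Duration) ([]*ec2.Instance, error) {
	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String(region),
		// Back off exponentially on RequestLimitExceeded instead of retrying immediately.
		Retryer: client.DefaultRetryer{NumMaxRetries: 8},
	}))
	svc := ec2.New(sess)

	// Give up well before the Lambda itself would time out.
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()

	var instances []*ec2.Instance
	page := 0
	err := svc.DescribeInstancesPagesWithContext(ctx, &ec2.DescribeInstancesInput{},
		func(out *ec2.DescribeInstancesOutput, lastPage bool) bool {
			page++
			log.Printf("Processing page %d of DescribeInstancesPages", page)
			for _, r := range out.Reservations {
				instances = append(instances, r.Instances...)
			}
			return true // keep paginating until the last page or the deadline
		})
	return instances, err
}

func main() {
	// A 250 s budget leaves headroom under a 300 s Lambda timeout.
	instances, err := describeAllInstances("us-east-1", 250*time.Second)
	if err != nil {
		log.Fatalf("DescribeInstances failed: %v", err)
	}
	log.Printf("found %d instances", len(instances))
}
```

With a deadline like this, a throttled run surfaces a context deadline error that can be logged and picked up on the next scheduled invocation, rather than silently consuming the whole Lambda time and memory allocation.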
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 18 (4 by maintainers)
I think I've hit everything in the list so far. I wanted to link my latest issue, the only one I haven't been able to work around so far: https://github.com/cristim/autospotting/issues/199
By the way, I do plan on documenting what I experienced so that others can work around it. I very much appreciate all the work you guys have done with this solution and I'm hoping I can get it fully implemented.
@symonds BTW, I’d like to hear about how people use AutoSpotting in the wild, please fill this survey if you have some time: https://www.surveymonkey.de/r/5GYJSJJ
I think this issue can be closed now.
I'm running AutoSpotting against 200 ASGs, which all scale up from 0 at the same time every day, replacing every single instance with AutoSpotting. Max memory 1024 MB, max runtime 300 s.
No errors in months, with a max duration of 42 s.
Retested this week, using the tag filtering to limit it to an environment of 50 ASGs in an account with 350 instances and 200 ASGs.
Still getting the odd 300 s run, sometimes consuming all the allocated memory, sometimes not. The good news is that no untagged instances were left orphaned.
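For reference, a rough sketch of the kind of tag-based filtering mentioned above (the spot-enabled and environment tag names are assumptions, not necessarily the exact filter configuration used): only groups carrying the required tags get processed, which keeps the per-run API call volume low even in accounts with hundreds of ASGs.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// matchesTags reports whether the group carries every required key=value tag.
func matchesTags(g *autoscaling.Group, required map[string]string) bool {
	seen := map[string]string{}
	for _, t := range g.Tags {
		seen[aws.StringValue(t.Key)] = aws.StringValue(t.Value)
	}
	for k, v := range required {
		if seen[k] != v {
			return false
		}
	}
	return true
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	svc := autoscaling.New(sess)

	// Example filter: restrict the run to one environment's spot-enabled groups.
	required := map[string]string{"spot-enabled": "true", "environment": "staging"}

	var selected []string
	err := svc.DescribeAutoScalingGroupsPages(&autoscaling.DescribeAutoScalingGroupsInput{},
		func(out *autoscaling.DescribeAutoScalingGroupsOutput, lastPage bool) bool {
			for _, g := range out.AutoScalingGroups {
				if matchesTags(g, required) {
					selected = append(selected, aws.StringValue(g.AutoScalingGroupName))
				}
			}
			return true
		})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("would process %d ASGs: %v", len(selected), selected)
}
```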
At one point it created 6 identical instances (one by one), all tagged, but never attached them to an ASG. This happened when it was only trying to replace a single on-demand instance (the instances came up as unhealthy due to GitHub issues).
I was thinking a potentially good way to solve this would be a micro-service-style split. I know it's a buzzword, but we could have 2 Lambda functions: a "planner/fetcher" and a "worker".
The two would have significantly different logic and data to work with, and we could run multiple workers in parallel at the Lambda function level instead of using goroutines. That would make the code simpler while keeping it parallel, and potentially also easier to test with events.
If the infrastructure was really big, we could even consider splitting that per region as well, although that would mean more code to maintain.
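A rough sketch of that planner/worker split, assuming hypothetical function and payload names that are not part of AutoSpotting today: the planner enumerates the enabled ASGs and fires one asynchronous worker invocation per group, so each worker runs with its own memory, timeout, and API-call budget.

```go
package main

import (
	"encoding/json"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/lambda"
)

// workerEvent is the payload each worker receives; field names are illustrative.
type workerEvent struct {
	Region           string `json:"region"`
	AutoScalingGroup string `json:"asg"`
}

// dispatchWorkers fires one asynchronous worker Lambda invocation per ASG.
func dispatchWorkers(region string, asgNames []string) error {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String(region)}))
	svc := lambda.New(sess)

	for _, name := range asgNames {
		payload, err := json.Marshal(workerEvent{Region: region, AutoScalingGroup: name})
		if err != nil {
			return err
		}
		// "Event" means asynchronous invocation: the planner doesn't wait for the worker.
		_, err = svc.Invoke(&lambda.InvokeInput{
			FunctionName:   aws.String("autospotting-worker"), // hypothetical worker function name
			InvocationType: aws.String(lambda.InvocationTypeEvent),
			Payload:        payload,
		})
		if err != nil {
			return err
		}
		log.Printf("dispatched worker for ASG %s", name)
	}
	return nil
}

func main() {
	// The real planner would discover the spot-enabled ASGs itself; hardcoded here for brevity.
	if err := dispatchWorkers("us-east-1", []string{"web-asg", "worker-asg"}); err != nil {
		log.Fatal(err)
	}
}
```

Splitting per region, as suggested above, would simply mean the planner dispatches workers per region instead of (or in addition to) per group.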