AutoSpotting: Investigate and document memory leaks and resource requirements for large installations

Github issue

Reported by @symonds on Gitter

Issue type

  • Bug
  • Documentation fix

Build number

nightly-468

Configuration

any

Environment

Large AWS accounts, with > 500 instances per region

Summary

On large AWS accounts, where API throttling is quite common, the Lambda function has been found to hang and its runs to time out.

Steps to reproduce

  • Enable AutoSpotting on such a large account
  • Trigger API throttling by performing many API calls

Expected results

AutoSpotting should gracefully handle this situation
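
One way that graceful handling might look, sketched here as an assumption rather than AutoSpotting’s actual setup: bound the aws-sdk-go retry budget and HTTP timeout so throttled calls (RequestLimitExceeded) fail fast and can be reported, instead of backing off until the Lambda deadline. The package name, helper name, and values are illustrative.

```go
// Package sketch: a hedged example, not AutoSpotting's actual client setup.
package sketch

import (
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// NewEC2Client builds an EC2 client with a bounded retry budget and an HTTP
// timeout, so calls hit by RequestLimitExceeded throttling fail fast instead
// of retrying and backing off until the Lambda deadline.
func NewEC2Client(region string) *ec2.EC2 {
	cfg := aws.NewConfig().
		WithRegion(region).
		WithMaxRetries(5). // cap the SDK's built-in retries
		WithHTTPClient(&http.Client{Timeout: 30 * time.Second})

	sess := session.Must(session.NewSession())
	return ec2.New(sess, cfg)
}
```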

Actual results

  • The Lambda run timed out; it appeared to hang after logging “Processing page 1 of DescribeInstancesPages” (a bounded-pagination sketch follows this list)
  • The Lambda was running out of its allocated memory
  • Spot Instances were launched but not tagged properly, so they failed to be added to the group
  • Increasing the memory allocation seemed to help; a 3 GB Lambda seemed to work fine.
  • Runs that didn’t time out only consumed ~200 MB of memory, so this seems related to the throttling.
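
For illustration only (this is not AutoSpotting’s code): a bounded version of the DescribeInstances pagination mentioned above, using aws-sdk-go’s DescribeInstancesPagesWithContext with a context deadline so a throttled or stuck page fetch returns an error well before the Lambda’s own timeout. The helper name, package name, and 4-minute budget are assumptions.

```go
package sketch

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/service/ec2"
)

// DescribeAllInstances walks the DescribeInstances pages under a deadline, so
// a throttled or hanging page fetch returns an error instead of stalling the
// whole run. The 4-minute budget assumes a 300 s Lambda timeout.
func DescribeAllInstances(svc *ec2.EC2) ([]*ec2.Instance, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 4*time.Minute)
	defer cancel()

	var instances []*ec2.Instance
	page := 0
	err := svc.DescribeInstancesPagesWithContext(ctx, &ec2.DescribeInstancesInput{},
		func(out *ec2.DescribeInstancesOutput, lastPage bool) bool {
			page++
			log.Printf("Processing page %d of DescribeInstancesPages", page)
			for _, r := range out.Reservations {
				instances = append(instances, r.Instances...)
			}
			return true // keep paginating until the last page or the deadline hits
		})
	return instances, err
}
```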

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 18 (4 by maintainers)

Most upvoted comments

Since I think I’ve hit everything in the list so far, I wanted to link my latest issue, and the only one I haven’t been able to work around so far: https://github.com/cristim/autospotting/issues/199

By the way, I do plan on documenting what I experienced so that others can work around it. I very much appreciate all the work you guys have done with this solution, and I’m hoping I can get it fully implemented.

@symonds BTW, I’d like to hear about how people use AutoSpotting in the wild; please fill in this survey if you have some time: https://www.surveymonkey.de/r/5GYJSJJ

I think this issue can be closed now.

I’m running AutoSpotting against 200 ASGs, which all scale up from 0 at the same time every day, replacing every single instance with AutoSpotting. Max memory 1024 MB, max runtime 300 s.

No errors in months, with a max duration of 42 s.

Retested this week, using tag filtering to limit it to an environment of 50 ASGs, in an account with 350 instances and 200 ASGs.

Still getting the odd 300 s run, sometimes consuming all the allocated memory, sometimes not. The good news was that no untagged instances were left orphaned.

At one point it created 6 identical instances (one by one), all tagged, but never attached them to an ASG. This happened when it was only trying to replace a single on-demand instance (the instances came up as unhealthy due to GitHub issues).

I was thinking a potentially good way to solve this would be some kind of micro-service approach. I know it’s a buzzword, but we could have two Lambda functions:

  1. One that would scan the ASGs, their instances, and the spot instances, and send the result as a JSON event
  2. One that would process the JSON event and, if need be, apply modifications to them

The “planner/fetcher” and the “worker” would have significantly different logic and data to work with, and we could run multiple workers in parallel at the Lambda function level instead of using goroutines. That would make the code simpler while still running in parallel, and potentially also easier to test with events.
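
To make that split concrete, here is a rough sketch of what the “worker” side could look like, assuming the planner emits one small JSON event per ASG. The event fields, the function names, and the use of aws-lambda-go are hypothetical, not an existing AutoSpotting interface.

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-lambda-go/lambda"
)

// ReplacementEvent is a hypothetical payload the "planner/fetcher" Lambda
// could emit for each ASG it wants acted upon.
type ReplacementEvent struct {
	Region           string   `json:"region"`
	AutoScalingGroup string   `json:"autoScalingGroup"`
	OnDemandInstance string   `json:"onDemandInstanceID"`
	SpotCandidates   []string `json:"spotCandidateTypes"`
}

// HandleReplacement is the "worker" entry point: each invocation handles one
// small event instead of one big run that scans every group in the account.
func HandleReplacement(ctx context.Context, ev ReplacementEvent) error {
	log.Printf("replacing %s in ASG %s (%s)", ev.OnDemandInstance, ev.AutoScalingGroup, ev.Region)
	// ...launch the spot instance, tag it, and attach it to the ASG...
	return nil
}

func main() {
	lambda.Start(HandleReplacement)
}
```

Fanning the events out through something like SNS or SQS could also give per-ASG retries and isolation, though that is just one possible design choice.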

If the infrastructure was really big, we could even consider splitting that per region, although it would mean more code to maintain.