syne-tune: SageMaker ResourceLimitExceeded
Hi, I have a limit of 8 ml.g5.12xlarge instances, and although I set Tuner.n_workers = 5 I still got a ResourceLimitExceeded error. When using SageMakerBackend, is there a way to make sure that jobs are fully stopped before new ones are launched?
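For reference, the setup looks roughly like this (a minimal sketch with a placeholder estimator and config space; argument names such as trial_backend and sm_estimator may differ between Syne Tune versions):

from sagemaker.pytorch import PyTorch
from syne_tune import Tuner, StoppingCriterion
from syne_tune.backend import SageMakerBackend
from syne_tune.config_space import loguniform
from syne_tune.optimizer.baselines import RandomSearch

# Training job definition; the instance type is the one with a quota of 8.
estimator = PyTorch(
    entry_point="train.py",
    instance_type="ml.g5.12xlarge",
    instance_count=1,
    role="arn:aws:iam::...:role/...",  # placeholder role
    framework_version="1.12",
    py_version="py38",
)

config_space = {"lr": loguniform(1e-5, 1e-1), "epochs": 10}

tuner = Tuner(
    trial_backend=SageMakerBackend(sm_estimator=estimator, metrics_names=["loss"]),
    scheduler=RandomSearch(config_space, metric="loss"),
    stop_criterion=StoppingCriterion(max_wallclock_time=3 * 3600),
    n_workers=5,  # still hit ResourceLimitExceeded despite the quota of 8
)
tuner.run()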
Also, when using RemoteLauncher, in situations where the management instance does error out (for example due to ResourceLimitExceeded), is there a way to make sure the management instance sends a stop signal to all tuning jobs before exiting? Maybe something like:
try:
    # manage tuning jobs
    ...
except Exception:
    # re-raise the error
    raise
finally:
    # stop any trials still running
    ...
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (5 by maintainers)
Commits related to this issue
- Back-end reports on number of busy workers. Fixes issue #250 — committed to awslabs/syne-tune by mseeger 2 years ago
- Refactor backend (#389) * Back-end reports on number of busy workers. Fixes issue #250 * Fix — committed to awslabs/syne-tune by mseeger 2 years ago
Regarding your first point, I agree that it would be good to have this functionality, probably as an option, since I would assume that most of the time users do not want to wait for the instance to be released, which can take several minutes.
Regarding the second point, this behavior should already be implemented: the tuner should stop all jobs before exiting, even when an error occurs, see [here].
This is solved by #389
This issue remains a problem due to long stop times for SageMaker training jobs. For our use case here, this will likely be overcome by SageMaker warm pools, which we will support shortly. For this reason, I am closing this issue for now, but we are aware of it, and our goal is to make sure that you can use just as many workers with the SageMaker back-end as you have quota to run.
Hi David, we may be able to do something about this. The issue is that once the scheduler returns STOP, a trial is marked as stopped (which is the right thing to do, because the scheduler assumes that), but then of course it may take some time for the backend to really get the resource back.
I think it would be hard for the Tuner to figure out every time how many resources (for new trials) are really available. But given that the backend's start_trial (or, more precisely, _schedule) returns a status indicating whether the new trial could really be scheduled, I think it is not hard to work out a clean solution. This would include a suitable timeout for the case where a user really does not have the required quotas, so that certain trials never get started.
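A rough sketch of what I have in mind (names such as busy_trial_ids and the exact timeout handling are illustrative, not the final API):

import time

def wait_for_free_worker(backend, n_workers, timeout=30 * 60, poll_interval=30):
    """Block until the backend reports fewer busy workers than n_workers.

    busy_trial_ids is an illustrative method name: the idea is that the
    backend reports which trials still hold a SageMaker training instance,
    even if the scheduler has already marked them as stopped.
    """
    deadline = time.time() + timeout
    while len(backend.busy_trial_ids()) >= n_workers:
        if time.time() > deadline:
            # The user does not have the quota to run this many workers:
            # give up instead of retrying forever.
            raise TimeoutError("No worker became free; check your instance quota")
        time.sleep(poll_interval)

# In the tuner loop, a new configuration would only be requested from the
# scheduler once a worker (i.e. a SageMaker instance) is really available:
#   wait_for_free_worker(trial_backend, n_workers=5)
#   trial_backend.start_trial(config)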