syne-tune: SageMaker ResourceLimitExceeded
Hi, I have a limit of 8 ml.g5.12xlarge instances, and although I set Tuner.n_workers = 5 I still got a ResourceLimitExceeded error. When using SageMakerBackend, is there a way to make sure that jobs are fully stopped before new ones are launched?
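For reference, the setup looks roughly like this (a minimal sketch with a placeholder estimator and config space; argument names such as trial_backend and sm_estimator may differ between Syne Tune versions):

from sagemaker.pytorch import PyTorch
from syne_tune import Tuner, StoppingCriterion
from syne_tune.backend import SageMakerBackend
from syne_tune.config_space import loguniform
from syne_tune.optimizer.baselines import RandomSearch

# Training job definition; the instance type is the one with a quota of 8.
estimator = PyTorch(
    entry_point="train.py",
    instance_type="ml.g5.12xlarge",
    instance_count=1,
    role="arn:aws:iam::...:role/...",  # placeholder role
    framework_version="1.12",
    py_version="py38",
)

config_space = {"lr": loguniform(1e-5, 1e-1), "epochs": 10}

tuner = Tuner(
    trial_backend=SageMakerBackend(sm_estimator=estimator, metrics_names=["loss"]),
    scheduler=RandomSearch(config_space, metric="loss"),
    stop_criterion=StoppingCriterion(max_wallclock_time=3 * 3600),
    n_workers=5,  # still hit ResourceLimitExceeded despite the quota of 8
)
tuner.run()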
Also, when using RemoteLauncher, in situations where the management instance does error out (for example due to ResourceLimitExceeded), is there a way to make sure the management instance sends a stop signal to all tuning jobs before exiting? Maybe something like:
try:
    # manage tuning jobs
    ...
except Exception:
    # re-raise the error
    raise
finally:
    # stop any trials still running
    ...
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (5 by maintainers)
Commits related to this issue
- Back-end reports on number of busy workers. Fixes issue #250 — committed to awslabs/syne-tune by mseeger 2 years ago
- Refactor backend (#389) * Back-end reports on number of busy workers. Fixes issue #250 * Fix — committed to awslabs/syne-tune by mseeger 2 years ago
Regarding your first point, I agree that it would be good to have this functionality, probably as an option, since I would assume that most of the time users do not want to wait for the instance to be released, which can take several minutes.
Regarding the second point, this behavior should already be implemented: the tuner should stop all jobs before exiting, even when an error occurs, see [here].
This is solved by #389
This issue remains a problem due to long stop times for SageMaker training jobs. For our use case here, this will likely be overcome by SageMaker warm pools, which we will support shortly. For this reason, I am closing this issue for now, but we are aware of it, and our goal is to make sure that you can use just as many workers with the SageMaker back-end as you have quota to run.
Hi David, we may be able to do something about this. The issue is that once the scheduler returns STOP, a trial is marked as stopped (which is the right thing to do, because the scheduler assumes that), but then of course it may take some time for the backend to really get the resource back.
I think it would be hard for the Tuner to figure out every time how many resources (for new trials) are really available. But given that the backend's start_trial (or, more precisely, _schedule) returns a status indicating whether the new trial could really be scheduled, I think it is not hard to work out a clean solution. This would include a suitable timeout for the case where a user really does not have the required quotas, so that certain trials never get started.
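A rough sketch of what I have in mind (names such as busy_trial_ids and the exact timeout handling are illustrative, not the final API):

import time

def wait_for_free_worker(backend, n_workers, timeout=30 * 60, poll_interval=30):
    """Block until the backend reports fewer busy workers than n_workers.

    busy_trial_ids is an illustrative method name: the idea is that the
    backend reports which trials still hold a SageMaker training instance,
    even if the scheduler has already marked them as stopped.
    """
    deadline = time.time() + timeout
    while len(backend.busy_trial_ids()) >= n_workers:
        if time.time() > deadline:
            # The user does not have the quota to run this many workers:
            # give up instead of retrying forever.
            raise TimeoutError("No worker became free; check your instance quota")
        time.sleep(poll_interval)

# In the tuner loop, a new configuration would only be requested from the
# scheduler once a worker (i.e. a SageMaker instance) is really available:
#   wait_for_free_worker(trial_backend, n_workers=5)
#   trial_backend.start_trial(config)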