sagemaker-python-sdk: Model deploy instance_type modification failed

Reference:

0409412934
MLFW-1638

System Information

Framework / Algorithm: PCA
Framework Version: SageMaker Aug 12, 2019 09:03 UTC
Python Version: 3
CPU or GPU: CPU
Python SDK Version:
Are you using a custom image: No

Describe the problem

Running the SageMaker example PCA for MNIST: if a user tries to deploy a model with an instance_type that is not available, the user cannot simply replace the instance_type and deploy again.

Minimal repro / logs

While running PCA for MNIST in the example project, execute the following script:

pca_predictor = pca.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

If the user has no remaining capacity to launch a new instance of the assigned instance_type, the user gets a ResourceLimitExceeded error. Suppose the user then decides to change the instance_type and executes the script again:

pca_predictor = pca.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

The user then gets this error:

--------------------------------------------------------------------------- 
ResourceLimitExceeded Traceback (most recent call last) 
<ipython-input-18-b2bb257b120c> in <module>() 
1 pca_predictor = pca.deploy(initial_instance_count=1, 
----> 2 instance_type='ml.t2.medium') 
...... 
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint operation: The account-level service limit 'ml.m4.xlarge for endpoint usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances... 
---------------------------------------------------------------------------

From this error message, we can see that although we set a new instance_type (ml.t2.medium), the request still refers to the old one (ml.m4.xlarge), so the instance_type is not really reset.

About this issue

State: closed
Created 5 years ago
Comments: 27 (26 by maintainers)

Most upvoted comments

Yeah, we can close this issue. Feel free to reopen/create a new issue if further issues arise 😃

laurenyu on Jul 10, 2020

I’ve merged changes that were released as part of v2.0.0.rc1 to address this issue.

laurenyu on Jul 10, 2020

I am getting an error while deploying the trained model. It fails partway through.

AkashShukla199 on Jun 5, 2020

Hello @Wei-1,

@mvsusp just recently left Amazon.

I’ll take over this issue now. I’ll begin by looking at #1070.

Reference:

0409412934
MLFW-1638

ChoiByungWook on Oct 14, 2019

I will answer your observations in the PR. Thanks.

mvsusp on Sep 30, 2019

Thanks for starting work on this PR, @Wei-1. Let us know if you have any doubts.

mvsusp on Sep 29, 2019

You are right. I believe the issue is that session.create_model tries to create a new model with the same name as the previous one and fails - see https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/session.py#L774

One possible solution to this issue is to always generate models with a new name, perhaps in the format of name + '_' + timestamp.

Any contribution will be highly appreciated. Thanks @Wei-1

mvsusp on Sep 12, 2019
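
The naming idea in the comment above can be sketched roughly as follows. This is a minimal illustration, not the change that was eventually merged. The helper unique_name is hypothetical, and it joins with a hyphen rather than the suggested '_' on the assumption that SageMaker resource names accept only alphanumerics and hyphens and are limited to 63 characters.

```python
import re
from datetime import datetime, timezone

# Assumption: SageMaker resource names are limited to 63 characters.
_MAX_NAME_LEN = 63


def unique_name(base, now=None):
    """Hypothetical helper: return `base` plus a UTC timestamp suffix."""
    now = now or datetime.now(timezone.utc)
    # Millisecond resolution, e.g. 2019-09-12-10-30-00-123
    suffix = now.strftime("%Y-%m-%d-%H-%M-%S-%f")[:-3]
    # Replace anything that is not alphanumeric or a hyphen (assumed naming rule).
    cleaned = re.sub(r"[^a-zA-Z0-9-]+", "-", base).strip("-") or "model"
    # Trim the base rather than the suffix so the unique part survives.
    cleaned = cleaned[: _MAX_NAME_LEN - len(suffix) - 1].rstrip("-")
    return f"{cleaned}-{suffix}"


# Possible use with the call from the report, assuming the installed SDK's
# deploy() accepts an endpoint_name argument:
#   pca.deploy(initial_instance_count=1, instance_type='ml.t2.medium',
#              endpoint_name=unique_name('pca-mnist'))
```

Whether a fresh name alone avoids the stale instance type depends on the SDK version, so treat this as something to try rather than a confirmed workaround.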

Hi @Wei-1,

Thanks for reporting this bug. I am adding changes to the behaviour of pca.deploy to our roadmap to fix your use case. We will update this ticket as soon as the changes are pushed.

Thanks for using SageMaker!

mvsusp on Sep 11, 2019

@Wei-1 hmm, can you try specifying a new endpoint name?

laurenyu on Sep 9, 2019

@Wei-1,

Thank you for the clarification.

Can you do me a quick favor and check your endpoint configurations in the AWS console?

AWS Console -> Amazon SageMaker -> Endpoint configurations

Or you can use the cli: https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-endpoint-configs.html

Can you tell me if you see any corresponding endpoint configurations that have the expected “ml.t2.medium”? Based on the code links provided, the session object should be propagating the instance type correctly.

Thanks!

ChoiByungWook on Aug 31, 2019
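
The check requested in the comment above can also be scripted. The sketch below assumes a boto3 SageMaker client and uses its list_endpoint_configs and describe_endpoint_config calls. The function name endpoint_config_instance_types is hypothetical, and the client is passed in so the logic can be tried without AWS credentials.

```python
def endpoint_config_instance_types(sm_client, name_contains=None):
    """Map each endpoint configuration name to the instance types of its variants.

    `sm_client` is assumed to behave like boto3.client('sagemaker').
    """
    kwargs = {}
    if name_contains:
        kwargs["NameContains"] = name_contains
    result = {}
    while True:
        page = sm_client.list_endpoint_configs(**kwargs)
        for summary in page.get("EndpointConfigs", []):
            name = summary["EndpointConfigName"]
            detail = sm_client.describe_endpoint_config(EndpointConfigName=name)
            result[name] = [
                variant.get("InstanceType")
                for variant in detail.get("ProductionVariants", [])
            ]
        token = page.get("NextToken")
        if not token:
            break
        # Follow pagination until every configuration has been listed.
        kwargs["NextToken"] = token
    return result


# Possible use (requires boto3 and AWS credentials):
#   import boto3
#   print(endpoint_config_instance_types(boto3.client('sagemaker')))
```

If no configuration lists ml.t2.medium after the second deploy call, that would suggest the new instance type never reached the endpoint configuration.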

Hello @Wei-1,

Thank you for bringing this to our attention.

Let me look into this and figure out why the instance type isn’t being propagated correctly.

Thank you for your patience.

ChoiByungWook on Aug 30, 2019