sagemaker-python-sdk: TensorFlowModel points to images that do not exist

Please fill out the form below.

System Information

  • TensorFlow:
  • Fails for all versions:
  • Fails for py3 and py2:
  • Fails for CPU and GPU:
  • No custom image:

Describe the problem

If I try to deploy a pre-built model like so:

sagemaker_model = TensorFlowModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model0100.tar.gz',
                                  role=role,
                                  framework_version='1.13', py_version='py3',
                                  entry_point='train.py')

it fails when I call deploy:

predictor = sagemaker_model.deploy(initial_instance_count=1,
                                   instance_type='ml.p2.xlarge')

I receive:

ValueError: Error hosting endpoint sagemaker-tensorflow-2019-07-07-11-50-45-473: Failed Reason:  The image '520713654638.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-tensorflow:1.13-gpu-py3' does not exist.

I can get past this error by specifying the image explicitly (this is not well documented; it took a lot of digging to find an image URI that worked):

sagemaker_model = TensorFlowModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model0100.tar.gz',
                                  role=role,
                                  framework_version='1.13', py_version='py3',
                                  entry_point='train.py',
                                  image='763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.13-gpu')
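As an aside, newer releases of the Python SDK (v2+) can look up a valid inference image URI instead of hard-coding one. A minimal sketch, assuming SDK v2; the version and instance type here are illustrative:

from sagemaker import image_uris

# Resolve the inference image URI for a given framework version and region
# (assumes SageMaker Python SDK v2; the argument values are illustrative).
image_uri = image_uris.retrieve(
    framework='tensorflow',
    region='eu-west-1',
    version='1.13',
    image_scope='inference',
    instance_type='ml.p2.xlarge',
)
print(image_uri)  # e.g. ...dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:1.13-gpu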

Any idea how to solve this?

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 5
  • Comments: 16 (5 by maintainers)

Most upvoted comments

Just some context.

There are two TensorFlow solutions that handle serving in the Python SDK.

They have different class representations and documentation, as shown below.

  1. TensorFlowModel - https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/model.py#L47
     Doc: https://github.com/aws/sagemaker-python-sdk/tree/v1.12.0/src/sagemaker/tensorflow#deploying-directly-from-model-artifacts
     Key difference: uses a proxy gRPC client to send requests
     Container impl: https://github.com/aws/sagemaker-tensorflow-container/blob/master/src/tf_container/serve.py

  2. Model - https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/serving.py#L96
     Doc: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst
     Key difference: uses the TensorFlow Serving REST API
     Container impl: https://github.com/aws/sagemaker-tensorflow-serving-container/blob/master/container/sagemaker/serve.py

Python 3 isn't supported with the TensorFlowModel object. Its container uses the TensorFlow Serving API library together with a gRPC client to make inferences, but that library has no official Python 3 support, so only Python 2 versions of those containers exist.

If you need Python 3, you will need to use the Model object defined in #2 above. The inference script format changes if you need to handle pre- and post-processing; see https://github.com/aws/sagemaker-tensorflow-serving-container#prepost-processing and the sketch below.
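A minimal sketch of that script format, following the input_handler/output_handler names documented in the serving container README (the JSON handling here is illustrative):

def input_handler(data, context):
    # Pre-process the request body before it is forwarded to TensorFlow Serving.
    if context.request_content_type == 'application/json':
        return data.read().decode('utf-8')
    raise ValueError('Unsupported content type: {}'.format(context.request_content_type))

def output_handler(data, context):
    # Post-process the TensorFlow Serving response before returning it to the client.
    if data.status_code != 200:
        raise ValueError(data.content.decode('utf-8'))
    return data.content, context.accept_header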

Also, your inference requests will need to follow the TFS REST API; see https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#making-predictions-against-a-sagemaker-endpoint and the example below.
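For instance, a request through the SDK's predictor might look like this (the input values are placeholders):

# The TFS REST API expects a JSON body with an 'instances' (or 'inputs') key.
input_data = {'instances': [[1.0, 2.0, 5.0]]}
result = predictor.predict(input_data)
# result is a dict of the form {'predictions': [...]}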

Since you train externally, you will need to make sure your model artifacts follow the correct format; see https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#deploying-more-than-one-model-to-your-endpoint and the sketch below.
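Roughly, the SavedModel has to sit under a model name and a numeric version directory inside the tarball. A sketch with hypothetical names:

import tarfile

# Expected layout inside model.tar.gz (directory names here are hypothetical):
#   my_model/
#       1/
#           saved_model.pb
#           variables/
with tarfile.open('model.tar.gz', 'w:gz') as tar:
    tar.add('my_model')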

Here is an example that does most of what you're trying to do: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_serving_container/tensorflow_serving_container.ipynb

Sorry for the confusion and wall of text and links. Please let me know if there is anything I can clarify.

Thanks!

@abdelhamednouh you’re commenting on an old, closed issue with an unrelated error message - can you open a new issue?

@ChoiByungWook Thanks for the explanation! I am wondering when TF 1.14 will be supported for serving.

I tried the CPU, GPU, and Elastic Inference variants, but none of the corresponding images seem to be available:

The image '763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:1.14-cpu' does not exist.

The image '763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference:1.14-gpu' does not exist.

I used your second option:

from sagemaker import get_execution_role
from sagemaker.tensorflow.serving import Model
role = get_execution_role()

sagemaker_model = Model(model_data='s3://sagemaker-hover/Models/zulu/tpu/model.tar.gz',
                        role=role,
                        framework_version='1.14')
predictor = sagemaker_model.deploy(initial_instance_count=1, 
                                   instance_type='ml.p2.xlarge',
                                   endpoint_name='test-001')

Also, the TensorFlowModel class seems to support only up to TF 1.12.

Hi @yuchuang1979,

That is precisely what I am referring to: I am trying to deploy a model I trained elsewhere. You can indeed work around the problem by specifying the image. My point, however, is that the default points to the wrong Docker image. It's a bug.

Best, Noah

@NoahDolev @ChuyangDeng I hit the same error when following this link: https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker/ to deploy a pre-trained model of my own in SageMaker. Since my model uses py3, I had to specify py_version like this:

sagemaker_model = TensorFlowModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                                  role=role,
                                  py_version='py3',
                                  framework_version='1.12',
                                  entry_point='train.py')

predictor = sagemaker_model.deploy(initial_instance_count=1, instance_type='ml.p2.xlarge')

ValueError: Error hosting endpoint sagemaker-tensorflow-2019-07-10-05-06-02-075: Failed Reason: The image '520713654638.dkr.ecr.us-east-2.amazonaws.com/sagemaker-tensorflow:1.12-gpu-py3' does not exist.

When I delete py_version='py3', the error goes away.
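(That is consistent with the explanation above: without py_version, the SDK falls back to a py2 image tag, which does exist. A sketch of the working call, reusing the same paths:)

# Without py_version, the SDK resolves a py2 image tag, which exists.
sagemaker_model = TensorFlowModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                                  role=role,
                                  framework_version='1.12',
                                  entry_point='train.py')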