sagemaker-inference-toolkit: Error in batch transform with custom image

Describe the problem

I needed to add the GluonCV library to my code environment. Since the default MXNet container does not include that Python package, I had to create a custom image with it installed.

I got the default MXNet container from here: https://github.com/aws/sagemaker-mxnet-serving-container and followed all the instructions. To include GluonCV, I simply added it to the Dockerfile and built the image:

RUN ${PIP} install --no-cache-dir mxnet-mkl==$MX_VERSION \
                                  mxnet-model-server==$MMS_VERSION \
                                  keras-mxnet==2.2.4.1 \
                                  numpy==1.14.5 \
                                  gluoncv \
                                  onnx==1.4.1 \
                                  ...

I built the image, then uploaded it to AWS ECR.
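Before pushing an image like this, it can help to confirm inside the container that the added package is actually importable. The following is a minimal sketch, not part of the serving container itself; the helper name `missing_modules` is hypothetical, and the top-level import names passed to it (for example `gluoncv` and `mxnet`) are assumptions about how the pip packages expose themselves.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located for import.

    Only top-level names are expected here; find_spec() on a dotted name
    would try to import the parent package first.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Run with the container's Python interpreter, `missing_modules(["mxnet", "gluoncv"])` should return an empty list if the install step worked. This only shows that the packages can be found, not that the model server will start.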

I am able to verify that the docker image has been successfully uploaded and I have a valid URI like so: 552xxxxxxx.dkr.ecr.us-west-2.amazonaws.com/preprod-mxnet-serving:1.4.1-cpu-py3
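Because the URI is later passed as a plain string, a small sanity check on its shape can catch copy-paste slips in the account ID, region, or tag. This is a sketch under the assumption that the URI follows the usual `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` layout; the function name `parse_ecr_uri` is hypothetical, and the pattern is deliberately lenient about the account part so that masked IDs like the one above still parse.

```python
import re

# Assumed layout: <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
_ECR_URI = re.compile(
    r"^(?P<account>[^.]+)\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_uri(uri):
    """Split a tagged ECR image URI into account, region, repository and tag."""
    match = _ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"Not a tagged ECR image URI: {uri!r}")
    return match.groupdict()
```

Comparing the parsed fields of the URI shown by ECR with the one used in code makes a mismatch easy to spot. It does not check that the image exists in the registry.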

THEN, when instantiating the MXNet model, I added a reference to this image URI like so

sagemaker_model = MXNetModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/yolo_object_person_detector.tar.gz',
                             role=role,
                             entry_point='entry_point.py',
                             image='552xxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/preprod-mxnet-serving:1.4.1-cpu-py3',
                             py_version='py3',
                             framework_version='1.4.1',
                             sagemaker_session=sagemaker_session)

But I got an error message. Here is the full log:

Traceback (most recent call last):
File "/usr/local/bin/dockerd-entrypoint.py", line 21, in <module>
serving.main()
File "/usr/local/lib/python3.6/site-packages/sagemaker_mxnet_serving_container/serving.py", line 54, in main
_start_model_server()
File "/usr/local/lib/python3.6/site-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/usr/local/lib/python3.6/site-packages/retrying.py", line 206, in call
return attempt.get(self._wrap_exception)
File "/usr/local/lib/python3.6/site-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/usr/local/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/usr/local/lib/python3.6/site-packages/sagemaker_mxnet_serving_container/serving.py", line 49, in _start_model_server
model_server.start_model_server(handler_service=HANDLER_SERVICE)
File "/usr/local/lib/python3.6/site-packages/sagemaker_inference/model_server.py", line 63, in start_model_server
'/dev/null'])
File "/usr/local/lib/python3.6/subprocess.py", line 287, in call
with Popen(*popenargs, **kwargs) as p:
File "/usr/local/lib/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/local/lib/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)

OSError: [Errno 14] Bad address: 'tail'

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

Hi @ChoiByungWook ,

I have some updates regarding the bug. I found some inconsistencies even with the default MXNet Serving container.

Here is the command:

sagemaker_model = MXNetModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/yolo_object_person_detector.tar.gz',
                             role=role,
                             entry_point='entry_point.py',
                             py_version='py3',
                             framework_version='1.4.1',
                             sagemaker_session=sagemaker_session)

transformer = sagemaker_model.transformer(instance_count=1, instance_type='ml.m4.xlarge', output_path=batch_output)

transformer.transform(data=batch_input, content_type='application/x-image')

transformer.wait()

I ran the same script three times and here are the results & error messages:

FIRST RUN

Traceback (most recent call last):
File "/usr/local/bin/dockerd-entrypoint.py", line 8, in <module>
serving.main()
File "/usr/local/lib/python3.6/site-packages/sagemaker_mxnet_serving_container/serving.py", line 42, in main
model_server.start_model_server(handler_service=HANDLER_SERVICE)
File "/usr/local/lib/python3.6/site-packages/sagemaker_inference/model_server.py", line 57, in start_model_server
mms_process = subprocess.Popen(mxnet_model_server_cmd)
File "/usr/local/lib/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/local/lib/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)

OSError: [Errno 14] Bad address: 'mxnet-model-server'

SECOND RUN

Traceback (most recent call last):
File "/usr/local/bin/dockerd-entrypoint.py", line 8, in <module>
serving.main()
File "/usr/local/lib/python3.6/site-packages/sagemaker_mxnet_serving_container/serving.py", line 42, in main
model_server.start_model_server(handler_service=HANDLER_SERVICE)
File "/usr/local/lib/python3.6/site-packages/sagemaker_inference/model_server.py", line 63, in start_model_server
'/dev/null'])
File "/usr/local/lib/python3.6/subprocess.py", line 287, in call
with Popen(*popenargs, **kwargs) as p:
File "/usr/local/lib/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/local/lib/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)

OSError: [Errno 14] Bad address: 'tail'

THIRD RUN SUCCESS

The error is inconsistent. I suspect it has something to do with delays and timers in the server code, as you mentioned previously.
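Since the same script failed twice and then succeeded, one way to probe the intermittent failure is to retry the process launch only when it fails with errno 14 (`EFAULT`, "Bad address"). The sketch below is an illustration under stated assumptions, not the toolkit's code or its eventual fix: `call_with_retry` and its parameters are hypothetical names, and nothing in this thread establishes that a retry clears the error. The traceback from the custom image already passes through a retry wrapper and still surfaces it.

```python
import errno
import subprocess
import time


def call_with_retry(cmd, attempts=3, delay_seconds=1.0, runner=subprocess.call):
    """Run cmd, retrying only when launching it fails with EFAULT (errno 14).

    `runner` defaults to subprocess.call and is injectable so the retry
    logic can be exercised without spawning a real process.
    """
    for attempt in range(1, attempts + 1):
        try:
            return runner(cmd)
        except OSError as exc:
            # Re-raise unrelated errors, and the final EFAULT once attempts run out.
            if exc.errno != errno.EFAULT or attempt == attempts:
                raise
            time.sleep(delay_seconds)
```

Limiting the retry to `errno.EFAULT` keeps genuine problems, such as a missing executable, visible on the first attempt.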

@velociraptor111,

Gotcha, thanks for the information.

Looks like there are two problems: one with the tail call and one with requirements.txt.

I’ll start with the tail call, since it can cause jobs to fail regardless of their dependencies.