sagemaker-inference-toolkit: Error in batch transform with custom image
Describe the problem
I needed the GluonCV library in my code environment. Since the default MXNet container does not include that Python package, I had to create a custom image with it installed.
I got the default MXNet container from https://github.com/aws/sagemaker-mxnet-serving-container and followed all the instructions. To include GluonCV, I simply added it to the pip install step in the Dockerfile and built the image:
RUN ${PIP} install --no-cache-dir mxnet-mkl==$MX_VERSION \
mxnet-model-server==$MMS_VERSION \
keras-mxnet==2.2.4.1 \
numpy==1.14.5 \
gluoncv \
onnx==1.4.1 \
...
I built the image, then uploaded it to AWS ECR.
I am able to verify that the docker image has been successfully uploaded and I have a valid URI like so:
552xxxxxxx.dkr.ecr.us-west-2.amazonaws.com/preprod-mxnet-serving:1.4.1-cpu-py3
Then, when instantiating the MXNet model, I referenced this image URI like so:
from sagemaker.mxnet import MXNetModel

# role and sagemaker_session are defined elsewhere in my script.
sagemaker_model = MXNetModel(
    model_data='s3://' + sagemaker_session.default_bucket() + '/model/yolo_object_person_detector.tar.gz',
    role=role,
    entry_point='entry_point.py',
    image='552xxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/preprod-mxnet-serving:1.4.1-cpu-py3',
    py_version='py3',
    framework_version='1.4.1',
    sagemaker_session=sagemaker_session,
)
But I got an error message. Here is the full log:
Traceback (most recent call last):
  File "/usr/local/bin/dockerd-entrypoint.py", line 21, in <module>
    serving.main()
  File "/usr/local/lib/python3.6/site-packages/sagemaker_mxnet_serving_container/serving.py", line 54, in main
    _start_model_server()
  File "/usr/local/lib/python3.6/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python3.6/site-packages/retrying.py", line 206, in call
    return attempt.get(self._wrap_exception)
  File "/usr/local/lib/python3.6/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/local/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/local/lib/python3.6/site-packages/sagemaker_mxnet_serving_container/serving.py", line 49, in _start_model_server
    model_server.start_model_server(handler_service=HANDLER_SERVICE)
  File "/usr/local/lib/python3.6/site-packages/sagemaker_inference/model_server.py", line 63, in start_model_server
    '/dev/null'])
  File "/usr/local/lib/python3.6/subprocess.py", line 287, in call
    with Popen(*popenargs, **kwargs) as p:
  File "/usr/local/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/local/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 14] Bad address: 'tail'
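The traceback ends inside subprocess.call, which raised OSError while trying to spawn tail. As a rough sketch only (this is not the toolkit's code), a spawn call could be wrapped so that an OSError is retried a few times with a short pause. The helper name call_with_retry and its parameters are hypothetical, and the approach only helps if the failure is transient rather than deterministic.

```python
import subprocess
import time


def call_with_retry(cmd, attempts=3, delay_seconds=1.0, runner=subprocess.call):
    """Run cmd via runner, retrying when spawning the process raises OSError.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.call returns the child's exit code once it finishes.
            return runner(cmd)
        except OSError as err:
            last_error = err
            # Pause before the next attempt, but not after the final one.
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Every attempt failed to spawn the process: surface the last error.
    raise last_error
```

The runner parameter exists only so the retry logic can be exercised without spawning a real process; in normal use it stays at its subprocess.call default.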
About this issue
- State: closed
- Created 5 years ago
- Comments: 15 (7 by maintainers)
Hi @ChoiByungWook ,
I have some updates regarding the bug. I found some inconsistencies even with the default MXNet Serving container.
Here is the command:
I ran the same script three times and here are the results & error messages:
FIRST RUN
SECOND RUN
THIRD RUN SUCCESS
The error is intermittent. I suspect it has to do with delays and timers in the server code, as you mentioned previously.
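To put a number on how often a failure like this occurs, the same command can be run repeatedly and the outcomes tallied. The sketch below is only an illustration: tally_runs is a hypothetical helper, and the command passed to it is a placeholder for whatever script starts the container.

```python
import subprocess
from collections import Counter


def tally_runs(cmd, runs=3, timeout_seconds=60):
    """Run cmd several times and count outcomes by exit code or exception type.

    Hypothetical helper for illustration; not part of any SageMaker tooling.
    """
    outcomes = Counter()
    for _ in range(runs):
        try:
            completed = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                timeout=timeout_seconds,
            )
            outcomes["exit %d" % completed.returncode] += 1
        except (OSError, subprocess.TimeoutExpired) as err:
            # Spawn failures (such as the OSError above) and hangs are
            # counted under the exception class name.
            outcomes[type(err).__name__] += 1
    return outcomes
```

Calling tally_runs with the real start-up command and a larger runs value gives a rough failure rate, which makes it easier to tell whether a change actually reduces the flakiness.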
@velociraptor111,
Gotcha, thanks for the information.
It looks like there are two problems: one with the tail call and one with requirements.txt.
I'll start with the tail call, since it can cause jobs to fail regardless of which dependencies they include.
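For context, the traceback shows the server start-up code spawning tail with /dev/null as its last argument, presumably just to keep the entrypoint process alive. One possible way to avoid spawning an external binary at all is to block inside Python until a termination signal arrives. This is a sketch under that assumption, not the fix adopted in the toolkit; block_until_terminated is a hypothetical name.

```python
import signal
import threading


def block_until_terminated(stop_event=None, poll_seconds=1.0):
    """Block the calling thread until SIGTERM arrives or stop_event is set.

    Hypothetical helper for illustration; it replaces an external
    keep-alive process with a wait inside the Python interpreter.
    """
    if stop_event is None:
        stop_event = threading.Event()

    def _handle_sigterm(signum, frame):
        stop_event.set()

    # Signal handlers can only be installed from the main thread.
    if threading.current_thread() is threading.main_thread():
        signal.signal(signal.SIGTERM, _handle_sigterm)

    # Event.wait returns True once the event is set, False on timeout.
    while not stop_event.wait(poll_seconds):
        pass
    return True
```

Because no child process is created, the keep-alive step itself cannot fail with a spawn error such as the OSError in the log, although this does nothing for errors raised elsewhere during start-up.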