sagemaker-pytorch-inference-toolkit: model_fn ignored on PyTorch v1.6 serving container?
Describe the bug
The PyTorch inference container at v1.6 seems to ignore the provided model_fn() and attempts to load a model.pth file (non-existent in my case), resulting in an error for code which worked fine on the v1.4 container.
Maybe this is just a doc issue? I couldn’t see any indication of how this override should be provided differently in v1.6.
To reproduce
- Train a model in PyTorch framework container v1.6 and save it as a non-standard artifact (e.g. put it in a zip file inside the model.tar.gz, or something)
- Create a PyTorchModel with e.g. source_dir="src/", entry_point="src/inference.py", where the entry point script defines a model_fn(model_dir: str)
- Try to run the model, e.g. as a batch transform (see the sketch after this list).
(Will see if there’s any public example I can link to to accelerate reproduction)
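For reference, a minimal sketch of the kind of setup I mean - the bucket, role, instance type and content type below are placeholders, not my actual values:

```python
# Sketch only: bucket/role/instance/content-type values are placeholders.
import sagemaker
from sagemaker.pytorch import PyTorchModel

role = sagemaker.get_execution_role()  # or an explicit IAM role ARN

model = PyTorchModel(
    model_data="s3://my-bucket/training-job/output/model.tar.gz",  # non-standard artifact inside
    role=role,
    source_dir="src/",
    entry_point="src/inference.py",   # defines model_fn(model_dir: str)
    framework_version="1.6",
    py_version="py3",
)

transformer = model.transformer(
    instance_count=1,
    instance_type="ml.g4dn.xlarge",        # placeholder instance type
)
transformer.transform(
    "s3://my-bucket/batch-input/",          # placeholder input location
    content_type="application/jsonlines",   # placeholder content type
)
```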
Expected behavior
The container calls the model_fn per the SageMaker SDK docs and loads the model successfully.
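i.e. an entry point along these lines should be honored - the loading logic below is purely illustrative (MyModel and the weights file name are hypothetical); the point is that the override, not a model.pth convention, gets used:

```python
# src/inference.py - shape of the override; MyModel and "weights.bin" are hypothetical
import os

import torch
from torch import nn


class MyModel(nn.Module):  # hypothetical stand-in for the real network
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 2)

    def forward(self, x):
        return self.linear(x)


def model_fn(model_dir: str):
    """Load the model from a non-standard artifact; no model.pth is assumed to exist."""
    model = MyModel()
    state = torch.load(os.path.join(model_dir, "weights.bin"), map_location="cpu")
    model.load_state_dict(state)
    model.eval()
    return model
```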
Screenshots or logs
My job appears to have generated A LOT of duplicate log entries repeating the below, before eventually going quiet/hanging. It still showed as in-progress with 0 CPU utilization many minutes later - much longer than typical for the same dataset on previous PyTorch versions - so I forcibly stopped it.
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/model.pth'
Traceback (most recent call last):
  File "/opt/conda/bin/torch-model-archiver", line 10, in <module>
    sys.exit(generate_model_archive())
  File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging.py", line 60, in generate_model_archive
    package_model(args, manifest=manifest)
  File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging.py", line 37, in package_model
    model_path = ModelExportUtils.copy_artifacts(model_name, **artifact_files)
  File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging_utils.py", line 150, in copy_artifacts
    shutil.copy(path, model_path)
  File "/opt/conda/lib/python3.6/shutil.py", line 245, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/opt/conda/lib/python3.6/shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
System information
- SageMaker Python SDK version: 2.15.0
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
- Framework version: 1.6
- Python version: 3
- CPU or GPU: GPU
- Custom Docker image (Y/N): N
Additional context
N/A
Any update on this issue?
From my perspective, requiring a single file called “model.pth” seems overly restrictive. Here are some potential use cases that this causes trouble for:
Can we get some guidance on whether this issue will be fixed? We’d really like to use versions greater than 1.5. Thanks in advance!
Here’s a sample notebook that successfully deploys a PyTorch 1.6 model using the SageMaker PyTorch 1.6 framework container with TorchServe:
https://github.com/data-science-on-aws/workshop/blob/374329adf15bf1810bfc4a9e73501ee5d3b4e0f5/09_deploy/wip/pytorch/01_Deploy_RoBERTa.ipynb
We use a pre-trained Hugging Face Roberta model with a custom inference script: https://github.com/data-science-on-aws/workshop/blob/374329adf15bf1810bfc4a9e73501ee5d3b4e0f5/09_deploy/wip/pytorch/code/inference.py
It seems that the model has to be called exactly model.pth for the SageMaker PyTorch 1.6 serving container to pick it up correctly. See: https://github.com/aws/sagemaker-pytorch-inference-toolkit/blob/6936c08581e26ff3bac26824b1e4946ec68ffc85/src/sagemaker_pytorch_serving_container/torchserve.py#L45
Hope that helps. Antje
@antje Thanks for your response. Yes, this seems a reasonable path. I didn’t train within SageMaker and didn’t plan to use SageMaker for inference (FastAPI is easy and fast), but now I have to due to certain technical considerations, so I wasn’t sure if it is a good idea to convert pytorch_model.bin to model.pth. Even after that, SentenceBERT’s model directory structure didn’t give me confidence that it will work.
Just to give you an update: I downgraded from v1.6 to v1.5 in PyTorch and it seems to be able to identify model_fn() now. To me it looks like from v1.6 onwards TorchServe is the default toolkit, and it has that default model name = model.pth piece. Not a good idea. Hope the folks make it more flexible.
@abmitra84 I simply added a custom save_pytorch_model function which saves my Huggingface model using torch.save(), as follows:
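A minimal sketch of the idea (the exact code may differ; whether you save the state_dict or the whole model object depends on how your model_fn loads it back):

```python
# Sketch of the helper described above; the directory layout and file name are the important part.
import os

import torch


def save_pytorch_model(model, model_dir):
    """Save a Hugging Face model with torch.save() as the model.pth the serving container expects."""
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "model.pth")
    # Either of these works, as long as model_fn loads it back the same way:
    torch.save(model.state_dict(), path)
    # torch.save(model, path)
```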
I call this save function at the end of my model training code. This gives me the model.pth file.
Thanks team! This issue is now resolved for me as of ~April (approximately, depending which framework version we’re talking about), so I’ll close it.
The new releases of the PyTorch containers consume the fixed version of TorchServe, which means they’re able to see other files apart from model.pth (like the inference script, and any other artifacts floating around). I’ve also confirmed this means we no longer need to specifically name our models model.pth - so long as the custom model_fn knows what to look for. For example, I can now have a model.tar.gz like below and it works fine:
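(Illustrative layout only - the file names here are placeholders; the point is that nothing has to be called model.pth as long as the custom model_fn knows what to load.)

```
model.tar.gz
├── my_weights.bin            # any name - custom model_fn loads this
├── tokenizer_config.json     # any other artifacts
└── code/
    ├── inference.py          # model_fn / input_fn / predict_fn / output_fn
    └── requirements.txt
```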
For anybody that’s still seeing the issue, you may need to upgrade and/or stop specifying the patch version of the framework: e.g. use framework_version="1.7" and not framework_version="1.7.0".
Note that if we want to be able to directly estimator.deploy(), I’m seeing we still need to:
- Copy the inference code to {model_dir}/code in the training job - so the scripts are present in the model tarball at the right location
- from inference import * in train.py, because the entry point will just be carried over from training to inference unless you specifically re-configure it, so it’ll still be pointing at train.py
(A sketch of both steps follows below.)
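A rough sketch of what that can look like at the end of a train.py - paths and file names here are illustrative:

```python
# train.py (tail) - sketch of the two workarounds above; paths/names are illustrative
import os
import shutil

# (2) Re-export the serving functions: the training entry point is reused at
# inference time unless re-configured, so it has to expose model_fn etc. itself.
from inference import *  # noqa: F401,F403


def copy_inference_code(model_dir):
    """(1) Copy the inference code into {model_dir}/code so it lands in model.tar.gz."""
    src_dir = os.path.dirname(os.path.abspath(__file__))
    code_dir = os.path.join(model_dir, "code")
    os.makedirs(code_dir, exist_ok=True)
    for name in ("inference.py", "requirements.txt"):
        src = os.path.join(src_dir, name)
        if os.path.exists(src):
            shutil.copy2(src, code_dir)
```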
…Of course if you use a PyTorchModel you don’t need to do either of these things - because it re-packages the model tarball with the source_dir you specify, and sets the entry_point.
The issue @clashofphish is seeing is something separate, I believe - possibly related to using an S3 tarball for source_dir rather than a local folder… Does it work for you if source_dir is instead a local folder containing inference_code.py? Does the root of your model.tar.gz on S3 definitely contain inference_code.py? Does renaming to inference.py (which used to be a fixed/required name for some other frameworks, I think) help? I’d suggest raising a separate issue for this if you’re still having trouble with it!
I’m having the same issue trying to extend a PyTorch 1.8 preconfigured container. I am attempting to use the M2M_100 transformer from Huggingface. I downloaded the model using Huggingface’s from_pretrained and saved it using save_pretrained. This saves a few files for the model, including a couple of .json files, a .bin model file, and a file for SentencePiece (.model). I have created a model.tar.gz file that contains my model saved in a “model” sub dir and my entry point code (containing model_fn, input_fn, etc.) with a requirements.txt file in a “code” sub dir. I’m attempting to deploy a PyTorchModel (since I don’t need nor want to train/tune). However, the container either (a) errors out and tells me that I should provide a model_fn function (referring me to a doc link for this), or (b) ignores my entry_point code and tries to use the default_model_fn.
My PyTorchModel configuration looks like:
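Roughly along these lines - the bucket, role, and instance values are placeholders, not my exact settings:

```python
# Sketch of the configuration; bucket/role/instance values are placeholders
from sagemaker.pytorch import PyTorchModel

role = "arn:aws:iam::111122223333:role/MySageMakerRole"   # placeholder execution role
model_uri = "s3://my-bucket/m2m100/model.tar.gz"          # tarball with model/ and code/ sub dirs

pytorch_model = PyTorchModel(
    model_data=model_uri,
    role=role,
    entry_point="inference_code.py",   # defines model_fn, input_fn, etc.
    source_dir=model_uri,              # pointing source_dir at the S3 tarball, as described below
    framework_version="1.8",
    py_version="py3",
)

predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
```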
I’m following AWS tutorials that suggest the patterns I have put into use. Like: https://aws.amazon.com/blogs/startups/how-startups-deploy-pretrained-models-on-amazon-sagemaker/
The weird thing is that I can deploy a test setup that follows almost the same pattern using sagemaker[local] on my laptop, no problem. The only real difference is that I’m pointing the source_dir to a local directory instead of at the model.tar.gz file. I tried pointing source_dir to an S3 bucket that contains the inference_code.py file rather than the tar.gz file, and this did not work, giving me error (b) above.
Is there a fix? Do I have to somehow force my model to save as a .pth file (not sure how this will work given the multiple dependencies of M2M_100)? Do I have to create a custom container rather than extend a preconfigured one? Do I just downgrade to PyTorch 1.5?
@antje how do you save the model as .pth, as Huggingface by default saves it as pytorch_model.bin?
Does this problem go away with a lower PyTorch version? I am using PyTorch v1.6.0.
If anyone has any idea, I am having trouble with SBERT deployment (https://www.sbert.net/). SBERT was trained using Huggingface, but underneath, the directory structure it creates is a little different. It keeps the .bin file within a sub folder (0_Transformers) and SageMaker just can’t figure out where the model.pth is. It keeps giving that error without going to model_fn, where the SBERT-based model loading is carried out by the SentenceBERT wrapper.
Any idea will be helpful.
@ajaykarpur