mlflow: [BUG] mlflow logs pytorch model instead of weights only -> prevents serving modularized code

Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
  • No. I cannot contribute a bug fix at this time.

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow): Y
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Arch Linux
  • MLflow installed from (source or binary): Binary
  • MLflow version (run mlflow --version): 1.10.0
  • Python version: 3.7.6
  • Exact command to reproduce: mlflow models serve -m MODELPATH

Describe the problem

I successfully trained a model. Now, when I try to serve it, I run into:

zeth@master /tmp> mlflow models serve -m /tmp/exploding_springfield/mlruns/0/f7c632e43f93437280cc72b88f279a56/artifacts/models                                                         (base) 
2020/08/11 14:47:26 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
2020/08/11 14:47:28 INFO mlflow.pyfunc.backend: === Running command 'source /home/zeth/anaconda3/bin/../etc/profile.d/conda.sh && conda activate mlflow-dd325f076f6465c8205b2342fd8ab4531e905e1a 1>&2 && gunicorn --timeout=60 -b 127.0.0.1:5000 -w 1 ${GUNICORN_CMD_ARGS} -- mlflow.pyfunc.scoring_server.wsgi:app'
[2020-08-11 14:47:28 +0200] [20429] [INFO] Starting gunicorn 20.0.4
[2020-08-11 14:47:28 +0200] [20429] [INFO] Listening at: http://127.0.0.1:5000 (20429)
[2020-08-11 14:47:28 +0200] [20429] [INFO] Using worker: sync
[2020-08-11 14:47:28 +0200] [20435] [INFO] Booting worker with pid: 20435
[2020-08-11 14:47:29 +0200] [20435] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/gunicorn/arbiter.py", line 583, in spawn_worker
    worker.init_process()
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/gunicorn/workers/base.py", line 119, in init_process
    self.load_wsgi()
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/gunicorn/workers/base.py", line 144, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 49, in load
    return self.load_wsgiapp()
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 39, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/gunicorn/util.py", line 358, in import_app
    mod = importlib.import_module(module)
  File "/home/zeth/anaconda3/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/mlflow/pyfunc/scoring_server/wsgi.py", line 6, in <module>
    app = scoring_server.init(load_model(os.environ[scoring_server._SERVER_MODEL_PATH]))
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/mlflow/pyfunc/__init__.py", line 473, in load_model
    model_impl = importlib.import_module(conf[MAIN])._load_pyfunc(data_path)
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/mlflow/pytorch/__init__.py", line 423, in _load_pyfunc
    return _PyTorchWrapper(_load_model(path, **kwargs))
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/mlflow/pytorch/__init__.py", line 331, in _load_model
    import torch
ModuleNotFoundError: No module named 'torch'
[2020-08-11 14:47:29 +0200] [20435] [INFO] Worker exiting (pid: 20435)
[2020-08-11 14:47:29 +0200] [20429] [INFO] Shutting down: Master
[2020-08-11 14:47:29 +0200] [20429] [INFO] Reason: Worker failed to boot.
Traceback (most recent call last):
  File "/home/zeth/anaconda3/bin/mlflow", line 8, in <module>
    sys.exit(cli())
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/mlflow/models/cli.py", line 59, in serve
    host=host)
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/mlflow/pyfunc/backend.py", line 92, in serve
    command_env=command_env)
  File "/home/zeth/anaconda3/lib/python3.7/site-packages/mlflow/pyfunc/backend.py", line 172, in _execute_in_conda_env
    command, rc
Exception: Command 'source /home/zeth/anaconda3/bin/../etc/profile.d/conda.sh && conda activate mlflow-dd325f076f6465c8205b2342fd8ab4531e905e1a 1>&2 && gunicorn --timeout=60 -b 127.0.0.1:5000 -w 1 ${GUNICORN_CMD_ARGS} -- mlflow.pyfunc.scoring_server.wsgi:app' returned non zero return code. Return code = 3

The conda.yaml file is not broken:

zeth@master /t/e/m/0/f/a/models> bat conda.yaml    (mlf-core)
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- python=3.7.7
- pytorch=1.6.0
- torchvision=0.7.0
- pip
- pip:
  - mlflow
  - cloudpickle==1.5.0
name: mlflow-env

And the Conda environment contains torch as well (verified).

I expect the model to be served without any issues.

Code to reproduce issue

Difficult to share, but if required I can absolutely do so.

What component(s), interfaces, languages, and integrations does this bug affect?

Components

  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/scoring: Local serving, model deployment tools, spark UDFs

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Comments: 24 (24 by maintainers)

Most upvoted comments

I was able to reproduce the same issue with PyTorch.

folder structure

├── dir
│   └── load.py
├── model.py 
└── train.py

code

model.py

import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

train.py

import mlflow.pytorch

from model import TwoLayerNet

with mlflow.start_run():
    N, D_in, H, D_out = 64, 1000, 100, 10
    model = TwoLayerNet(D_in, H, D_out)
    mlflow.pytorch.log_model(model, "model")

load.py

import mlflow.pytorch


mlflow.pytorch.load_model(
    "path/to/model-dir"
)

How to reproduce the error:

python train.py
cd dir
python load.py

output:

...
Traceback (most recent call last):
  File "load.py", line 8, in <module>
    "xxx/model.pth"
  File "xxx/lib/python3.6/site-packages/torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "xxx/lib/python3.6/site-packages/torch/serialization.py", line 702, in _legacy_load
    result = unpickler.load()
ModuleNotFoundError: No module named 'model'
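
For context (my own note, not part of the traceback): the logged artifact is the whole pickled TwoLayerNet instance, so unpickling resolves the class via import model; running load.py from dir/ fails because model.py is not importable from there. A minimal workaround sketch under that assumption (the path handling and file name below are illustrative only, not from the issue):

import os
import sys

import mlflow.pytorch

# Assumption: this script lives in dir/ and model.py lives in the parent
# directory (see the folder structure above). The pickled model resolves its
# class via `import model`, so that directory must be on sys.path first.
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))

loaded_model = mlflow.pytorch.load_model("path/to/model-dir")  # placeholder path, as in load.py

Depending on the MLflow version, passing the defining file via the code_paths argument of mlflow.pytorch.log_model may achieve the same thing; whether 1.10.0 already supports that should be checked against the docs.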

Is this also an issue with mlflow.pytorch.log_model()?

I guess so. It seems MLflow currently logs the entire pickled PyTorch model object rather than only the weights (state_dict), which the PyTorch documentation recommends against because unpickling then requires the original class definition module to be importable.
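
For anyone skimming, a minimal plain-PyTorch sketch of that distinction (not MLflow internals; file names are illustrative and the class is the one from the reproduction above):

import torch

from model import TwoLayerNet  # class from the reproduction above

model = TwoLayerNet(1000, 100, 10)

# Saving the full object pickles it; loading the file later re-imports
# `model.TwoLayerNet`, which is why load.py fails when run from dir/.
torch.save(model, "whole_model.pth")

# Weights-only saving, as recommended by the PyTorch serialization docs:
# the artifact carries no reference to the defining module, and the class
# is reconstructed explicitly at load time.
torch.save(model.state_dict(), "weights_only.pth")

restored = TwoLayerNet(1000, 100, 10)
restored.load_state_dict(torch.load("weights_only.pth"))
restored.eval()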