mlflow: [BUG] Sagemaker serverless deployment fails using mlflow-pyfunc image

Issues Policy acknowledgement

  • I have read and agree to submit bug reports in accordance with the issues policy

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

  • Client: 2.7.0-2.7.1
  • Tracking server: 1.x.y

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Python version: 3.8/3.9
  • yarn version, if running the dev UI:

Describe the problem

When deploying a pyfunc model to a serverless SageMaker endpoint using the default MLflow pyfunc image (2.7.1 in this case), endpoint creation fails. There appears to be an issue with how dependencies are packaged, specifically with accessing pyenv: the pyenv binary is missing from the environment the serving container executes in. A local check is sketched below.
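
If you want to confirm the missing binary before going through SageMaker, one option is to run the serving image locally and ask it for pyenv. This is only a rough diagnostic sketch, not MLflow functionality: the image tag is a placeholder for however the mlflow-pyfunc 2.7.1 image is tagged in your registry, and it assumes Docker is available on the machine running the check.

import subprocess

# Placeholder tag; substitute the actual mlflow-pyfunc 2.7.1 image you build/push.
image = "mlflow-pyfunc:2.7.1"

# Override the entrypoint so the container simply tries to run `pyenv --version`.
# If pyenv is absent from the image, this exits non-zero, which is consistent
# with the MlflowException shown in the CloudWatch logs below.
result = subprocess.run(
    ["docker", "run", "--rm", "--entrypoint", "pyenv", image, "--version"],
    capture_output=True,
    text=True,
)
print(result.returncode)
print(result.stdout or result.stderr)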

Also, as it stands today it is not possible to update a real-time endpoint to serverless. That functionality is built into MLflow but does not work as expected: https://github.com/mlflow/mlflow/blob/0d5adbd03be636b89f8b087f270f2aaedd19da93/mlflow/sagemaker/__init__.py#L1706C1-L1707C71
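
In the meantime, the conversion can be done outside MLflow. A minimal sketch with boto3 (this is not an MLflow API; the endpoint, model, and config names below are placeholders), which reuses the already-registered SageMaker model and points the existing endpoint at a new serverless endpoint config:

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

# New endpoint config that attaches a ServerlessConfig instead of instance settings.
sm.create_endpoint_config(
    EndpointConfigName="testing-serverless-config",   # placeholder
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "testing-serverless-model",   # name of the existing SageMaker model
            "ServerlessConfig": {"MemorySizeInMB": 2048, "MaxConcurrency": 20},
        }
    ],
)

# Switch the existing real-time endpoint over to the serverless config.
sm.update_endpoint(
    EndpointName="testing-serverless",                 # placeholder
    EndpointConfigName="testing-serverless-config",
)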

Tracking information

n/a

Code to reproduce issue

from mlflow.deployments import get_deploy_client

sagemaker_client = get_deploy_client('sagemaker')

model_uri = 'a_unique_uri'  # URI of the logged pyfunc model
role_arn = 'arn:aws:iam::123456789012:role/sagemaker-execution-role'  # placeholder execution role
bucket_name = "test_bucket"

serverless_config = {
    "MemorySizeInMB": 2048,
    "MaxConcurrency": 20,
}
config = {
    'region_name': "us-east-1",
    'execution_role_arn': role_arn,
    'bucket': bucket_name,
    'serverless_config': serverless_config,
}

model_name = "testing-serverless"
flavor = "python_function"
endpoint = None

sagemaker_client.create_deployment(
    model_name,
    model_uri,
    flavor,
    config,
    endpoint,
)

Stack trace

The deployment operation failed with the following error message: "An unknown SageMaker failure occurred. Please see the SageMaker console logs for more information."
Error creating model
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.17_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/Cellar/python@3.9/3.9.17_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/build.py", line 27, in <module>
    start()
  File "/Users/build.py", line 23, in start
    raise e
  File "/Users/build.py", line 19, in start
    deploy.aws(packaged_model)
  File "/Users/deploy.py", line 87, in aws
    sagemaker_client.create_deployment(
  File "/Users/.local/share/virtualenvs/my_model-BEzerOf2-python/lib/python3.9/site-packages/mlflow/sagemaker/__init__.py", line 2261, in create_deployment
    app_name, flavor = _deploy(
  File "/Users/.local/share/virtualenvs/my_model-BEzerOf2-python/lib/python3.9/site-packages/mlflow/sagemaker/__init__.py", line 471, in _deploy
    raise MlflowException(
mlflow.exceptions.MlflowException: The deployment operation failed with the following error message: "An unknown SageMaker failure occurred. Please see the SageMaker console logs for more information."

Other info / logs

# AWS CloudWatch logs from the endpoint
2023/10/03 12:52:08 INFO mlflow.models.container: creating and activating custom environment
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/mlflow/models/container/__init__.py", line 53, in _init
    _serve(env_manager)
  File "/usr/local/lib/python3.8/dist-packages/mlflow/models/container/__init__.py", line 75, in _serve
    _serve_pyfunc(m, env_manager)
  File "/usr/local/lib/python3.8/dist-packages/mlflow/models/container/__init__.py", line 147, in _serve_pyfunc
    _install_pyfunc_deps(
  File "/usr/local/lib/python3.8/dist-packages/mlflow/models/container/__init__.py", line 111, in _install_pyfunc_deps
    env_activate_cmd = _get_or_create_virtualenv(model_path)
  File "/usr/local/lib/python3.8/dist-packages/mlflow/utils/virtualenv.py", line 339, in _get_or_create_virtualenv
    _validate_pyenv_is_available()
  File "/usr/local/lib/python3.8/dist-packages/mlflow/utils/virtualenv.py", line 64, in _validate_pyenv_is_available
    raise MlflowException(
mlflow.exceptions.MlflowException: Could not find the pyenv binary. See https://github.com/pyenv/pyenv#installation for installation instructions.

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/gateway: AI Gateway service, Gateway client APIs, third-party Gateway integrations
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

About this issue

  • State: open
  • Created 9 months ago
  • Comments: 22 (16 by maintainers)

Most upvoted comments

@anirvansen there are a few things you could do.

  1. Create your own custom MLflow pyfunc image
  2. Use the boto3 SDK directly for serverless deployments

Neither option is great.

I’ve explored the first option a little, but the pyfunc image is a bit of a black box and I’m not completely sure how to solve the pyenv issue inside it. With the second option you have to supply your own SageMaker image for deployment/inference and lose all the built-in logic the generic pyfunc image provides (type validation, error handling, etc.). A rough sketch of the first option follows below.
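
For reference, if you do manage to build a custom image that bundles pyenv (or otherwise avoids the virtualenv setup), my understanding is that it can be handed to the deployment client through the image_url config key. A minimal sketch, assuming the ECR URI is a placeholder and model_uri / role_arn / bucket_name are defined as in the repro above:

from mlflow.deployments import get_deploy_client

# Placeholder ECR URI for a custom-built pyfunc serving image.
custom_image_url = "123456789012.dkr.ecr.us-east-1.amazonaws.com/custom-mlflow-pyfunc:2.7.1"

client = get_deploy_client("sagemaker")
client.create_deployment(
    name="testing-serverless",
    model_uri=model_uri,
    flavor="python_function",
    config={
        "region_name": "us-east-1",
        "execution_role_arn": role_arn,
        "bucket": bucket_name,
        "image_url": custom_image_url,  # use the custom image instead of the default
        "serverless_config": {"MemorySizeInMB": 2048, "MaxConcurrency": 20},
    },
)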