mlflow: [BUG] Sagemaker serverless deployment fails using mlflow-pyfunc image

Issues Policy acknowledgement

  • I have read and agree to submit bug reports in accordance with the issues policy

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

  • Client: 2.7.0-2.7.1
  • Tracking server: 1.x.y

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
  • Python version: 3.8/3.9
  • yarn version, if running the dev UI:

Describe the problem

When deploying a pyfunc model to a serverless SageMaker endpoint using the default MLflow pyfunc image (2.7.1 in this case), endpoint creation fails. There appears to be an issue with how dependencies are packaged, specifically with accessing pyenv: the pyenv binary is missing from the environment the serving container executes in. A local check is sketched below.
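
If you want to confirm the missing binary before going through SageMaker, one option is to run the serving image locally and ask it for pyenv. This is only a rough diagnostic sketch, not MLflow functionality: the image tag is a placeholder for however the mlflow-pyfunc 2.7.1 image is tagged in your registry, and it assumes Docker is available on the machine running the check.

import subprocess

# Placeholder tag; substitute the actual mlflow-pyfunc 2.7.1 image you build/push.
image = "mlflow-pyfunc:2.7.1"

# Override the entrypoint so the container simply tries to run `pyenv --version`.
# If pyenv is absent from the image, this exits non-zero, which is consistent
# with the MlflowException shown in the CloudWatch logs below.
result = subprocess.run(
    ["docker", "run", "--rm", "--entrypoint", "pyenv", image, "--version"],
    capture_output=True,
    text=True,
)
print(result.returncode)
print(result.stdout or result.stderr)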

Also, as it stands today it is not possible to update a real-time endpoint to serverless. That functionality is built into MLflow but does not work as expected: https://github.com/mlflow/mlflow/blob/0d5adbd03be636b89f8b087f270f2aaedd19da93/mlflow/sagemaker/__init__.py#L1706C1-L1707C71
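
In the meantime, the conversion can be done outside MLflow. A minimal sketch with boto3 (this is not an MLflow API; the endpoint, model, and config names below are placeholders), which reuses the already-registered SageMaker model and points the existing endpoint at a new serverless endpoint config:

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

# New endpoint config that attaches a ServerlessConfig instead of instance settings.
sm.create_endpoint_config(
    EndpointConfigName="testing-serverless-config",   # placeholder
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "testing-serverless-model",   # name of the existing SageMaker model
            "ServerlessConfig": {"MemorySizeInMB": 2048, "MaxConcurrency": 20},
        }
    ],
)

# Switch the existing real-time endpoint over to the serverless config.
sm.update_endpoint(
    EndpointName="testing-serverless",                 # placeholder
    EndpointConfigName="testing-serverless-config",
)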

Tracking information

n/a

Code to reproduce issue

from mlflow.deployments import get_deploy_client

sagemaker_client = get_deploy_client('sagemaker')

model_uri = 'a_unique_uri'  # URI of the logged pyfunc model
role_arn = 'arn:aws:iam::123456789012:role/sagemaker-execution-role'  # placeholder execution role
bucket_name = "test_bucket"

serverless_config = {
    "MemorySizeInMB": 2048,
    "MaxConcurrency": 20,
}
config = {
    'region_name': "us-east-1",
    'execution_role_arn': role_arn,
    'bucket': bucket_name,
    'serverless_config': serverless_config,
}

model_name = "testing-serverless"
flavor = "python_function"
endpoint = None

sagemaker_client.create_deployment(
    model_name,
    model_uri,
    flavor,
    config,
    endpoint,
)

Stack trace

The deployment operation failed with the following error message: "An unknown SageMaker failure occurred. Please see the SageMaker console logs for more information."
Error creating model
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.17_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/Cellar/python@3.9/3.9.17_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/build.py", line 27, in <module>
    start()
  File "/Users/build.py", line 23, in start
    raise e
  File "/Users/build.py", line 19, in start
    deploy.aws(packaged_model)
  File "/Users/deploy.py", line 87, in aws
    sagemaker_client.create_deployment(
  File "/Users/.local/share/virtualenvs/my_model-BEzerOf2-python/lib/python3.9/site-packages/mlflow/sagemaker/__init__.py", line 2261, in create_deployment
    app_name, flavor = _deploy(
  File "/Users/.local/share/virtualenvs/my_model-BEzerOf2-python/lib/python3.9/site-packages/mlflow/sagemaker/__init__.py", line 471, in _deploy
    raise MlflowException(
mlflow.exceptions.MlflowException: The deployment operation failed with the following error message: "An unknown SageMaker failure occurred. Please see the SageMaker console logs for more information."

Other info / logs

# AWS CloudWatch logs from the endpoint
2023/10/03 12:52:08 INFO mlflow.models.container: creating and activating custom environment
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/mlflow/models/container/__init__.py", line 53, in _init
    _serve(env_manager)
  File "/usr/local/lib/python3.8/dist-packages/mlflow/models/container/__init__.py", line 75, in _serve
    _serve_pyfunc(m, env_manager)
  File "/usr/local/lib/python3.8/dist-packages/mlflow/models/container/__init__.py", line 147, in _serve_pyfunc
    _install_pyfunc_deps(
  File "/usr/local/lib/python3.8/dist-packages/mlflow/models/container/__init__.py", line 111, in _install_pyfunc_deps
    env_activate_cmd = _get_or_create_virtualenv(model_path)
  File "/usr/local/lib/python3.8/dist-packages/mlflow/utils/virtualenv.py", line 339, in _get_or_create_virtualenv
    _validate_pyenv_is_available()
  File "/usr/local/lib/python3.8/dist-packages/mlflow/utils/virtualenv.py", line 64, in _validate_pyenv_is_available
    raise MlflowException(
mlflow.exceptions.MlflowException: Could not find the pyenv binary. See https://github.com/pyenv/pyenv#installation for installation instructions.

What component(s) does this bug affect?

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/gateway: AI Gateway service, Gateway client APIs, third-party Gateway integrations
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

What language(s) does this bug affect?

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

About this issue

  • State: open
  • Created 9 months ago
  • Comments: 22 (16 by maintainers)

Most upvoted comments

@anirvansen there are a few things you could do.

  1. Create your own custom MLflow pyfunc image
  2. Use the boto3 SDK directly for serverless deployments

Neither option is great.

I’ve explored the first option a little, but the pyfunc image is a bit of a black box and I’m not completely sure how to solve the pyenv issue inside it. With the second option you have to supply your own SageMaker image for deployment/inference and lose all the built-in logic the generic pyfunc image provides (type validation, error handling, etc.). A rough sketch of the first option follows below.
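
For reference, if you do manage to build a custom image that bundles pyenv (or otherwise avoids the virtualenv setup), my understanding is that it can be handed to the deployment client through the image_url config key. A minimal sketch, assuming the ECR URI is a placeholder and model_uri / role_arn / bucket_name are defined as in the repro above:

from mlflow.deployments import get_deploy_client

# Placeholder ECR URI for a custom-built pyfunc serving image.
custom_image_url = "123456789012.dkr.ecr.us-east-1.amazonaws.com/custom-mlflow-pyfunc:2.7.1"

client = get_deploy_client("sagemaker")
client.create_deployment(
    name="testing-serverless",
    model_uri=model_uri,
    flavor="python_function",
    config={
        "region_name": "us-east-1",
        "execution_role_arn": role_arn,
        "bucket": bucket_name,
        "image_url": custom_image_url,  # use the custom image instead of the default
        "serverless_config": {"MemorySizeInMB": 2048, "MaxConcurrency": 20},
    },
)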