mlflow: [BUG] Using MlflowClient.get_latest_versions with an older server instance causes 404

Willingness to contribute

The MLflow Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the MLflow code base?

Yes. I can contribute a fix for this bug independently.
Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.
No. I cannot contribute a bug fix at this time.

System information

Have I written custom code (as opposed to using a stock example script provided in MLflow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): SUSE Linux Enterprise Server 12 SP2
MLflow installed from (source or binary): Binary
MLflow version (run mlflow --version): 1.22 for the client, 1.11.0 for the server
Python version: 3.6.12
npm version, if running the dev UI:
Exact command to reproduce: MlflowClient().get_latest_versions("SOME_REGISTERED_MODEL")

Describe the problem

We have MLflow 1.11.0 installed on a server and version 1.22 on the client. When the client is connected to the server and calls get_latest_versions, the following error is raised:

MlflowException: API request to endpoint /api/2.0/mlflow/registered-models/get-latest-versions failed with error code 404 != 200. Response body: '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">

<title>404 Not Found</title>

<h1>Not Found</h1>

<p>The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.</p>

The expected result would have been the latest version information of the model.

Code to reproduce issue

Prerequisites:

Run mlflow version 1.11.0 on a server in a conda environment
Connect a client to the mlflow server using mlflow.set_tracking_uri()
Create an experiment on the server using MlflowClient.create_experiment

Now run:

import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("TRACKING_URI")

client = MlflowClient()
client.get_latest_versions("EXPERIMENT_NAME")

Other info / logs

This problem seems to have been introduced with #4999. That PR added support for POST calls on get-latest-versions. This by itself is not a problem, since the PR’s author handles servers where the POST call is not available: the client catches the ENDPOINT_NOT_FOUND exception and falls back to a GET call.

The problem arises, however, because the client calls /mlflow/registered-models/get-latest-versions instead of the usual /preview/mlflow/registered-models/get-latest-versions. As a result, an older server does not raise an ENDPOINT_NOT_FOUND exception but returns a plain 404 Not Found. That error is not caught, so the program never tries the GET call and crashes instead.

It seems to me that the omission of preview in the URL is the root cause of the bug (see: https://github.com/stevenchen-db/mlflow/blob/9bbbb0c28d285476e0f3e2a81ecfbf577d1b03ca/mlflow/protos/model_registry.proto#L133), but I’m not sure if that is on purpose or not. If the omission of preview is on purpose, some other way of exception handling could be implemented to avoid getting 404-errors when working with older servers that do not support POST-calls.
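
To illustrate the "other way of exception handling" idea, here is a minimal sketch of a check that treats both failure shapes as "endpoint not served". It is not MLflow code: the function name `endpoint_missing` and its `(status_code, body)` inputs are hypothetical, and the rule that a non-JSON 404 means "unknown route" is an assumption based on the HTML error page quoted above.

```python
import json


def endpoint_missing(status_code: int, body: str) -> bool:
    """Return True if a response suggests the endpoint is not served at all."""
    try:
        payload = json.loads(body)
    except ValueError:
        payload = None  # not JSON, e.g. the HTML error page quoted in this issue
    if isinstance(payload, dict) and payload.get("error_code") == "ENDPOINT_NOT_FOUND":
        return True  # structured error from a server that recognises the API layout
    # Assumption: a 404 without a JSON body means the route itself is unknown.
    # A 404 that carries a JSON error (e.g. a missing model) is a real answer.
    return status_code == 404 and payload is None
```

A caller could use such a check to decide whether falling back to the GET call is worthwhile, instead of relying on the ENDPOINT_NOT_FOUND error code alone.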

Since the reason behind the missing preview part is not entirely clear to me, I did not want to provide a bug fix immediately. If the maintainers could provide some guidance on which approach to fix this bug would be best for this project, I’d be happy to help implement the change.

What component(s), interfaces, languages, and integrations does this bug affect?

Components

area/artifacts: Artifact stores and artifact logging
area/build: Build and test infrastructure for MLflow
area/docs: MLflow documentation pages
area/examples: Example code
area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
area/models: MLmodel format, model serialization/deserialization, flavors
area/projects: MLproject format, project running backends
area/scoring: MLflow Model server, model deployment tools, Spark UDFs
area/server-infra: MLflow Tracking server backend
area/tracking: Tracking Service, tracking client APIs, autologging

Interface

area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support

Language

language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages

Integrations

integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations

About this issue

State: open
Created 3 years ago
Comments: 17 (3 by maintainers)

Most upvoted comments

Reporting the same issue with 1.23.1. To be more precise, it yields 405 HTTP response code: Method Not Allowed.

One solution to the issue is to make sure that both the Python and MLflow versions are the same on client and server.

It would be good if a version mismatch were allowed in the future; this would facilitate MLflow interaction across different projects. Otherwise it is quite troublesome to maintain all projects, with their different developers, at the same time.

Needless to say, the versioning issue also affects cloudpickle, so either way, in the current state of things, the version overlap appears to be required.

krstp on Feb 2, 2022

Similar issue: we have a tracking service running on Python 3.8 + MLflow 2.8, but it needs to support clients on Python 3.6 (so MLflow 1.23). I ended up patching the endpoint structure on the client:

def _patch_mlf_v1_api_endpoints() -> None:
    """
    MLF v1.x -> v2.x endpoint patcher.
    """
    from mlflow.protos import databricks_pb2
    from mlflow.protos.model_registry_pb2 import ModelRegistryService
    from mlflow.utils.rest_utils import _REST_API_PATH_PREFIX
    import mlflow.store.model_registry.rest_store

    def _patched_extract_api_info_for_service(service, path_prefix):
        service_methods = service.DESCRIPTOR.methods
        res = {}
        for service_method in service_methods:
            # use the second endpoint declared for each RPC instead of the first
            endpoint_idx = 1
            endpoint = service_method.GetOptions().Extensions[databricks_pb2.rpc].endpoints[endpoint_idx]
            endpoint_path = f'{path_prefix}{endpoint.path}'
            res[service().GetRequestClass(service_method)] = (endpoint_path, endpoint.method)
        return res

    # patch endpoints stored in _METHOD_TO_INFO for MLF 2.x compatibility
    if int(mlflow.__version__.split('.')[0]) < 2:
        mlflow.store.model_registry.rest_store._METHOD_TO_INFO = _patched_extract_api_info_for_service(
            ModelRegistryService,
            _REST_API_PATH_PREFIX
        )

_patch_mlf_v1_api_endpoints()

astan-iq on Nov 21, 2023

I came across a similar issue for the get-latest-versions endpoint.
MLflow server version: 1.21.0 (docs: https://www.mlflow.org/docs/1.21.0/rest-api.html#get-latest-modelversions)
MLflow client version: 1.24.0 (docs: https://www.mlflow.org/docs/1.24.0/rest-api.html#get-latest-modelversions)

The obvious first: the request method changed from GET to POST. Apparently, the MLflow team tries to accommodate exactly this kind of change here: https://github.com/mlflow/mlflow/blob/e78d6e90b0011b4ad33aa9cda84e8e0c7d202349/mlflow/utils/rest_utils.py#L265-L270. This does not work, because if the first method (POST) fails, the exception escapes the loop and the GET method is never tried.

While debugging, I don’t get past the first entry, and the exception is re-raised.

A current workaround is to down-/upgrade either client or server until the REST API matches. To the MLflow team (not meant to be offensive):

It looks like you already have API versioning in place, so please increment the API version when introducing breaking changes. Do not try to accommodate different API versions in the code; it will get harder and harder to read (and frankly, the code is fairly convoluted at this point).
IMHO your semantic versioning is off because changing the API is a major (= breaking) change and not a minor change.
As a quick fix to this issue, do not throw exceptions unless the for loop is exhausted.
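
The quick fix in the last point could look roughly like the sketch below. It is an illustration only, not the code in rest_utils.py: `EndpointUnavailable`, `call_first_available` and the `send` callable are hypothetical names, and the sketch assumes the transport raises a dedicated exception when a method/path pair is not served.

```python
from typing import Callable, Optional, Sequence, Tuple


class EndpointUnavailable(Exception):
    """Hypothetical error: this (method, path) pair is not served by the server."""


def call_first_available(
    send: Callable[[str, str], dict],
    candidates: Sequence[Tuple[str, str]],
) -> dict:
    """Try each (method, path) in order and raise only once all of them failed."""
    last_error: Optional[EndpointUnavailable] = None
    for method, path in candidates:
        try:
            return send(method, path)
        except EndpointUnavailable as exc:
            last_error = exc  # remember the failure, keep trying the rest
    if last_error is None:
        raise ValueError("no candidate endpoints were given")
    raise last_error  # the loop is exhausted, so now the error may surface
```

With candidates such as `[("POST", path), ("GET", path)]`, a failing POST no longer prevents the GET attempt; the exception only propagates when every candidate has been tried.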

tafaust on Mar 31, 2022

@tahesse How would one change their REST API version on either the client or server? I have been unable to find info on how to do that thus far in this case. Any help would be appreciated

ggustavs on Mar 24, 2023

Thanks @daanknoope - please add me as a reviewer and I’m happy to take a look when you have a contribution ready!

Agreed - we could maybe add a note to the README that clarifies the behavior when versions are mismatched. cc: @harupy

@krstp thanks for adding some more color there around the HTTP response code. Actually, more precisely, the only requirement is that the server must have a version greater than the client. IMO, it’s not particularly reasonable for new clients to have to support all older servers: we add new functionality and APIs to the fluent APIs and clients, and the client would then have to special-case older servers that don’t support this new functionality.
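
As a rough illustration of that rule of thumb, a client-side pre-flight check could compare the two version strings before talking to the server. This is a sketch under assumptions: the helper names are made up, "greater than" is read here as "at least as new as", and how the server version is obtained is left open. It is not a compatibility guarantee either; the MLflow 2.8 server / 1.23 client report earlier in this thread still needed endpoint patching.

```python
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, int, int]:
    """Turn '1.22' or '1.23.1' into a (major, minor, patch) tuple of ints."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # ignore suffixes such as 'rc1' or '.dev0'
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # treat '1.22' like '1.22.0'
    return (parts[0], parts[1], parts[2])


def server_is_new_enough(client_version: str, server_version: str) -> bool:
    """Assumed rule from the comment above: the server must not be older than the client."""
    return release_tuple(server_version) >= release_tuple(client_version)
```

For the versions in the original report, `server_is_new_enough("1.22", "1.11.0")` evaluates to `False`, which matches the observed failure.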

ankit-db on Feb 2, 2022

@ankit-db apologies for the delay, it’s been quite hectic for me the last few weeks. Would still like to help solve this issue, and will contribute as soon as I have time to do so.

@krstp you raise a good point about cloudpickle; maybe a note should be added to the documentation to warn about unexpected behaviour when there is a mismatch in versions?

daanknoope on Feb 2, 2022