mlflow: RunId not found when executing "mlflow run" with remote tracking server

System information

Have I written custom code (as opposed to using a stock example script provided in MLflow):
OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
MLflow installed from (source or binary):
MLflow version (run mlflow --version): 0.7.0
Python version: 3.6
npm version (if running the dev UI):
Exact command to reproduce:

Describe the problem

I followed the tutorial and got everything to work, then added remote tracking to the train.py script so that it logs everything to a remote mlflow server. That works well when executing the script directly via python, but when packaging that same script and running it with mlflow run, I see errors indicating that mlflow generates a unique run ID, tries to look it up in the remote tracking server, and fails.

The client-side (where mlflow run runs) error is:

    icsl6700> .venv/bin/mlflow run tutorial -P alpha=0.42
    === Created directory /tmp/tmp65fqzyo2 for downloading remote URIs passed to arguments of type 'path' ===
    === Running command 'source activate mlflow-3eee9bd7a0713cf80a17bc0a4d659bc9c549efac && python train.py 0.42 0.1' in run with ID '63e35b4f66164976b27f3d849c0fe72e' ===
    API request to http://mlflow-server:5000/api/2.0/preview/mlflow/runs/get failed with code 500 != 200, retrying up to 2 more times. API response body: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
    <title>500 Internal Server Error</title>
    Internal Server Error
    The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
    API request to http://mlflow-server:5000/api/2.0/preview/mlflow/runs/get failed with code 500 != 200, retrying up to 1 more times. API response body: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
    <title>500 Internal Server Error</title>
    Internal Server Error
    The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
    API request to http://mlflow-server:5000/api/2.0/preview/mlflow/runs/get failed with code 500 != 200, retrying up to 0 more times. API response body: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
    <title>500 Internal Server Error</title>
    Internal Server Error
    The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
    Traceback (most recent call last):
      File "train.py", line 40, in <module>
        with mlflow.start_run():
      File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/mlflow/tracking/fluent.py", line 105, in start_run
        active_run_obj = MlflowClient().get_run(existing_run_uuid)
      File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/mlflow/tracking/client.py", line 37, in get_run
        return self.store.get_run(run_id)
      File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/mlflow/store/rest_store.py", line 132, in get_run
        response_proto = self._call_endpoint(GetRun, req_body)
      File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/mlflow/store/rest_store.py", line 68, in _call_endpoint
        json=json_body)
      File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/mlflow/utils/rest_utils.py", line 53, in http_request
        (url, retries))
    mlflow.exceptions.MlflowException: API request to http://mlflow-server:5000/api/2.0/preview/mlflow/runs/get failed to return code 200 after 3 tries

And the server-side error is:

    Oct 07 17:38:49 icsl6688 mlflow[19938]: [2018-10-07 17:38:49,690] ERROR in app: Exception on /api/2.0/preview/mlflow/runs/get [GET]
    Oct 07 17:38:49 icsl6688 mlflow[19938]: Traceback (most recent call last):
    Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/flask/app.py", line 2292, in wsgi_app
    Oct 07 17:38:49 icsl6688 mlflow[19938]:     response = self.full_dispatch_request()
    Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/flask/app.py", line 1815, in full_dispatch_request
    Oct 07 17:38:49 icsl6688 mlflow[19938]:     rv = self.handle_user_exception(e)
    Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/flask/app.py", line 1718, in handle_user_exception
    Oct 07 17:38:49 icsl6688 mlflow[19938]:     reraise(exc_type, exc_value, tb)
    Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/flask/_compat.py", line 35, in reraise
    Oct 07 17:38:49 icsl6688 mlflow[19938]:     raise value
    Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
    Oct 07 17:38:49 icsl6688 mlflow[19938]:     rv = self.dispatch_request()
    Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
    Oct 07 17:38:49 icsl6688 mlflow[19938]:     return self.view_functions[rule.endpoint](**req.view_args)
    Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/mlflow/server/handlers.py", line 218, in _get_run
    Oct 07 17:38:49 icsl6688 mlflow[19938]:     response_message.run.MergeFrom(_get_store().get_run(request_message.run_uuid).to_proto())
    Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/mlflow/store/file_store.py", line 296, in get_run
    Oct 07 17:38:49 icsl6688 mlflow[19938]:     run_info = self._get_run_info(run_uuid)
    Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/mlflow/store/file_store.py", line 310, in _get_run_info
    Oct 07 17:38:49 icsl6688 mlflow[19938]:     raise Exception("Run '%s' not found" % run_uuid)
    Oct 07 17:38:49 icsl6688 mlflow[19938]: Exception: Run '6c72666f96ff4e0fbd7cb2002f074460' not found

Since mlflow run sets a random run ID in the environment, mlflow picks it up, tries to look it up in the tracking server, and fails. Perhaps I am missing something about the correct use case of mlflow run, but assuming packaged code can contain a remote tracking URI, won’t it always fail when somebody else tries to run that packaged code with mlflow run?

About this issue

State: closed
Created 6 years ago
Reactions: 3
Comments: 20 (5 by maintainers)

Most upvoted comments

How are y’all specifying the tracking URI? I’d recommend doing so by setting the MLFLOW_TRACKING_URI environment variable (which will be propagated to the Python process launched to run your Python script) instead of calling mlflow.set_tracking_uri("http://some-uri") within your Python script.

Calling mlflow.set_tracking_uri("http://some-uri") within your Python script and then attempting to execute the script via mlflow run won’t work unless you set the MLFLOW_TRACKING_URI environment variable to the same value before invoking mlflow run, e.g. via MLFLOW_TRACKING_URI=http://some-uri mlflow run ....
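As a rough sketch of that invocation from Python (not something posted in the thread), the launcher below exports MLFLOW_TRACKING_URI before calling the mlflow run CLI. The helper names `build_mlflow_run` and `launch` are hypothetical, and the URI is the same placeholder used above.

```python
import os
import subprocess


def build_mlflow_run(project_dir, params, tracking_uri, base_env=None):
    """Return (command, environment) for `mlflow run` with the tracking URI exported.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    env = dict(os.environ if base_env is None else base_env)
    # The child process that executes the project entry point inherits this variable.
    env["MLFLOW_TRACKING_URI"] = tracking_uri
    cmd = ["mlflow", "run", project_dir]
    for key, value in params.items():
        cmd += ["-P", f"{key}={value}"]
    return cmd, env


def launch(project_dir, params, tracking_uri):
    """Run the project; raises CalledProcessError if `mlflow run` exits non-zero."""
    cmd, env = build_mlflow_run(project_dir, params, tracking_uri)
    subprocess.run(cmd, env=env, check=True)
```

Calling `launch("tutorial", {"alpha": 0.42}, "http://some-uri")` should then be equivalent to `MLFLOW_TRACKING_URI=http://some-uri mlflow run tutorial -P alpha=0.42`, assuming the mlflow CLI is on the PATH.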

This is because (as @sonnehansen mentioned) mlflow run first creates a run against the currently-configured tracking URI (so by default, the ./mlruns directory on your local filesystem) and then attempts to load that run when start_run is called in your Python script. If the tracking URIs are different in the two cases, you may see the errors reported above.
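The mismatch can be illustrated with a toy model. This is only a sketch: `ToyStore` and `simulate` are made-up stand-ins, not MLflow's store classes, and the run IDs are invented. The launcher creates a run in one store and the script then tries to load it from another.

```python
class ToyStore:
    """Stand-in for a tracking backend; NOT one of MLflow's real store classes."""

    def __init__(self, name):
        self.name = name
        self.runs = set()

    def create_run(self, run_id):
        self.runs.add(run_id)
        return run_id

    def get_run(self, run_id):
        if run_id not in self.runs:
            raise LookupError("Run '%s' not found in %s" % (run_id, self.name))
        return run_id


def simulate(launcher_store, script_store, run_id):
    """The launcher creates the run in one store; the script loads it from another."""
    launcher_store.create_run(run_id)    # what `mlflow run` does up front
    return script_store.get_run(run_id)  # what start_run() does inside the script


local = ToyStore("./mlruns")
remote = ToyStore("http://some-uri")

# Same store on both sides: the lookup succeeds.
same_store_result = simulate(remote, remote, "run-a")

# Different stores: the script cannot find the run the launcher created.
try:
    simulate(local, remote, "run-b")
    mismatch_error = None
except LookupError as exc:
    mismatch_error = str(exc)
```

In the second case the run exists only in the local store, which matches the symptom in this issue: the server reports the run as not found even though mlflow run printed its ID.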


smurching on Jan 15, 2019

In my case it was just a matter of moving the mlflow.set_tracking_uri above mlflow.start_run()

So I changed from this:

    with mlflow.start_run() as run:
        mlflow.set_tracking_uri('http://localhost:5000')
        log_param("param1", randint(0, 100))

To this:

    mlflow.set_tracking_uri('http://localhost:5000')
    with mlflow.start_run() as run:
        log_param("param1", randint(0, 100))

BartlomiejSkwira on Apr 21, 2021

I also have the same issue with version 1.8.0. I’ve tried both MLFLOW_TRACKING_URI and mlflow.set_tracking_uri.

When I use MLFLOW_TRACKING_URI, it does not connect to the server but just stores the runs in the project’s mlruns folder.

techtutor-co on Jun 15, 2020

OK, I think https://github.com/mlflow/mlflow/issues/2649 is a duplicate of this one. But is it really solved? I agree that @smurching’s solution works, but is this really the optimal behaviour?

It seems to me that setting the URI in the script looks cleaner than manually setting an environment variable.
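One possible middle ground, sketched here as an assumption rather than anything MLflow prescribes, is to let the script prefer MLFLOW_TRACKING_URI when it is set and fall back to a URI kept in the script otherwise. `resolve_tracking_uri`, `FALLBACK_TRACKING_URI`, and `main` are hypothetical names.

```python
import os

FALLBACK_TRACKING_URI = "http://some-uri"  # hypothetical default kept in the script


def resolve_tracking_uri(environ=None, fallback=FALLBACK_TRACKING_URI):
    """Prefer the environment variable so the script agrees with its launcher."""
    environ = os.environ if environ is None else environ
    return environ.get("MLFLOW_TRACKING_URI") or fallback


def main():
    # Imported lazily so the helper above has no third-party dependencies.
    import mlflow

    # Set the URI before start_run(), as suggested earlier in the thread.
    mlflow.set_tracking_uri(resolve_tracking_uri())
    with mlflow.start_run():
        mlflow.log_param("param1", 1)
```

Note that the fallback only helps when the script is executed directly with python. Under mlflow run without the environment variable set, the launcher would presumably still create its run against the default location, so the mismatch described above would remain.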

Tyrannas on Mar 31, 2020

for me this helped:

    import mlflow

    experiment_name = "my_experiment"

    mlflow.set_experiment(experiment_name)
    experiment = mlflow.get_experiment_by_name(experiment_name)
    client = mlflow.tracking.MlflowClient()
    # Create the run explicitly against the configured tracking URI, then resume it by ID
    run = client.create_run(experiment.experiment_id)
    with mlflow.start_run(run_id=run.info.run_id):
        pass  # do stuff now

pcjedi on Dec 22, 2021

I experience the same issue using python 3.7.0 and mlflow 0.7.0. I use a remote mlflow server (running in Docker) and then try to invoke mlflow run…

From my simple understanding, it seems that mlflow run creates a new RUN_ID on the fly. A subsequent request to find it on the remote server fails because it does not exist there.

Client side:

    $ mlflow run .
    === Created directory /tmp/tmpxrufk7z9 for downloading remote URIs passed to arguments of type 'path' ===
    === Running command 'source activate mlflow-5003dfa88d6c9061e9623d2074be952dd30f85f0 && python simple_mlflow.py gini 7' in run with ID '591aa2b23550476ca3af1715e4d9da91' ===
    API request to http://osi5398:5000/api/2.0/preview/mlflow/runs/get failed with code 500 != 200, retrying up to 2 more times. API response body: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
    <title>500 Internal Server Error</title>
    Internal Server Error
    The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
    API request to http://osi5398:5000/api/2.0/preview/mlflow/runs/get failed with code 500 != 200, retrying up to 1 more times. API response body: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
    <title>500 Internal Server Error</title>
    Internal Server Error
    The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
    API request to http://osi5398:5000/api/2.0/preview/mlflow/runs/get failed with code 500 != 200, retrying up to 0 more times. API response body: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
    <title>500 Internal Server Error</title>
    Internal Server Error
    The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
    Traceback (most recent call last):
      File "simple_mlflow.py", line 23, in <module>
        with mlflow.start_run():
      File "/opt/apps/miniconda3/envs/mlflow-5003dfa88d6c9061e9623d2074be952dd30f85f0/lib/python3.6/site-packages/mlflow/tracking/fluent.py", line 105, in start_run
        active_run_obj = MlflowClient().get_run(existing_run_uuid)
      File "/opt/apps/miniconda3/envs/mlflow-5003dfa88d6c9061e9623d2074be952dd30f85f0/lib/python3.6/site-packages/mlflow/tracking/client.py", line 37, in get_run
        return self.store.get_run(run_id)
      File "/opt/apps/miniconda3/envs/mlflow-5003dfa88d6c9061e9623d2074be952dd30f85f0/lib/python3.6/site-packages/mlflow/store/rest_store.py", line 126, in get_run
        response_proto = self._call_endpoint(GetRun, req_body)
      File "/opt/apps/miniconda3/envs/mlflow-5003dfa88d6c9061e9623d2074be952dd30f85f0/lib/python3.6/site-packages/mlflow/store/rest_store.py", line 68, in _call_endpoint
        json=json_body)
      File "/opt/apps/miniconda3/envs/mlflow-5003dfa88d6c9061e9623d2074be952dd30f85f0/lib/python3.6/site-packages/mlflow/utils/rest_utils.py", line 53, in http_request
        (url, retries))
    mlflow.exceptions.MlflowException: API request to http://osi5398:5000/api/2.0/preview/mlflow/runs/get failed to return code 200 after 3 tries
    === Run (ID '591aa2b23550476ca3af1715e4d9da91') failed ===

Server side:

    [2018-10-19 11:04:15,093] ERROR in app: Exception on /api/2.0/preview/mlflow/runs/get [GET]
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2292, in wsgi_app
        response = self.full_dispatch_request()
      File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1815, in full_dispatch_request
        rv = self.handle_user_exception(e)
      File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1718, in handle_user_exception
        reraise(exc_type, exc_value, tb)
      File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 35, in reraise
        raise value
      File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
        rv = self.dispatch_request()
      File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1799, in dispatch_request
        return self.view_functions[rule.endpoint](**req.view_args)
      File "/usr/local/lib/python3.7/site-packages/mlflow/server/handlers.py", line 218, in _get_run
        response_message.run.MergeFrom(_get_store().get_run(request_message.run_uuid).to_proto())
      File "/usr/local/lib/python3.7/site-packages/mlflow/store/file_store.py", line 296, in get_run
        run_info = self._get_run_info(run_uuid)
      File "/usr/local/lib/python3.7/site-packages/mlflow/store/file_store.py", line 310, in _get_run_info
        raise Exception("Run '%s' not found" % run_uuid)
    Exception: Run '591aa2b23550476ca3af1715e4d9da91' not found

Looking into the server-side mlruns folder, the RUN_ID is not in the list of existing runs:

    root@a47d655bc6f6:/# ls mlflow/mlruns/0
    03f397a8cd774d5b925c68f3bfd3b4f0  4edc44175ef3435ba15ac161b7a490fc  meta.yaml
    1aa362c53ab14188bfb824670e675616  ae1314bf8e2442b9a921d4c49a2b3a99
    4b635d741d3345b7b7c0d13173324c1f  e2c64a044f044bd69badd4ba32ffa34f
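The same check can be scripted. The sketch below assumes the file-store layout visible in the listing (one directory per experiment under mlruns, one subdirectory per run); `find_run_dir` is a hypothetical helper, and the demo runs against a throwaway directory rather than a real server.

```python
import os
import tempfile


def find_run_dir(mlruns_root, run_id):
    """Return the run's directory under any experiment folder, or None if absent."""
    if not os.path.isdir(mlruns_root):
        return None
    for experiment in sorted(os.listdir(mlruns_root)):
        candidate = os.path.join(mlruns_root, experiment, run_id)
        if os.path.isdir(candidate):
            return candidate
    return None


# Demo on a temporary tree that mimics mlruns/0/<run_id>.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "0", "03f397a8cd774d5b925c68f3bfd3b4f0"))

found = find_run_dir(demo_root, "03f397a8cd774d5b925c68f3bfd3b4f0")
missing = find_run_dir(demo_root, "591aa2b23550476ca3af1715e4d9da91")
```

Here `found` points at the existing run directory while `missing` is None, mirroring the situation above where the ID printed by mlflow run is absent on the server.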

sonnehansen on Oct 19, 2018

Facing the same issue. Any progress or workaround? thanks

anisnouri on Oct 31, 2018

Apparently mlflow run actually sets that environment variable by itself for some reason (which then follows the correct logic of trying to “resume” - but I don’t think it should be setting it in the first place).

dmarkhas on Oct 8, 2018