mlflow: RunId not found when executing "mlflow run" with remote tracking server
System information
- Have I written custom code (as opposed to using a stock example script provided in MLflow):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
- MLflow installed from (source or binary):
- MLflow version (run mlflow --version): 0.7.0
- Python version: 3.6
- npm version (if running the dev UI):
- Exact command to reproduce:
Describe the problem
I followed the tutorial and got everything to work, then added remote tracking to the train.py script so that it logs everything to a remote mlflow server. That works well when executing the script directly via python, but when packaging that same script and running it with mlflow run, I see errors indicating that mlflow generates a unique run ID and then tries to look it up in the remote tracking server, which fails.
The client-side (where mlflow run runs) error is:
icsl6700> .venv/bin/mlflow run tutorial -P alpha=0.42
=== Created directory /tmp/tmp65fqzyo2 for downloading remote URIs passed to arguments of type 'path' ===
=== Running command 'source activate mlflow-3eee9bd7a0713cf80a17bc0a4d659bc9c549efac && python train.py 0.42 0.1' in run with ID '63e35b4f66164976b27f3d849c0fe72e' ===
API request to http://mlflow-server:5000/api/2.0/preview/mlflow/runs/get failed with code 500 != 200, retrying up to 2 more times. API response body: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
Internal Server Error
The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
API request to http://mlflow-server:5000/api/2.0/preview/mlflow/runs/get failed with code 500 != 200, retrying up to 1 more times. API response body: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
Internal Server Error
The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
API request to http://mlflow-server:5000/api/2.0/preview/mlflow/runs/get failed with code 500 != 200, retrying up to 0 more times. API response body: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
Internal Server Error
The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
Traceback (most recent call last):
  File "train.py", line 40, in <module>
    with mlflow.start_run():
  File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/mlflow/tracking/fluent.py", line 105, in start_run
    active_run_obj = MlflowClient().get_run(existing_run_uuid)
  File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/mlflow/tracking/client.py", line 37, in get_run
    return self.store.get_run(run_id)
  File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/mlflow/store/rest_store.py", line 132, in get_run
    response_proto = self._call_endpoint(GetRun, req_body)
  File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/mlflow/store/rest_store.py", line 68, in _call_endpoint
    json=json_body)
  File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/mlflow/utils/rest_utils.py", line 53, in http_request
    (url, retries))
mlflow.exceptions.MlflowException: API request to http://mlflow-server:5000/api/2.0/preview/mlflow/runs/get failed to return code 200 after 3 tries
And the server-side error is:
Oct 07 17:38:49 icsl6688 mlflow[19938]: [2018-10-07 17:38:49,690] ERROR in app: Exception on /api/2.0/preview/mlflow/runs/get [GET]
Oct 07 17:38:49 icsl6688 mlflow[19938]: Traceback (most recent call last):
Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/flask/app.py", line 2292, in wsgi_app
Oct 07 17:38:49 icsl6688 mlflow[19938]:     response = self.full_dispatch_request()
Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/flask/app.py", line 1815, in full_dispatch_request
Oct 07 17:38:49 icsl6688 mlflow[19938]:     rv = self.handle_user_exception(e)
Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/flask/app.py", line 1718, in handle_user_exception
Oct 07 17:38:49 icsl6688 mlflow[19938]:     reraise(exc_type, exc_value, tb)
Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/flask/_compat.py", line 35, in reraise
Oct 07 17:38:49 icsl6688 mlflow[19938]:     raise value
Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
Oct 07 17:38:49 icsl6688 mlflow[19938]:     rv = self.dispatch_request()
Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
Oct 07 17:38:49 icsl6688 mlflow[19938]:     return self.view_functions[rule.endpoint](**req.view_args)
Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/mlflow/server/handlers.py", line 218, in _get_run
Oct 07 17:38:49 icsl6688 mlflow[19938]:     response_message.run.MergeFrom(_get_store().get_run(request_message.run_uuid).to_proto())
Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/mlflow/store/file_store.py", line 296, in get_run
Oct 07 17:38:49 icsl6688 mlflow[19938]:     run_info = self._get_run_info(run_uuid)
Oct 07 17:38:49 icsl6688 mlflow[19938]:   File "/nfs/site/disks/ibi_tools/workarea/dmarkhas/python/.venv/lib/python3.6/site-packages/mlflow/store/file_store.py", line 310, in _get_run_info
Oct 07 17:38:49 icsl6688 mlflow[19938]:     raise Exception("Run '%s' not found" % run_uuid)
Oct 07 17:38:49 icsl6688 mlflow[19938]: Exception: Run '6c72666f96ff4e0fbd7cb2002f074460' not found
Since mlflow run sets a random runid in the environment, mlflow picks it up and tries to look it up in the tracking server - and fails.
Perhaps I am missing something about the correct use-case of mlflow run but assuming packaged code can contain a remote tracking URI, won’t it always cause it to fail when somebody else tries to run that packaged code with mlflow run?
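One way to check this diagnosis is to print, from inside train.py, what the launcher handed to the script. The sketch below uses only the standard library. It assumes that mlflow run passes the run ID through an environment variable named MLFLOW_RUN_ID (an assumption about the installed MLflow version, so verify it against your installation) alongside MLFLOW_TRACKING_URI. The helper name describe_launch_env is hypothetical.

```python
import os

# Assumed variable names: MLFLOW_RUN_ID is an assumption about how
# `mlflow run` hands the run ID to the script; check your MLflow version.
RUN_ID_VAR = "MLFLOW_RUN_ID"
TRACKING_URI_VAR = "MLFLOW_TRACKING_URI"


def describe_launch_env(environ=None):
    """Return a short report of the tracking-related environment variables."""
    env = os.environ if environ is None else environ
    run_id = env.get(RUN_ID_VAR)
    tracking_uri = env.get(TRACKING_URI_VAR)
    lines = [
        "%s=%s" % (RUN_ID_VAR, run_id if run_id else "<unset>"),
        "%s=%s" % (TRACKING_URI_VAR, tracking_uri if tracking_uri else "<unset>"),
    ]
    if run_id and not tracking_uri:
        # A preset run ID without a tracking URI suggests the run was
        # created in the default local store before the script started.
        lines.append("run ID preset without a tracking URI: "
                     "the run probably lives in the local ./mlruns store")
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe_launch_env())
```

Calling describe_launch_env() at the top of train.py, before mlflow.start_run(), would show whether a run ID was already set when the script started.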
About this issue
- State: closed
- Created 6 years ago
- Reactions: 3
- Comments: 20 (5 by maintainers)
How are y’all specifying the tracking URI? I’d recommend doing so by setting the `MLFLOW_TRACKING_URI` environment variable (which will be propagated to the Python process launched to run your Python script) instead of calling `mlflow.set_tracking_uri("http://some-uri")` within your Python script.
Calling `mlflow.set_tracking_uri("http://some-uri")` within your Python script and then attempting to execute the script via `mlflow run` won’t work unless you set the `MLFLOW_TRACKING_URI` environment variable to the same value before invoking `mlflow run`, e.g. via `MLFLOW_TRACKING_URI=http://some-uri mlflow run ...`.
This is because (as @sonnehansen mentioned) `mlflow run` first creates a run against the currently-configured tracking URI (so by default, the ./mlruns directory on your local filesystem) and then attempts to load that run when `start_run` is called in your Python script. If the tracking URIs are different in the two cases, you may see the abovementioned errors.
In my case it was just a matter of moving the `mlflow.set_tracking_uri` above `mlflow.start_run()`.
So I changed from this:
To this:
I also have the same issue with version 1.8.0. I’ve tried both MLFLOW_TRACKING_URI and mlflow.set_tracking_uri.
When I use MLFLOW_TRACKING_URI, it does not connect to the server but just stores the runs in the project’s mlruns directory.
OK, I think https://github.com/mlflow/mlflow/issues/2649 is a duplicate of this one. But is it really solved? I agree that @smurching’s solution works, but is this really the optimal behaviour?
It seems to me that setting the URI in the script is cleaner than manually setting an environment variable.
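A possible middle ground between the two approaches discussed above is to let the script carry a default tracking URI but apply it only when the script was not launched by `mlflow run`. The sketch below is one way to express that decision with the standard library. It assumes the run ID is handed over in an environment variable named MLFLOW_RUN_ID (an assumption, not confirmed in this thread), and the function name choose_tracking_uri is hypothetical.

```python
import os

RUN_ID_VAR = "MLFLOW_RUN_ID"  # assumed name; verify for your MLflow version
TRACKING_URI_VAR = "MLFLOW_TRACKING_URI"


def choose_tracking_uri(script_default, environ=None):
    """Pick the tracking URI the script should set, or None to set nothing.

    Returning None means: leave MLflow's own configuration untouched.
    """
    env = os.environ if environ is None else environ
    if env.get(TRACKING_URI_VAR):
        # The launcher (or the user) already chose a server; do not override.
        return None
    if env.get(RUN_ID_VAR):
        # A run was created before the script started, presumably in the
        # default local store. Pointing at another server now would make
        # the lookup of that run fail.
        return None
    # Plain `python train.py`: fall back to the URI baked into the script.
    return script_default
```

In the script, the returned value would be passed to `mlflow.set_tracking_uri` before `mlflow.start_run()` only when it is not `None`. This does not make `mlflow run` log to the remote server on its own; for that, the environment variable still has to be set before invoking `mlflow run`, as described earlier in the thread.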
for me this helped:
I experience the same issue using python 3.7.0 and mlflow 0.7.0. I use a remote mlflow server (running in Docker) and then try to invoke `mlflow run`…
From my simple understanding it seems that `mlflow run` creates a new `RUN_ID` on the fly. A subsequent request to find it on the remote server fails because it does not exist there.
Client side:
$ mlflow run .
=== Created directory /tmp/tmpxrufk7z9 for downloading remote URIs passed to arguments of type 'path' ===
=== Running command 'source activate mlflow-5003dfa88d6c9061e9623d2074be952dd30f85f0 && python simple_mlflow.py gini 7' in run with ID '591aa2b23550476ca3af1715e4d9da91' ===
API request to http://osi5398:5000/api/2.0/preview/mlflow/runs/get failed with code 500 != 200, retrying up to 2 more times. API response body: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
Internal Server Error
The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
API request to http://osi5398:5000/api/2.0/preview/mlflow/runs/get failed with code 500 != 200, retrying up to 1 more times. API response body: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
Internal Server Error
The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
API request to http://osi5398:5000/api/2.0/preview/mlflow/runs/get failed with code 500 != 200, retrying up to 0 more times. API response body: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
Internal Server Error
The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
Traceback (most recent call last):
  File "simple_mlflow.py", line 23, in <module>
    with mlflow.start_run():
  File "/opt/apps/miniconda3/envs/mlflow-5003dfa88d6c9061e9623d2074be952dd30f85f0/lib/python3.6/site-packages/mlflow/tracking/fluent.py", line 105, in start_run
    active_run_obj = MlflowClient().get_run(existing_run_uuid)
  File "/opt/apps/miniconda3/envs/mlflow-5003dfa88d6c9061e9623d2074be952dd30f85f0/lib/python3.6/site-packages/mlflow/tracking/client.py", line 37, in get_run
    return self.store.get_run(run_id)
  File "/opt/apps/miniconda3/envs/mlflow-5003dfa88d6c9061e9623d2074be952dd30f85f0/lib/python3.6/site-packages/mlflow/store/rest_store.py", line 126, in get_run
    response_proto = self._call_endpoint(GetRun, req_body)
  File "/opt/apps/miniconda3/envs/mlflow-5003dfa88d6c9061e9623d2074be952dd30f85f0/lib/python3.6/site-packages/mlflow/store/rest_store.py", line 68, in _call_endpoint
    json=json_body)
  File "/opt/apps/miniconda3/envs/mlflow-5003dfa88d6c9061e9623d2074be952dd30f85f0/lib/python3.6/site-packages/mlflow/utils/rest_utils.py", line 53, in http_request
    (url, retries))
mlflow.exceptions.MlflowException: API request to http://osi5398:5000/api/2.0/preview/mlflow/runs/get failed to return code 200 after 3 tries
=== Run (ID '591aa2b23550476ca3af1715e4d9da91') failed ===
server side:
[2018-10-19 11:04:15,093] ERROR in app: Exception on /api/2.0/preview/mlflow/runs/get [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python3.7/site-packages/mlflow/server/handlers.py", line 218, in _get_run
    response_message.run.MergeFrom(_get_store().get_run(request_message.run_uuid).to_proto())
  File "/usr/local/lib/python3.7/site-packages/mlflow/store/file_store.py", line 296, in get_run
    run_info = self._get_run_info(run_uuid)
  File "/usr/local/lib/python3.7/site-packages/mlflow/store/file_store.py", line 310, in _get_run_info
    raise Exception("Run '%s' not found" % run_uuid)
Exception: Run '591aa2b23550476ca3af1715e4d9da91' not found
Looking into the server-side mlruns folder, the `RUN_ID` is not in the list of existing runs:
root@a47d655bc6f6:/# ls mlflow/mlruns/0
03f397a8cd774d5b925c68f3bfd3b4f0  4edc44175ef3435ba15ac161b7a490fc  meta.yaml
1aa362c53ab14188bfb824670e675616  ae1314bf8e2442b9a921d4c49a2b3a99
4b635d741d3345b7b7c0d13173324c1f  e2c64a044f044bd69badd4ba32ffa34f
Facing the same issue. Any progress or workaround? thanks
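The manual `ls` check from the report above can also be scripted. The following is a minimal sketch that assumes the file-based layout visible in that listing, with one directory per experiment and one subdirectory per run ID; the function name find_run_dir is hypothetical.

```python
from pathlib import Path


def find_run_dir(mlruns_root, run_id):
    """Return the directory of `run_id` under `mlruns_root`, or None.

    Assumes the layout <mlruns_root>/<experiment_id>/<run_id>/ as seen in
    the `ls mlflow/mlruns/0` output in this thread.
    """
    root = Path(mlruns_root)
    if not root.is_dir():
        return None
    for experiment_dir in sorted(root.iterdir()):
        candidate = experiment_dir / run_id
        if experiment_dir.is_dir() and candidate.is_dir():
            return candidate
    return None
```

Running it against both the server-side mlruns folder and the client's local ./mlruns directory would show in which of the two stores the run was actually created.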
Apparently `mlflow run` actually sets that environment variable by itself for some reason (which then follows the correct logic of trying to “resume” - but I don’t think it should be setting it in the first place).
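The behaviour described in this thread can be reproduced with a toy model that does not use MLflow at all. The sketch below is not MLflow's implementation: ToyStore, toy_start_run and toy_mlflow_run are hypothetical stand-ins, and the MLFLOW_RUN_ID variable name is an assumption. It only illustrates why creating a run in one store and then "resuming" it from another fails.

```python
import uuid


class RunNotFound(Exception):
    pass


class ToyStore:
    """Stand-in for a tracking store: maps run IDs to run records."""

    def __init__(self, name):
        self.name = name
        self.runs = {}

    def create_run(self):
        run_id = uuid.uuid4().hex
        self.runs[run_id] = {"run_id": run_id, "store": self.name}
        return run_id

    def get_run(self, run_id):
        if run_id not in self.runs:
            raise RunNotFound("Run '%s' not found" % run_id)
        return self.runs[run_id]


def toy_start_run(store, environ):
    """Resume the run named in the environment, else create a new one."""
    existing_run_id = environ.get("MLFLOW_RUN_ID")  # assumed variable name
    if existing_run_id:
        return store.get_run(existing_run_id)
    return store.get_run(store.create_run())


def toy_mlflow_run(launcher_store, script_store):
    """Create a run in the launcher's store, then start the 'script'."""
    run_id = launcher_store.create_run()
    child_env = {"MLFLOW_RUN_ID": run_id}
    return toy_start_run(script_store, child_env)


if __name__ == "__main__":
    local, remote = ToyStore("local"), ToyStore("remote")
    toy_mlflow_run(remote, remote)  # same store on both sides: works
    try:
        toy_mlflow_run(local, remote)  # stores differ: lookup fails
    except RunNotFound as exc:
        print(exc)
```

When both sides use the same store the lookup succeeds, which matches the advice given earlier in the thread to set the tracking URI before invoking `mlflow run` rather than only inside the script.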