mlflow: MLflow worker timeout when opening UI

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04.5
  • MLflow installed from (source or binary): pip install mlflow
  • MLflow version (run mlflow --version): mlflow, version 0.8.2
  • Python version: Python 3.6.6 :: Anaconda, Inc.
  • npm version (if running the dev UI):
  • Exact command to reproduce: mlflow server --file-store /bigdata/mlflow --host 0.0.0.0

Describe the problem

The MLflow UI shows the Niagara Falls error page with “Oops! Something went wrong” every time I try to open it. I’ve been using MLflow for two months, but it recently started crashing, and now I cannot get the UI to open at all.

Logs

server logs after fresh restart:

[2019-02-26 12:34:36 +0000] [9] [INFO] Starting gunicorn 19.9.0
[2019-02-26 12:34:36 +0000] [9] [INFO] Listening at: http://0.0.0.0:5000 (9)
[2019-02-26 12:34:36 +0000] [9] [INFO] Using worker: sync
[2019-02-26 12:34:36 +0000] [12] [INFO] Booting worker with pid: 12
[2019-02-26 12:34:36 +0000] [14] [INFO] Booting worker with pid: 14
[2019-02-26 12:34:36 +0000] [15] [INFO] Booting worker with pid: 15
[2019-02-26 12:34:36 +0000] [18] [INFO] Booting worker with pid: 18
[2019-02-26 12:35:30 +0000] [9] [CRITICAL] WORKER TIMEOUT (pid:14)
[2019-02-26 12:35:30 +0000] [14] [INFO] Worker exiting (pid: 14)
[2019-02-26 12:35:30 +0000] [28] [INFO] Booting worker with pid: 28

browser console logs when opening UI:

setupAjaxHeaders.js:22 
{_xsrf: "2|a583f945|b32757069a3ea1c54e37f87dba1c1428|1549020795"}
service-worker.js:1 Uncaught (in promise) Error: Request for http://localhost:5000/static-files/static-files/static/css/main.fbf8a477.css returned a response with status 404
    at service-worker.js:1
service-worker.js:1 Uncaught (in promise) Error: Request for http://localhost:5000/static-files/static-files/static/css/main.fbf8a477.css returned a response with status 404
    at service-worker.js:1
jquery.js:9355 POST http://localhost:5000/ajax-api/2.0/preview/mlflow/runs/search net::ERR_EMPTY_RESPONSE
Actions.js:155 XHR failed 
{readyState: 0, getResponseHeader: ƒ, getAllResponseHeaders: ƒ, setRequestHeader: ƒ, overrideMimeType: ƒ, …}
react-dom.production.min.js:151 TypeError: Cannot read property 'getErrorCode' of undefined
    at errorRenderFunc (ExperimentPage.js:122)
    at e.value (RequestStateWrapper.js:51)
    at f (react-dom.production.min.js:131)
    at beginWork (react-dom.production.min.js:138)
    at o (react-dom.production.min.js:176)
    at a (react-dom.production.min.js:176)
    at x (react-dom.production.min.js:182)
    at y (react-dom.production.min.js:181)
    at v (react-dom.production.min.js:181)
    at d (react-dom.production.min.js:180)
AppErrorBoundary.js:19 TypeError: Cannot read property 'getErrorCode' of undefined
    at errorRenderFunc (ExperimentPage.js:122)
    at e.value (RequestStateWrapper.js:51)
    at f (react-dom.production.min.js:131)
    at beginWork (react-dom.production.min.js:138)
    at o (react-dom.production.min.js:176)
    at a (react-dom.production.min.js:176)
    at x (react-dom.production.min.js:182)
    at y (react-dom.production.min.js:181)
    at v (react-dom.production.min.js:181)
    at d (react-dom.production.min.js:180)
:5000/#/experiments/1:1 Uncaught (in promise) 
t {xhr: {…}}

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 38 (4 by maintainers)

Most upvoted comments

Reopening this issue as per community request and reassigning priority to get it into the queue.

mlflow 1.18, year 2021, issue is still here…

@spott @gkonstanty : where are you adding the --gunicorn-opts "--timeout 180" option?

I couldn’t get mlflow ui --gunicorn-opts "--timeout 180" to work either (error: no such option --gunicorn-opts)

But the following worked for me: GUNICORN_CMD_ARGS="--timeout 180" mlflow ui
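The two ways of passing the timeout mentioned above can be sketched as follows. This is a minimal illustration, not verified across MLflow versions: the 180-second value is just an example, and the claim that `mlflow ui` rejects `--gunicorn-opts` while `mlflow server` accepts it comes from this thread.

```shell
# Option 1: environment variable, read directly by gunicorn.
# Works even when the CLI rejects --gunicorn-opts.
export GUNICORN_CMD_ARGS="--timeout 180"
echo "$GUNICORN_CMD_ARGS"
# then launch with the variable in scope:
#   GUNICORN_CMD_ARGS="--timeout 180" mlflow ui

# Option 2: the server flag (reported to work with `mlflow server`):
#   mlflow server --gunicorn-opts "--timeout 180" ...
```

Both routes end up setting the same gunicorn worker timeout; the environment variable is simply the more portable fallback when the flag is unavailable.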

Problem persists on 1.28.0. MLflow is great, but for me this bug renders it practically useless.

mlflow 1.21 same issue…

I have just upgraded the server to run version 1.9.0 (without a postgres backend) and nothing has changed.

Adding --gunicorn-opts "--timeout 180" has helped somewhat, but the number of our experiments is constantly growing, so even 180 seconds will soon not be sufficient. And waiting that long for the results of simple queries is quite annoying.

Could you please check this issue?

Any update on this issue? It still exists on v1.2.0.0

Sorry, @spott, I missed your msg.

I’m adding it to mlflow server:

mlflow server --host 0.0.0.0 -p 5000 --backend-store-uri /mlflow/data/ --default-artifact-root /mlflow/artifacts/ --gunicorn-opts "--timeout 180"

Is there a way for this issue to get even higher priority? I don’t really understand how people use MLflow given that this bug exists. Perhaps it’s because most people aren’t using the UI feature…

The problem is not only with a large number of experiments, but also with a large number of logged experiment metrics, as in my case. So the issue stems from bad UI design, and I suppose it can only be fixed with a big refactoring.

With a large timeout set, our problem seems to be only with the UI generating the experiments table for 1000+ experiments. I wonder if defaulting the experiments sidebar to hidden would help as a short-term fix? Collapsing the sidebar seems to fix the problem (after waiting a few minutes for it to load). Yes, it would break almost immediately when someone clicks to expand it, but a user could still use the UI if they knew the experiment id beforehand.

Maybe these are really two separate issues. One for large runs and one for experiments on the home page?

I got somewhat better performance with this:

mlflow server --backend-store-uri=postgresql://postgres:${RDS_PASSWORD}@${RDS_HOST}:5432/mlflow --default-artifact-root=${ARTIFACT_STORE} --host 0.0.0.0 --port 5000 --gunicorn-opts "--worker-class gevent --threads 3 --workers 3 --timeout 300 --keep-alive 300 --log-level INFO"

This is unexpectedly unpleasant. I did a number of runs with the idea of sorting best-to-worst metrics afterwards. But the UI indeed crashes after more than ~1000 runs…

Moreover, it should only have to load the first 100 runs and show a “load more” button afterwards.

It turns out that the UI simply doesn’t handle too many runs (in my case it starts struggling when mlruns contains more than circa 1000 experiments). Around this threshold the UI becomes unstable (sometimes it crashes, sometimes it works, but it’s never quick and responsive), and eventually there are so many runs that it won’t load at all.

This goes a bit against the philosophy of being able to track all your experiments.

Would using a local db instead of file storage help? Hosting externally is not an option for me.

As a side note: during troubleshooting I discovered that when you move runs into different folders, it’s important to update the artifact_location parameter in the run’s main meta.yaml; otherwise you’ll experience a different type of crash, without a clear warning.
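A minimal sketch of that meta.yaml fix, assuming a file-based store and GNU sed (`sed -i` syntax differs on macOS). The run id, directory layout, and old path below are made up for illustration:

```shell
# Simulate a run directory that was moved: its meta.yaml still points
# at the old artifact location.
RUN_DIR="$(mktemp -d)/mlruns/1/abc123"
mkdir -p "$RUN_DIR"
printf 'artifact_location: file:///old/location/artifacts\nrun_id: abc123\n' > "$RUN_DIR/meta.yaml"

# Rewrite artifact_location to the run's current directory:
sed -i "s|^artifact_location:.*|artifact_location: file://$RUN_DIR/artifacts|" "$RUN_DIR/meta.yaml"

# Confirm the updated value:
grep '^artifact_location:' "$RUN_DIR/meta.yaml"
```

The same edit applies to each moved run; the key point is that the stored path is absolute, so it goes stale whenever the directory moves.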

I managed to fix it (in my setup) by:

  • increasing the number of workers --workers 20 (6 should be enough because it’s exactly how many connections are allowed per domain in modern browsers, 20 for good measure).
  • switching to eventlet

It’s worth noting that the root cause, at least in my setup, has nothing to do with MLflow itself. I’m running it in Kubernetes (GKE), and to access it from my machine I’m using kubectl port-forward. It looks like the way kubectl proxies requests exhausts all the workers, and since they are synchronous by default, no new connections can be accepted. Supposedly port-forwarding isn’t compatible with the default sync Gunicorn worker class.
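A dry-run sketch combining the two changes above (eventlet worker class plus more workers). The paths and worker counts are examples, and the eventlet package must be installed in the same environment as mlflow for `--worker-class eventlet` to load. The command is echoed rather than executed so it can be inspected first:

```shell
# Gunicorn options: async eventlet workers so slow clients (e.g. a
# kubectl port-forward proxy) can't exhaust the worker pool.
GOPTS="--worker-class eventlet --workers 20 --timeout 180"

# Print the full launch command; drop the leading 'echo' to run it.
echo mlflow server --host 0.0.0.0 --port 5000 \
  --backend-store-uri /mlflow/data/ \
  --default-artifact-root /mlflow/artifacts/ \
  --gunicorn-opts "$GOPTS"
```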

@here Could you describe your setup a bit, please? Are you using kubectl port-forward or anything similar to it?

Well I found one workaround here. Try mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://DB_USER:DB_PASSWD@DB_ENDPOINT:5432/DB_NAME --default-artifact-root s3://S3_BUCKET_NAME --gunicorn-opts "--timeout 0"

This will wait until the data transfer finishes and the page loads. I had to delete a few of my experiments from my S3 bucket to make it load a little faster, which I don’t think is a very welcome workaround.

Will have to wait for a permanent fix to this.

mlflow 1.26.1 still same issue! Can you please provide a workaround or something? Even setting gunicorn-opts is not helping.

I face the same issue with version 1.10.0 and the file system backend. All files are generated as expected, but the same “WORKER TIMEOUT” message returns when I try to access individual records (i.e., clicking a date’s hyperlink).

It is the same on 1.7 without the postgres backend. Could this issue be re-opened?