mlflow: MLflow worker timeout when opening UI

System information

  • Have I written custom code (as opposed to using a stock example script provided in MLflow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04.5
  • MLflow installed from (source or binary): pip install mlflow
  • MLflow version (run mlflow --version): mlflow, version 0.8.2
  • Python version: Python 3.6.6 :: Anaconda, Inc.
  • npm version (if running the dev UI):
  • Exact command to reproduce: mlflow server --file-store /bigdata/mlflow --host 0.0.0.0

Describe the problem

The MLflow UI shows the Niagara Falls error page with “Oops! Something went wrong” every time I try to open it. I’ve been using MLflow for two months, but it recently started crashing, and now I cannot get the UI to open at all.

Logs

server logs after fresh restart:

[2019-02-26 12:34:36 +0000] [9] [INFO] Starting gunicorn 19.9.0
[2019-02-26 12:34:36 +0000] [9] [INFO] Listening at: http://0.0.0.0:5000 (9)
[2019-02-26 12:34:36 +0000] [9] [INFO] Using worker: sync
[2019-02-26 12:34:36 +0000] [12] [INFO] Booting worker with pid: 12
[2019-02-26 12:34:36 +0000] [14] [INFO] Booting worker with pid: 14
[2019-02-26 12:34:36 +0000] [15] [INFO] Booting worker with pid: 15
[2019-02-26 12:34:36 +0000] [18] [INFO] Booting worker with pid: 18
[2019-02-26 12:35:30 +0000] [9] [CRITICAL] WORKER TIMEOUT (pid:14)
[2019-02-26 12:35:30 +0000] [14] [INFO] Worker exiting (pid: 14)
[2019-02-26 12:35:30 +0000] [28] [INFO] Booting worker with pid: 28

browser console logs when opening UI:

setupAjaxHeaders.js:22 
{_xsrf: "2|a583f945|b32757069a3ea1c54e37f87dba1c1428|1549020795"}
service-worker.js:1 Uncaught (in promise) Error: Request for http://localhost:5000/static-files/static-files/static/css/main.fbf8a477.css returned a response with status 404
    at service-worker.js:1
service-worker.js:1 Uncaught (in promise) Error: Request for http://localhost:5000/static-files/static-files/static/css/main.fbf8a477.css returned a response with status 404
    at service-worker.js:1
jquery.js:9355 POST http://localhost:5000/ajax-api/2.0/preview/mlflow/runs/search net::ERR_EMPTY_RESPONSE
Actions.js:155 XHR failed 
{readyState: 0, getResponseHeader: ƒ, getAllResponseHeaders: ƒ, setRequestHeader: ƒ, overrideMimeType: ƒ, …}
react-dom.production.min.js:151 TypeError: Cannot read property 'getErrorCode' of undefined
    at errorRenderFunc (ExperimentPage.js:122)
    at e.value (RequestStateWrapper.js:51)
    at f (react-dom.production.min.js:131)
    at beginWork (react-dom.production.min.js:138)
    at o (react-dom.production.min.js:176)
    at a (react-dom.production.min.js:176)
    at x (react-dom.production.min.js:182)
    at y (react-dom.production.min.js:181)
    at v (react-dom.production.min.js:181)
    at d (react-dom.production.min.js:180)
AppErrorBoundary.js:19 TypeError: Cannot read property 'getErrorCode' of undefined
    at errorRenderFunc (ExperimentPage.js:122)
    at e.value (RequestStateWrapper.js:51)
    at f (react-dom.production.min.js:131)
    at beginWork (react-dom.production.min.js:138)
    at o (react-dom.production.min.js:176)
    at a (react-dom.production.min.js:176)
    at x (react-dom.production.min.js:182)
    at y (react-dom.production.min.js:181)
    at v (react-dom.production.min.js:181)
    at d (react-dom.production.min.js:180)
:5000/#/experiments/1:1 Uncaught (in promise) 
t {xhr: {…}}

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 38 (4 by maintainers)

Most upvoted comments

Reopening this issue as per community request and reassigning priority to get it into the queue.

mlflow 1.18, year 2021, issue is still here…

@spott @gkonstanty : where are you adding the --gunicorn-opts "--timeout 180" option?

I couldn’t get mlflow ui --gunicorn-opts "--timeout 180" to work either (error: no such option --gunicorn-opts)

But the following worked for me: GUNICORN_CMD_ARGS="--timeout 180" mlflow ui
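The two ways of passing the timeout mentioned above can be sketched as follows. This is a minimal illustration, not verified across MLflow versions: the 180-second value is just an example, and the claim that `mlflow ui` rejects `--gunicorn-opts` while `mlflow server` accepts it comes from this thread.

```shell
# Option 1: environment variable, read directly by gunicorn.
# Works even when the CLI rejects --gunicorn-opts.
export GUNICORN_CMD_ARGS="--timeout 180"
echo "$GUNICORN_CMD_ARGS"
# then launch with the variable in scope:
#   GUNICORN_CMD_ARGS="--timeout 180" mlflow ui

# Option 2: the server flag (reported to work with `mlflow server`):
#   mlflow server --gunicorn-opts "--timeout 180" ...
```

Both routes end up setting the same gunicorn worker timeout; the environment variable is simply the more portable fallback when the flag is unavailable.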

Problem persists on 1.28.0. MLflow is great, but for me this bug renders it practically useless.

mlflow 1.21 same issue…

I have just upgraded the server to run version 1.9.0 (without a postgres backend) and nothing has changed.

Adding --gunicorn-opts "--timeout 180" has helped somewhat, but the number of our experiments is constantly growing, so even 180 seconds will soon not be sufficient. And waiting that long for the results of simple queries is quite annoying.

Could you please check this issue?

Any update on this issue? It still exists on v1.2.0.0

Sorry, @spott, I missed your msg.

I’m adding it to mlflow server:

mlflow server --host 0.0.0.0 -p 5000 --backend-store-uri /mlflow/data/ --default-artifact-root /mlflow/artifacts/ --gunicorn-opts "--timeout 180"

Is there a way for this issue to get even higher priority? I don’t really understand how people use MLflow given that this bug exists. Perhaps it’s because most people aren’t using the UI feature…

The problem is not only with a large number of experiments, but also with a large number of logged experiment metrics, as in my case. So the issue stems from bad UI design, and I suppose it can only be fixed with a big refactoring.

With a large timeout set, our problem seems to be only with the UI generating the experiments table for 1000+ experiments. I wonder if defaulting the experiments sidebar to hidden would help as a short-term fix? Collapsing the sidebar seems to fix the problem (after waiting a few minutes for it to load). Yes, it would break almost immediately when someone clicks to expand it, but a user could still use the UI if they knew the experiment id beforehand.

Maybe these are really two separate issues. One for large runs and one for experiments on the home page?

I got somewhat better performance with this:

mlflow server --backend-store-uri=postgresql://postgres:${RDS_PASSWORD}@${RDS_HOST}:5432/mlflow --default-artifact-root=${ARTIFACT_STORE} --host 0.0.0.0 --port 5000 --gunicorn-opts "--worker-class gevent --threads 3 --workers 3 --timeout 300 --keep-alive 300 --log-level INFO"

This is unexpectedly unpleasant. I did a number of runs with the idea of sorting best-to-worst metrics afterwards. But the UI indeed crashes after more than ~1000 runs…

Moreover, it should only have to load the first 100 runs and show a “load more” button afterwards.

It turns out that the UI simply doesn’t handle too many runs (in my case it starts struggling when mlruns contains more than circa 1000 experiments). Around this threshold the UI becomes unstable (sometimes it crashes, sometimes it works, but it’s never quick and responsive), and eventually there are so many runs that it won’t load at all.

This goes a bit against the philosophy of being able to track all your experiments.

Would using a local db instead of file storage help? Hosting externally is not an option for me.

As a side note: during troubleshooting I discovered that when you move runs into different folders, it’s important to update the artifact_location parameter in the run’s main meta.yaml; otherwise you’ll experience a different type of crash, without a clear warning.
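A minimal sketch of that meta.yaml fix, assuming a file-based store and GNU sed (`sed -i` syntax differs on macOS). The run id, directory layout, and old path below are made up for illustration:

```shell
# Simulate a run directory that was moved: its meta.yaml still points
# at the old artifact location.
RUN_DIR="$(mktemp -d)/mlruns/1/abc123"
mkdir -p "$RUN_DIR"
printf 'artifact_location: file:///old/location/artifacts\nrun_id: abc123\n' > "$RUN_DIR/meta.yaml"

# Rewrite artifact_location to the run's current directory:
sed -i "s|^artifact_location:.*|artifact_location: file://$RUN_DIR/artifacts|" "$RUN_DIR/meta.yaml"

# Confirm the updated value:
grep '^artifact_location:' "$RUN_DIR/meta.yaml"
```

The same edit applies to each moved run; the key point is that the stored path is absolute, so it goes stale whenever the directory moves.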

I managed to fix it (in my setup) by:

  • increasing the number of workers --workers 20 (6 should be enough because it’s exactly how many connections are allowed per domain in modern browsers, 20 for good measure).
  • switching to eventlet

It’s worth noting that the root cause, at least in my setup, has nothing to do with MLflow itself. I’m running it in Kubernetes (GKE), and to access it from my machine I’m using kubectl port-forward. It looks like the way kubectl proxies requests exhausts all the workers, and since they are synchronous by default, no new connections can be accepted. Supposedly port-forwarding isn’t compatible with the default sync Gunicorn worker class.
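A dry-run sketch combining the two changes above (eventlet worker class plus more workers). The paths and worker counts are examples, and the eventlet package must be installed in the same environment as mlflow for `--worker-class eventlet` to load. The command is echoed rather than executed so it can be inspected first:

```shell
# Gunicorn options: async eventlet workers so slow clients (e.g. a
# kubectl port-forward proxy) can't exhaust the worker pool.
GOPTS="--worker-class eventlet --workers 20 --timeout 180"

# Print the full launch command; drop the leading 'echo' to run it.
echo mlflow server --host 0.0.0.0 --port 5000 \
  --backend-store-uri /mlflow/data/ \
  --default-artifact-root /mlflow/artifacts/ \
  --gunicorn-opts "$GOPTS"
```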

@here Could you describe your setup a bit, please? Are you using kubectl port-forward or anything similar to it?

Well I found one workaround here. Try mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://DB_USER:DB_PASSWD@DB_ENDPOINT:5432/DB_NAME --default-artifact-root s3://S3_BUCKET_NAME --gunicorn-opts "--timeout 0"

This will wait until the data transfer finishes and the page loads. I had to delete a few of my experiments from my S3 bucket to make it load a little faster, which I don’t think is a very welcome workaround.

Will have to wait for a permanent fix to this.

mlflow 1.26.1 still same issue! Can you please provide a workaround or something? Even setting gunicorn-opts is not helping.

I face the same issue with version 1.10.0 and the file system backend. All files are generated as expected, but the same “WORKER TIMEOUT” message returns when I try to access individual records (i.e., clicking a date’s hyperlink).

It is the same on 1.7 without the postgres backend. Could this issue be re-opened?