mlflow: [BUG] Permission error when writing to /workspace

Willingness to contribute

No. I cannot contribute a bug fix at this time.

MLflow version

1.27.0

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 22.04.1 LTS
Python version: 3.9 (Mambaforge)
yarn version, if running the dev UI: n/a

Describe the problem

After terminating an unfinished run (while writing and debugging code), I get the following error in Python 3.9. The only solution I have found so far is to remove the mlruns/ directory; it appears that unfinished runs leave behind some permission issue. However, I cannot always do this because, for example, I may have another training run in progress while I am debugging code that does not always run to completion.
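Instead of deleting the shared mlruns/ directory, one possible workaround (an assumption, not something confirmed in this thread) is to give each debugging session its own throwaway tracking directory, so an in-progress training run keeps its store untouched. MLflow reads the `MLFLOW_TRACKING_URI` environment variable when no tracking URI has been set in code; the helper name `use_scratch_tracking_dir` below is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def use_scratch_tracking_dir(prefix: str = "mlruns-debug-") -> str:
    """Create a throwaway directory and point MLflow tracking at it.

    Hypothetical helper. It only sets the MLFLOW_TRACKING_URI environment
    variable; an explicit mlflow.set_tracking_uri() call in the same process
    takes precedence over it.
    """
    # mkdtemp creates a fresh, uniquely named directory owned by the current user.
    scratch = Path(tempfile.mkdtemp(prefix=prefix)).resolve()
    uri = scratch.as_uri()  # e.g. file:///tmp/mlruns-debug-<random>
    os.environ["MLFLOW_TRACKING_URI"] = uri
    return uri
```

Calling `use_scratch_tracking_dir()` at the top of a debugging script, before any MLflow call, keeps that script's runs out of the mlruns/ directory used by the training job, so the scratch directory can be deleted afterwards without touching the other run. This sidesteps the symptom; it does not explain where the /workspace path comes from.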

Is this a known issue with some known fix?

Tracking information

MLflow version: 1.27.0
Tracking URI: file:///home/local/dev/mlruns
Artifact URI: file:///workspace/mlruns/0/5b986cb300594fe2b2730742acf6953a/artifacts
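The tracking URI and the artifact URI above point at different roots (/home/local/dev/mlruns versus /workspace/mlruns). A small stdlib-only check like the following sketch can flag such a mismatch before any artifact is logged. The function name is hypothetical, and it assumes both values are local `file://` URIs on POSIX paths.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def artifact_outside_tracking_root(tracking_uri: str, artifact_uri: str) -> bool:
    """Return True if a local artifact URI lies outside the tracking directory.

    Hypothetical helper; handles file:// URIs on POSIX paths only.
    """
    tracking = urlparse(tracking_uri)
    artifact = urlparse(artifact_uri)
    if tracking.scheme != "file" or artifact.scheme != "file":
        raise ValueError("only file:// URIs are handled by this sketch")
    root = PurePosixPath(tracking.path)
    target = PurePosixPath(artifact.path)
    # PurePath.is_relative_to requires Python 3.9+, matching the reporter's version.
    return not target.is_relative_to(root)
```

With the two URIs from this report, the check returns `True`, since /workspace/mlruns/... is not under /home/local/dev/mlruns.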

Code to reproduce issue

Cannot disclose code.

Stack trace

  File "/home/local/mambaforge3/envs/tf_dev/lib/python3.9/site-packages/mlflow/tracking/fluent.py", line 846, in log_dict
    MlflowClient().log_dict(run_id, dictionary, artifact_file)
  File "/home/local/mambaforge3/envs/tf_dev/lib/python3.9/site-packages/mlflow/tracking/client.py", line 1094, in log_dict
    json.dump(dictionary, f, indent=2)
  File "/home/local/mambaforge3/envs/tf_dev/lib/python3.9/contextlib.py", line 126, in __exit__
    next(self.gen)
  File "/home/local/mambaforge3/envs/tf_dev/lib/python3.9/site-packages/mlflow/tracking/client.py", line 1020, in _log_artifact_helper
    self.log_artifact(run_id, tmp_path, artifact_dir)
  File "/home/local/mambaforge3/envs/tf_dev/lib/python3.9/site-packages/mlflow/tracking/client.py", line 955, in log_artifact
    self._tracking_client.log_artifact(run_id, local_path, artifact_path)
  File "/home/local/mambaforge3/envs/tf_dev/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py", line 365, in log_artifact
    artifact_repo.log_artifact(local_path, artifact_path)
  File "/home/local/mambaforge3/envs/tf_dev/lib/python3.9/site-packages/mlflow/store/artifact/local_artifact_repo.py", line 37, in log_artifact
    mkdir(artifact_dir)
  File "/home/local/mambaforge3/envs/tf_dev/lib/python3.9/site-packages/mlflow/utils/file_utils.py", line 119, in mkdir
    raise e
  File "/home/local/mambaforge3/envs/tf_dev/lib/python3.9/site-packages/mlflow/utils/file_utils.py", line 116, in mkdir
    os.makedirs(target)
  File "/home/local/mambaforge3/envs/tf_dev/lib/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/local/mambaforge3/envs/tf_dev/lib/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/local/mambaforge3/envs/tf_dev/lib/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  [Previous line repeated 1 more time]
  File "/home/local/mambaforge3/envs/tf_dev/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/workspace'
python-BaseException

Process finished with exit code 130 (interrupted by signal 2: SIGINT)

Other info / logs

I run MLflow locally on a Linux desktop machine to debug experimental TensorFlow code. The issue appears after previous runs have been terminated abruptly.

What component(s) does this bug affect?

area/artifacts: Artifact stores and artifact logging
area/build: Build and test infrastructure for MLflow
area/docs: MLflow documentation pages
area/examples: Example code
area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
area/models: MLmodel format, model serialization/deserialization, flavors
area/pipelines: Pipelines, Pipeline APIs, Pipeline configs, Pipeline Templates
area/projects: MLproject format, project running backends
area/scoring: MLflow Model server, model deployment tools, Spark UDFs
area/server-infra: MLflow Tracking server backend
area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support

What language(s) does this bug affect?

language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages

What integration(s) does this bug affect?

integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations

About this issue

State: closed
Created 2 years ago
Comments: 33 (14 by maintainers)

Most upvoted comments

This happens because the artifact URI differs between the two cases. After, e.g., deleting the mlruns/ folder in the current directory and “starting from scratch”, the artifact URI is, for example, file:///home/localuser/dev/mlruns/0/b59819c7deb14880b7168f91f857f224/artifacts. When the previous run was not terminated correctly, it is instead file:///workspace/mlruns/0/ea5634b7b072420f89415601b091189d/artifacts.
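One way to see where the /workspace location is recorded is to inspect the metadata that MLflow's file store keeps on disk. The sketch below assumes the FileStore layout in which each experiment directory (mlruns/<experiment_id>/meta.yaml) stores an `artifact_location` and each run directory (mlruns/<experiment_id>/<run_id>/meta.yaml) stores an `artifact_uri`. It matches lines naively instead of parsing YAML, only reads files, and the function name `find_foreign_locations` is hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse

# Keys assumed to hold locations in FileStore meta.yaml files.
_LOCATION_KEYS = ("artifact_location:", "artifact_uri:")


def find_foreign_locations(mlruns_dir: str) -> list[tuple[str, str]]:
    """List (meta.yaml path, location) pairs that point outside mlruns_dir.

    Hypothetical, read-only diagnostic: only file:// locations are checked,
    and the YAML is matched line by line rather than parsed.
    """
    root = Path(mlruns_dir).resolve()
    hits = []
    # Experiment-level (depth 1) and run-level (depth 2) metadata files.
    for meta in sorted([*root.glob("*/meta.yaml"), *root.glob("*/*/meta.yaml")]):
        for line in meta.read_text().splitlines():
            stripped = line.strip()
            for key in _LOCATION_KEYS:
                if not stripped.startswith(key):
                    continue
                value = stripped[len(key):].strip().strip("'\"")
                parsed = urlparse(value)
                if parsed.scheme != "file":
                    continue
                # Report locations that are not inside the local mlruns tree.
                if not Path(parsed.path).is_relative_to(root):
                    hits.append((str(meta.relative_to(root)), value))
    return hits
```

If the output listed experiment 0's own meta.yaml with a /workspace location, that would suggest the path is stored at the experiment level rather than produced by the interrupted run; this is a hypothesis to check, not a confirmed cause.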

This is strange; I have no idea why this happens. Could you share the full code you use (with sensitive information masked), or create a script that reproduces this behavior?

harupy on Aug 20, 2022