mlflow: Running MLflow training with Docker Env doesn't support GPUs
System information
- Have I written custom code (as opposed to using a stock example script provided in MLflow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
- MLflow installed from (source or binary): binary
- MLflow version (run `mlflow --version`): 1.7.0
- Python version: 3.7.4
- npm version, if running the dev UI: N/A
Describe the problem
When running `mlflow run` with an MLproject that looks like:

```yaml
name: nn-training

docker_env:
  image: docker-image-with-cuda

entry_points:
  train:
    parameters:
      config: {type: string}
    command: "python train.py --config {config}"
```
I get the following message:

```
NVIDIA Release 20.02 (build 10315040)
PyTorch Version 1.5.0a0+3bbb36e

Container image Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Copyright (c) 2014-2019 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use 'nvidia-docker run' to start this container; see
https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker .
```
I’ve tested this image locally, and simply passing `--gpus all` to `docker run` makes GPU functionality available. But I can’t find a way to set this flag (or any other useful `docker run` flag) through mlflow.

Since mlflow is built for the ML life-cycle, and training nets requires GPUs, I assume I must be missing something about configuring the Docker environment’s runtime. Can someone point me in the right direction?
Code to reproduce issue
Build any CUDA-enabled image (`nvcr.io/nvidia/pytorch:20.02-py3` in our case), then run `mlflow run` with the `docker_env` image set to that image and no other `docker_env` parameters.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 15 (13 by maintainers)
@ksanjeevan I have commented on your PR; I think it would be great to get that in.
@Zethson I think we can consider specifying default docker arguments to the project file as well. Adding them to the cli is simpler though so I think we should start with that.
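To illustrate the project-file idea, a hypothetical MLproject might carry the extra `docker run` arguments alongside the image. Note that the `run_params` key below is a sketch of the proposal being discussed, not part of the MLproject schema as it exists in MLflow 1.7.0:

```yaml
name: nn-training

docker_env:
  image: docker-image-with-cuda
  # Hypothetical: default arguments forwarded to `docker run`
  run_params:
    gpus: all

entry_points:
  train:
    parameters:
      config: {type: string}
    command: "python train.py --config {config}"
```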
Thanks @tomasatdatabricks. So I’ve created this draft PR to show how it could work. I didn’t go into tests or docs or anything like that, just a straightforward implementation that roughly follows what’s being done for `--param-list` 😃 It’d be awesome if you guys could implement it. If the general idea of the PR looks good to you but you don’t think you’ll be able to get to it, I can try to work on it, though it might take me longer. Let me know!
Cool, let me know if you would be interested in contributing this update; I’d be happy to review your PR. Otherwise we’ll try to get to it soon.
@tritab yeah thanks, I’m going to try this workaround, but with GPU functionality being such a major feature I was hoping there was a way to do this directly with the mlflow API.
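For anyone needing a stopgap until extra `docker run` flags are supported, the workaround amounts to invoking Docker yourself with the GPU flag instead of letting MLflow launch the container. A minimal sketch in Python follows; `build_docker_cmd` is a hypothetical helper (not part of MLflow), and the image name and training command are placeholders:

```python
import subprocess


def build_docker_cmd(image, train_cmd, gpus="all", extra_args=None):
    """Assemble a `docker run` invocation that exposes GPUs to the container.

    MLflow 1.7.0 offers no hook for passing `--gpus`, so this sketch shells
    out to Docker directly with the desired runtime flags.
    """
    cmd = ["docker", "run", "--rm", f"--gpus={gpus}"]
    cmd.extend(extra_args or [])  # e.g. ["--shm-size=8g"]
    cmd.append(image)
    cmd.extend(train_cmd)
    return cmd


cmd = build_docker_cmd(
    "docker-image-with-cuda",
    ["python", "train.py", "--config", "config.yaml"],
)
print(" ".join(cmd))
# Execute with: subprocess.run(cmd, check=True)
```

The run can still be tracked by pointing `MLFLOW_TRACKING_URI` at your tracking server inside the container, though you lose the project-level reproducibility that `mlflow run` provides.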