mlflow: Running MLflow training with Docker Env doesn't support GPUs
System information
- Have I written custom code (as opposed to using a stock example script provided in MLflow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18.04
- MLflow installed from (source or binary): binary
- MLflow version (run `mlflow --version`): 1.7.0
- Python version: 3.7.4
- npm version, if running the dev UI: N/A
Describe the problem
When running `mlflow run` with an MLproject that looks like:

```yaml
name: nn-training

docker_env:
  image: docker-image-with-cuda

entry_points:
  train:
    parameters:
      config: {type: string}
    command: "python train.py --config {config}"
```
I get the following message:

```
NVIDIA Release 20.02 (build 10315040)
PyTorch Version 1.5.0a0+3bbb36e

Container image Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Copyright (c) 2014-2019 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use 'nvidia-docker run' to start this container; see
https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker .
```
I’ve tested this image locally, and simply passing `--gpus all` to `docker run` makes GPU functionality available. But I can’t find a way to set this flag (or any other useful `docker run` flag) through mlflow.

Since mlflow is built for the ML life-cycle, and training nets requires GPUs, I assume I must be missing something about configuring the Docker environment’s runtime. Can someone point me in the right direction?
Code to reproduce issue
Build any CUDA-enabled image (`nvcr.io/nvidia/pytorch:20.02-py3` in our case), then run `mlflow run` with the `docker_env` image set to that image and no other `docker_env` parameters.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 15 (13 by maintainers)
@ksanjeevan I have commented on your PR; I think it would be great to get that in.
@Zethson I think we can consider specifying default docker arguments to the project file as well. Adding them to the cli is simpler though so I think we should start with that.
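To illustrate the project-file idea, a hypothetical MLproject might carry the extra `docker run` arguments alongside the image. Note that the `run_params` key below is a sketch of the proposal being discussed, not part of the MLproject schema as it exists in MLflow 1.7.0:

```yaml
name: nn-training

docker_env:
  image: docker-image-with-cuda
  # Hypothetical: default arguments forwarded to `docker run`
  run_params:
    gpus: all

entry_points:
  train:
    parameters:
      config: {type: string}
    command: "python train.py --config {config}"
```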
Thanks @tomasatdatabricks. So I’ve created this draft PR to show how it could work. I didn’t go into tests or docs or anything like that, just a straightforward implementation that roughly follows what’s being done for `--param-list` 😃 It’d be awesome if you guys could implement it. If the general idea of the PR looks good to you but you don’t think you’ll be able to get to it, I can try to work on it, though it might take me longer. Let me know!
Cool, let me know if you would be interested in contributing this update; I’d be happy to review your PR. Otherwise we’ll try to get to it soon.
@tritab yeah thanks, I’m going to try this workaround, but with GPU functionality being such a major feature I was hoping there was a way to do this directly with the mlflow API.
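For anyone needing a stopgap until extra `docker run` flags are supported, the workaround amounts to invoking Docker yourself with the GPU flag instead of letting MLflow launch the container. A minimal sketch in Python follows; `build_docker_cmd` is a hypothetical helper (not part of MLflow), and the image name and training command are placeholders:

```python
import subprocess


def build_docker_cmd(image, train_cmd, gpus="all", extra_args=None):
    """Assemble a `docker run` invocation that exposes GPUs to the container.

    MLflow 1.7.0 offers no hook for passing `--gpus`, so this sketch shells
    out to Docker directly with the desired runtime flags.
    """
    cmd = ["docker", "run", "--rm", f"--gpus={gpus}"]
    cmd.extend(extra_args or [])  # e.g. ["--shm-size=8g"]
    cmd.append(image)
    cmd.extend(train_cmd)
    return cmd


cmd = build_docker_cmd(
    "docker-image-with-cuda",
    ["python", "train.py", "--config", "config.yaml"],
)
print(" ".join(cmd))
# Execute with: subprocess.run(cmd, check=True)
```

The run can still be tracked by pointing `MLFLOW_TRACKING_URI` at your tracking server inside the container, though you lose the project-level reproducibility that `mlflow run` provides.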