transformers: Can't Select Specific GPU by TrainingArguments
Environment info
- `transformers` version: 4.8.2
- Platform: Jupyter Notebook on Ubuntu
- Python version: 3.7
- PyTorch version (GPU?): 1.8.0+cu111
- Using GPU in script?: No, via Jupyter Notebook
- Using distributed or parallel set-up in script?: It is distributed, but I don't want that
Who can help
- trainer: @sgugger
- found by git-blame: @philschmid
To reproduce
Using TrainingArguments, I want to set my compute device to only torch.device(type='cuda', index=1).
If I do not set local_rank when initializing TrainingArguments, it computes on both GPUs.
Steps to reproduce the behavior:
```python
from transformers import TrainingArguments, Trainer, EvalPrediction

training_args = TrainingArguments(
    learning_rate=1e-4,
    num_train_epochs=6,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_steps=200,
    output_dir="./training_output",
    overwrite_output_dir=True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=False,
    local_rank=1,
)
```
Then you will get ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
But after I set

```python
import os
os.environ["RANK"] = "1"
```

I get ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set
These errors do not happen if I do not set local_rank when initializing TrainingArguments, even though I do not set any environment variables.
Expected behavior
I want to set my compute device to only torch.device(type='cuda', index=1).
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 26 (1 by maintainers)
In Jupyter Notebook, we can use one of these:
You need to set the variable before launching the Jupyter notebook.
You should use the env variable CUDA_VISIBLE_DEVICES to set the GPUs you want to use. If you have multiple GPUs available, the Trainer will use all of them; that is expected and not a bug.
Referring to all the above solutions, either all my GPUs end up running or I get CUDA device errors. As an alternative, I override the TrainingArguments class, though it might have undiscovered issues.
_setup_devices in TrainingArguments controls the overall device setting. Rather than overriding _setup_devices itself (since it is tied to multiple dependent functions), I manually set the device property and the n_gpu property. In this case, I don't need to prefix my python commands with any os.environ or CUDA_VISIBLE_DEVICES setting for single-GPU use. However, it may still be required if you want to use a selected two or three GPUs out of 4.
This is normal: PyTorch numbers all visible devices from 0 to the device count minus 1. So cuda:0 in PyTorch is the first device you set as available, in this case GPU 2.
It works. I just have to comment out the @torch_required decorator and add import torch at line 1; then I can freely choose whatever GPU I want. Thanks a million.
It's very inconvenient to restart Jupyter Lab/Notebook each time just to change the device. Also, I may want to use several notebooks on different devices. PyTorch Lightning, for example, gives you the freedom to select a device for each run.
Ahhh, thank you. That successfully restricts the GPUs accessed in the notebook.
No, you need to set that environment variable with the launch command, not inside your training script:
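For example (a sketch; the launcher invocation is illustrative, and any command that starts the notebook server works the same way):

```shell
# Export before launching, so the whole Jupyter process, and every notebook
# it serves, only sees physical GPU 1 (which PyTorch will show as cuda:0):
export CUDA_VISIBLE_DEVICES=1
# jupyter notebook        # or inline: CUDA_VISIBLE_DEVICES=1 jupyter notebook
```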