transformers: Can't Select Specific GPU by TrainingArguments

Environment info

transformers version: 4.8.2
Platform: Jupyter Notebook on Ubuntu
Python version: 3.7
PyTorch version (GPU?): 1.8.0+cu111
Using GPU in script?: No, By Jupyter Notebook
Using distributed or parallel set-up in script?: It is distributed, but I don't want that

Who can help

trainer: @sgugger; found by git-blame: @philschmid

To reproduce

Using TrainingArguments, I want to restrict computation to torch.device(type='cuda', index=1) only.

If I do not set local_rank when initializing TrainingArguments, training runs on both GPUs.

Steps to reproduce the behavior:

from transformers import TrainingArguments, Trainer, EvalPrediction

training_args = TrainingArguments(
    learning_rate=1e-4,
    num_train_epochs=6,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_steps=200,
    output_dir="./training_output",
    overwrite_output_dir=True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=False,
    local_rank=1,
)

Then you will get ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

But after I set

import os
os.environ["RANK"]="1"

I get ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set

These errors do not occur if I leave local_rank unset when initializing TrainingArguments, even though I don't set any environment variables.
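Both errors come from the env:// rendezvous that torch.distributed uses by default, which reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the environment. A minimal sketch of a pre-flight check is below; the helper name missing_rendezvous_vars is hypothetical and is not part of transformers or PyTorch.

```python
import os

# Variables read by torch.distributed's default env:// init method.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def missing_rendezvous_vars(env=None):
    """Return the env:// variables that are not set in `env` (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if name not in env]


# Mirrors the situation in the report: only RANK was set by hand.
print(missing_rendezvous_vars({"RANK": "1"}))
# ['MASTER_ADDR', 'MASTER_PORT', 'WORLD_SIZE']
```

Setting all four variables by hand would start a distributed run rather than pin training to one GPU, so local_rank is probably not the right knob for this goal.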

Expected behavior

I want to restrict computation to torch.device(type='cuda', index=1) only.

About this issue

State: closed
Created 3 years ago
Comments: 26 (1 by maintainers)

Most upvoted comments

In Jupyter Notebook, we can use one of these:

import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"

%env CUDA_DEVICE_ORDER=PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES=0

+11

ymoslem on Apr 25, 2022
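Setting these variables from inside a notebook only takes effect if it happens before CUDA is initialized in that kernel. A minimal sketch, assuming it runs as the first cell before torch is imported; restrict_visible_gpus is a hypothetical helper name, not a transformers or PyTorch function.

```python
import os
import sys


def restrict_visible_gpus(physical_indices):
    """Expose only the given physical GPUs to CUDA in this process.

    Hypothetical helper. It has an effect only if it runs before CUDA is
    initialized; calling it before importing torch is the simplest way
    to guarantee that.
    """
    if "torch" in sys.modules:
        print("warning: torch is already imported; restart the kernel "
              "if CUDA was already initialized")
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in physical_indices)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# First cell of the notebook: keep only physical GPU 1 visible.
restrict_visible_gpus([1])
```

If the kernel has already touched the GPU, changing the variable afterwards is unlikely to have any effect, which may explain why some replies in this thread report that it did not work for them.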

You need to set the variable before launching the Jupyter notebook:

CUDA_VISIBLE_DEVICES="0" jupyter notebook

+10

sgugger on Oct 27, 2021

You should use the env variable CUDA_VISIBLE_DEVICES to set the GPUs you want to use. If you have multiple GPUs available, the Trainer will use all of them; that is expected and not a bug.

sgugger on Jul 7, 2021

None of the solutions above worked for me: either all my GPUs were still used or I got CUDA device errors. As an alternative, I overrode the TrainingArguments class. It may have undiscovered issues, though.

Background: I have more than one GPU. With the Hugging Face Trainer, all devices are involved in training.
Problem: The Trainer seems to use DDP after checking the device and n_gpu properties of TrainingArguments, and _setup_devices in TrainingArguments controls the overall device setting.
Temporary remedy: Instead of overriding _setup_devices (since several other functions depend on it), I override the device and n_gpu properties directly. With this, I don't need to set os.environ or prefix the python command with CUDA_VISIBLE_DEVICES to use a single GPU. However, that may still be required if you want to use two or three selected GPUs out of 4.

import torch
from transformers import TrainingArguments
# torch_required lives in transformers.file_utils in 4.8.x; drop the decorator
# if your release no longer provides it.
from transformers.file_utils import torch_required


class customTrainingArguments(TrainingArguments):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    @property
    @torch_required
    def device(self) -> "torch.device":
        """
        The device used by this process.
        Set the index to the GPU you want to use.
        """
        return torch.device("cuda:3")

    @property
    @torch_required
    def n_gpu(self):
        """
        The number of GPUs used by this process.
        Note:
            This will only be greater than one when you have multiple GPUs available but are not using distributed
            training. For distributed training, it will always be 1.
        """
        # The upstream property calls `self._setup_devices` here:
        # _ = self._setup_devices
        # That call is skipped and the count is set to one manually.
        self._n_gpu = 1
        return self._n_gpu

kimcando on May 21, 2022

This is normal: PyTorch numbers all visible devices from 0 to the number of visible devices minus 1. So cuda:0 in PyTorch is the first device you made visible, in this case GPU 2.

sgugger on Sep 2, 2021
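The renumbering described above can be sketched in plain Python. This assumes CUDA_VISIBLE_DEVICES holds a comma-separated list of integer indices (it can also hold device UUIDs, which this sketch does not handle); visible_device_map is a hypothetical helper name.

```python
def visible_device_map(cuda_visible_devices):
    """Map PyTorch's logical device index to the physical GPU index.

    Hypothetical helper; assumes integer indices separated by commas.
    """
    physical = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    return dict(enumerate(physical))


print(visible_device_map("2"))    # {0: 2}
print(visible_device_map("2,3"))  # {0: 2, 1: 3}
```

With CUDA_VISIBLE_DEVICES="2", the only valid device inside the process is cuda:0, and it refers to physical GPU 2.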

(Quoting kimcando's customTrainingArguments workaround above.)

It works. I just have to comment out the @torch_required and add import torch at line 1, then I can freely choose whatever GPU I want. Thanks a million.

shivanraptor on Mar 29, 2023

You need to set the variable before launching the jupyter notebook
CUDA_VISIBLE_DEVICES="0" jupyter notebook

It's very inconvenient to restart Jupyter Lab/Notebook every time just to change the device. Also, I may want to run several notebooks on different devices. PyTorch Lightning, for example, gives you the freedom to select the device for each run.

prohor33 on Jan 1, 2022

Ahhh, thank you. That successfully restricts the GPUs accessed in the notebook.

evan-person on Oct 27, 2021

No, you need to set that environment variable with the launch command, not inside your training script:

CUDA_VISIBLE_DEVICES="0" python main.py

sgugger on Oct 27, 2021
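The same launch-time setting can be applied from Python by starting the training script in a child process with its own environment. This is only a sketch: run_on_gpu is a hypothetical helper, and a one-line child command stands in for `python main.py` so the example runs without a GPU.

```python
import os
import subprocess
import sys


def run_on_gpu(args, gpu):
    """Run a command in a child process that can only see one physical GPU."""
    # Copy the parent environment and override only CUDA_VISIBLE_DEVICES.
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
    return subprocess.run(args, env=env, capture_output=True, text=True, check=True)


# Stand-in for `python main.py`: the child just reports what it was given.
child = [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
print(run_on_gpu(child, 0).stdout.strip())  # 0
```

Because each child gets its own environment, several runs can be pinned to different GPUs without restarting the parent process.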