transformers: Can't Select Specific GPU by TrainingArguments

Environment info

  • transformers version: 4.8.2
  • Platform: Jupyter Notebook on Ubuntu
  • Python version: 3.7
  • PyTorch version (GPU?): 1.8.0+cu111
  • Using GPU in script?: No, by Jupyter Notebook
  • Using distributed or parallel set-up in script?: It is distributed, but I don’t want that

Who can help

To reproduce

Using TrainingArguments, I want to restrict my compute device to torch.device(type='cuda', index=1).

If I do not set local_rank when initializing TrainingArguments, training runs on both GPUs.

Steps to reproduce the behavior:

from transformers import TrainingArguments, Trainer, EvalPrediction

training_args = TrainingArguments(
    learning_rate=1e-4,
    num_train_epochs=6,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    logging_steps=200,
    output_dir="./training_output",
    overwrite_output_dir=True,
    # The next line is important to ensure the dataset labels are properly passed to the model
    remove_unused_columns=False,
    local_rank= 1
)

Then you will get ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

But after I set

import os
os.environ["RANK"]="1"

I get ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set

These errors do not occur if I leave local_rank unset when initializing TrainingArguments, even though I don’t set any environment variables.
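
The two errors above come from the env:// rendezvous of torch.distributed, which reads its configuration from environment variables. As a minimal sketch (the helper name `missing_rendezvous_vars` is hypothetical and not part of PyTorch or transformers), the following lists which of those variables are still unset. It suggests why setting RANK alone only surfaces the next missing variable:

```python
import os

# Environment variables read by the env:// init method of torch.distributed.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def missing_rendezvous_vars(environ=None):
    """Return the env:// variables that are not set in `environ`.

    Hypothetical helper for illustration; defaults to the real process
    environment when no mapping is passed.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if name not in env]


if __name__ == "__main__":
    # Simulates the state after os.environ["RANK"] = "1" on a clean environment.
    print(missing_rendezvous_vars({"RANK": "1"}))
```

Judging by the error messages, passing local_rank appears to switch TrainingArguments into distributed initialization, so it is probably not the right knob for simply picking one GPU.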

Expected behavior

I want to restrict my compute device to torch.device(type='cuda', index=1).

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 26 (1 by maintainers)

Most upvoted comments

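In a Jupyter notebook, use either of the following, in a cell that runs before anything initializes CUDA:

# Option 1: plain Python
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"

# Option 2: IPython magics
%env CUDA_DEVICE_ORDER=PCI_BUS_ID
%env CUDA_VISIBLE_DEVICES=0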
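
A sketch of the in-notebook approach wrapped in a function, assuming it is called in the first cell of a fresh kernel. The helper name `restrict_visible_gpus` is hypothetical; the only real interface it relies on is the two environment variables shown above. The warning is a heuristic: if torch has already been imported and has initialized CUDA, the change may have no effect.

```python
import os
import sys


def restrict_visible_gpus(device_ids):
    """Expose only the given physical GPU ids to CUDA in this process.

    Hypothetical helper; returns the value written to CUDA_VISIBLE_DEVICES.
    """
    if "torch" in sys.modules:
        # Assumption: once CUDA is initialized, changing the variable may be ignored.
        print("warning: torch is already imported; restart the kernel if this has no effect")
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Make only physical GPU 1 visible; inside this process it is then addressed as cuda:0.
restrict_visible_gpus([1])
```

Because the setting is per process, each notebook kernel can expose a different GPU this way.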

You need to set the variable before launching the Jupyter notebook:

CUDA_VISIBLE_DEVICES="0" jupyter notebook

Use the CUDA_VISIBLE_DEVICES environment variable to select the GPUs you want to use. If multiple GPUs are visible, the Trainer will use all of them; that is expected and not a bug.

With all of the solutions above, either all of my GPUs are still used or I get CUDA device errors. As an alternative, I override the TrainingArguments class. It may have undiscovered issues, though.

  • Background: I have more than one GPU. With the Hugging Face Trainer, all devices are involved in training.
  • Problem: The Trainer seems to use DDP after checking the device and n_gpu properties of TrainingArguments, and _setup_devices in TrainingArguments controls the overall device setting.
  • Temporary remedy: Instead of overriding _setup_devices (since several functions depend on it), I override the device and n_gpu properties directly. With this, I don’t need to set os.environ or prefix the python command with CUDA_VISIBLE_DEVICES to use a single GPU. However, that may still be required if you want to use a selected two or three GPUs out of four.

import torch
from transformers import TrainingArguments


class customTrainingArguments(TrainingArguments):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    # @torch_required is omitted so the snippet needs no extra import from transformers.
    @property
    def device(self) -> "torch.device":
        """
        The device used by this process.
        Change the index to the GPU you want to use.
        """
        return torch.device("cuda:3")

    @property
    def n_gpu(self):
        """
        The number of GPUs used by this process.
        Note:
            This will only be greater than one when you have multiple GPUs available but are not using distributed
            training. For distributed training, it will always be 1.
        """
        # The original implementation calls `self._setup_devices` here, which
        # would count every visible GPU. Skip it and set the count to one manually.
        self._n_gpu = 1
        return self._n_gpu

This is normal: PyTorch numbers the visible devices from 0 to N-1, where N is the number of visible devices. So cuda:0 in PyTorch is the first device you made visible, in this case GPU 2.
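
The renumbering can be illustrated with a small pure-Python sketch. The function name `logical_to_physical` is hypothetical, and it assumes CUDA_VISIBLE_DEVICES holds a comma-separated list of integer ids (UUID entries are not handled):

```python
def logical_to_physical(cuda_visible_devices):
    """Map PyTorch device names to the physical ids listed in CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration: the i-th entry of the variable
    becomes device cuda:i inside the process.
    """
    ids = [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]
    return {f"cuda:{logical}": physical for logical, physical in enumerate(ids)}


if __name__ == "__main__":
    # With CUDA_VISIBLE_DEVICES="2,3", cuda:0 refers to physical GPU 2.
    print(logical_to_physical("2,3"))
```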

It works. I just have to comment out the @torch_required and add import torch at line 1, then I can freely choose whatever GPU I want. Thanks a million.

It’s very inconvenient to restart Jupyter Lab/Notebook each time just to change the device. Also, I may want to run several notebooks on different devices. PyTorch Lightning, for example, lets you select the device for each run.

Ahhh, thank you. That successfully restricts the GPUs accessed in the notebook.

No, you need to set that environment variable in the launch command, not inside your training script:

CUDA_VISIBLE_DEVICES="0" python main.py