transformers: TrainingArguments does not support `mps` device (Mac M1 GPU)

System Info

  • transformers version: 4.21.0.dev0
  • Platform: macOS-12.4-arm64-arm-64bit
  • Python version: 3.8.9
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.12.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

export TASK_NAME=wnli
python run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir /tmp/$TASK_NAME/

Expected behavior

When running Trainer.train on a machine with an MPS GPU, it still uses only the CPU. I expected it to use the MPS GPU. This is supported by torch in the newest version, 1.12.0, and we can check whether the MPS GPU is available using torch.backends.mps.is_available().

It seems like the issue lies in the TrainingArguments._setup_devices method, which doesn’t appear to allow for the case where device = "mps".
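For reference, a minimal sketch of the availability check (torch.backends.mps only exists in PyTorch >= 1.12):

import torch

# True when the installed PyTorch build was compiled with MPS support
print(torch.backends.mps.is_built())
# True when an MPS-capable GPU is actually available at runtime
print(torch.backends.mps.is_available())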


Most upvoted comments

A simple hack fixed the issue: overriding the device property of TrainingArguments:

import torch
from transformers import TrainingArguments


class TrainingArgumentsWithMPSSupport(TrainingArguments):

    @property
    def device(self) -> torch.device:
        if torch.cuda.is_available():
            return torch.device("cuda")
        elif torch.backends.mps.is_available():
            return torch.device("mps")
        else:
            return torch.device("cpu")

This at least shows that it might just be the aforementioned _setup_devices that needs changing.
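For completeness, a minimal usage sketch of the subclass above (the output_dir value is just a placeholder):

training_args = TrainingArgumentsWithMPSSupport(output_dir="/tmp/mps-test")
print(training_args.device)  # expected: device(type='mps') on an M1 Mac with PyTorch 1.12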

Hello @V-Sher, it is yet to be released. For the time being, you can install transformers from source to use this feature via the command below:

pip install git+https://github.com/huggingface/transformers

Now that PyTorch 1.12.1 is out I think we should reopen this issue! cc @pacman100

This is not supported yet, as this has been introduced by PyTorch 1.12, which also breaks all speech models due to a regression there. We will look into the support for Mac M1 GPUs once we officially support PyTorch 1.12 (probably won’t be before they do a patch 1.12.1).

Another observation: some PyTorch operations have not been implemented for mps and will throw an error. One way to get around that is to set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1, which will fall back to the CPU for these operations. It still throws a UserWarning, however.
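A minimal sketch of setting the flag from Python (it must be set before the MPS backend is initialized, so set it before importing torch, or simply export it in the shell before launching the script):

import os

# Must be set before torch initializes the MPS backend.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch  # unimplemented MPS ops now fall back to the CPU (with a UserWarning)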

bitsandbytes

bitsandbytes does not work with mps.

After installing the transformers package from source, as suggested by @pacman100:

pip install git+https://github.com/huggingface/transformers

the mps device is used with the standard TrainingArguments class; the custom TrainingArgumentsWithMPSSupport class is no longer required.

Now the M1 Mac GPU is ~90% utilized.
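A quick way to confirm (a minimal sketch; the output_dir value is just a placeholder):

from transformers import TrainingArguments

args = TrainingArguments(output_dir="/tmp/mps-test")
print(args.device)  # expected: device(type='mps') with a source install of transformers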

Hi all: I am fine-tuning a BERT model with the Hugging Face Trainer API on macOS Ventura (Intel), Python 3.10 and Torch 2.0.0. It takes 14 min in a simple scenario on the CPU, with no problem. I switched to the GPU with mps. Initially the GPU was not used, but after redefining TrainingArguments in this way, it worked:

class TrainingArgumentsWithMPSSupport(TrainingArguments):
    @property
    def device(self) -> torch.device:
        # `device` is the module-level torch.device("mps:0") defined in the full script below
        return torch.device(device)

training_args = TrainingArgumentsWithMPSSupport(...)

But the problem is that the improvement over the CPU is small (barely from 14 min to 10 min). The activity monitor shows GPU usage peaking at only about 15%.

Any idea about why such poor improvement?

Thanks for any help, Alberto.

This is the full code:

from transformers import BertForSequenceClassification, BertTokenizerFast, Trainer, TrainingArguments
import nlp  # legacy predecessor of the `datasets` library
import torch
from torch.utils.data import Dataset

device = torch.device("mps:0")

_DATASET = '../IMDB.csv'

dataset = nlp.load_dataset('csv', data_files=[_DATASET], split='train[:1%]')

dataset = dataset.train_test_split(test_size=0.3)
train_set = dataset['train']  
test_set = dataset['test']


class CustomDataset(Dataset):

    def __init__(self, dataset, mytokenizer):
        self.tokenizer = mytokenizer
        self.dataset = dataset
        self.texts = dataset["text"] 

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        theText = self.dataset[index]['text']
        theLabel = self.dataset[index]['label']
        inputs = self.tokenizer(theText, max_length=512, padding='max_length', truncation=True)
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]

        # Note: the Trainer moves each batch to args.device itself, so moving the
        # individual tensors here is redundant (though harmless).
        ids = torch.tensor(ids, dtype=torch.long).to(device)
        mask = torch.tensor(mask, dtype=torch.long).to(device)
        token_type_ids = torch.tensor(token_type_ids, dtype=torch.long).to(device)
        theLabel = torch.tensor(theLabel, dtype=torch.long).to(device)

        result = {
            'input_ids': ids,
            'attention_mask': mask,
            'token_type_ids': token_type_ids,
            'label': theLabel
        }

        return result


model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

training_set = CustomDataset(train_set, tokenizer)
testing_set = CustomDataset(test_set, tokenizer)

batch_size = 8
epochs = 2
warmup_steps = 500
weight_decay = 0.01


class TrainingArgumentsWithMPSSupport(TrainingArguments):
    @property
    def device(self) -> torch.device:
        return torch.device(device)



training_args = TrainingArgumentsWithMPSSupport(
    output_dir='./results',
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_steps=warmup_steps,
    weight_decay=weight_decay,
    # evaluate_during_training=True,
    evaluation_strategy='steps',
    logging_dir='./logs',
)

trainer = Trainer(
    model=model.to(device),
    args=training_args,
    train_dataset=training_set,
    eval_dataset=testing_set
)

trainer.train()  # full finetune
trainer.evaluate()

We’ve also observed a drop in metrics when training; see this issue.

I have no idea, since we haven’t tried and tested it yet. And as I said, our whole CI is pinned to PyTorch < 1.12 right now, so until that pin is dropped we can’t test the integration 😃. You can certainly try it on your own fork in the meantime!