transformers: clip_grad_norm on Multiple GPUs: (CUDA error: device-side assert triggered)

Environment info

  • transformers version: 4.0.0
  • Platform: Linux-5.4.0-53-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.9
  • PyTorch version (GPU?): 1.7.0 (True)
  • Tensorflow version (GPU?): 2.3.1 (True)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

Who can help

@LysandreJik @sgugger

Information

Model I am using: RoBERTa

The problem arises when using:

  • my own modified scripts: trainer.train() runs for a bit, then fails with the following output:
RuntimeError                              Traceback (most recent call last)
<ipython-input-11-3435b262f1ae> in <module>
----> 1 trainer.train()

~/anaconda3/envs/transformers/lib/python3.7/site-packages/transformers/trainer.py in train(self, model_path, trial)
    759                         torch.nn.utils.clip_grad_norm_(amp.master_params(self.optimizer), self.args.max_grad_norm)
    760                     else:
--> 761                         torch.nn.utils.clip_grad_norm_(model.parameters(), self.args.max_grad_norm)
    762 
    763                     if is_torch_tpu_available():

~/anaconda3/envs/transformers/lib/python3.7/site-packages/torch/nn/utils/clip_grad.py in clip_grad_norm_(parameters, max_norm, norm_type)
     33         total_norm = torch.norm(torch.stack([torch.norm(p.grad.detach(), norm_type).to(device) for p in parameters]), norm_type)
     34     clip_coef = max_norm / (total_norm + 1e-6)
---> 35     if clip_coef < 1:
     36         for p in parameters:
     37             p.grad.detach().mul_(clip_coef.to(p.grad.device))

RuntimeError: CUDA error: device-side assert triggered
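
Since device-side asserts are reported asynchronously, the clip_grad_norm_ frame above is not necessarily where the failing operation actually ran. A minimal sketch (not part of the original run) of how the real failure point can usually be surfaced, either by forcing synchronous kernel launches or by replaying a batch on CPU:

import os

# Must be set before CUDA is first initialized (i.e. before any .to("cuda") call),
# so that the kernel that actually fails is the one reported in the traceback.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Alternatively, running a single batch on CPU usually turns the device-side
# assert into a readable Python exception (e.g. an out-of-range index error).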

The task I am working on is:

  • my own task or dataset: training RoBERTa for binary sequence classification on text.

To reproduce

Steps to reproduce the behavior:

  1. Load pre-processed dataset from disk using datasets.Dataset.load_from_disk()
  2. Instantiate RoBERTa from pretrained (roberta-base) with config mods (num_labels = 2)
  3. Create and run the trainer. See the full code below.
import os

import datasets
import torch
from transformers import RobertaTokenizerFast

BLOCK_SIZE = 512
tok = RobertaTokenizerFast.from_pretrained("./art_tok_onefile_roberta_tuned/")

ds_root = '/media/b/My Passport/datasets/'
tokenized = datasets.Dataset.load_from_disk(os.path.join(ds_root, 'art_unit_tokenized_balanced'))

# Return only the columns the model consumes, as torch tensors.
columns_to_return = ['input_ids', 'attention_mask', 'labels']
tokenized.set_format(type='torch', columns=columns_to_return)

from transformers import RobertaConfig, RobertaForSequenceClassification

# Load the roberta-base config, overriding the number of classification labels.
config = RobertaConfig.from_pretrained("roberta-base",
                                       vocab_size=tok.vocab_size,
                                       max_position_embeddings=514,
                                       num_labels=2)

model = RobertaForSequenceClassification.from_pretrained('roberta-base', config=config)

# Note: this optimizer is never passed to the Trainer below, which creates its own.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# Freeze the RoBERTa encoder so that only the classification head is trained.
for param in model.base_model.parameters():
    param.requires_grad = False

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./roberta_train_test",
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=128,
    save_steps=50,
    save_total_limit=2,
    logging_steps=10,
    # fp16=True  # enable mixed-precision training via AMP; omitted for now
)

train_test_bal = tokenized.train_test_split(test_size=0.1)

trainer = Trainer(
    model=model,
    args=training_args,
    #data_collator=collate_fn,
    train_dataset=train_test_bal['train']
)

trainer.train()
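
Not part of the original report, but a quick sanity check on the data is a common first step for this kind of device-side assert: with num_labels=2 every label must be an integer in {0, 1}, and every input id must be smaller than the model's embedding size. A minimal sketch against the tokenized dataset and model defined above:

# Check that labels fall inside the range expected by the classification head.
labels = torch.as_tensor([int(l) for l in tokenized["labels"]])
print("label range:", labels.min().item(), "to", labels.max().item())  # expect 0 to 1

# Check that no token id indexes past the embedding matrix.
max_id = max(int(row.max()) for row in tokenized["input_ids"])
print("max input id:", max_id, "| model vocab size:", model.config.vocab_size)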

Expected behavior

Training runs to completion (all five epochs) without errors.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 21 (1 by maintainers)

Most upvoted comments

@sgugger I’ll work on making the datasets public and will post here. In the meantime, I’ll run your snippet.

@LysandreJik, it’s not a memory issue - all four GPUs sit at ~87% volatile GPU utilization for the duration.