transformers: clip_grad_norm on multiple GPUs fails with "CUDA error: device-side assert triggered"
Environment info
- transformers version: 4.0.0
- Platform: Linux-5.4.0-53-generic-x86_64-with-debian-buster-sid
- Python version: 3.7.9
- PyTorch version (GPU?): 1.7.0 (True)
- Tensorflow version (GPU?): 2.3.1 (True)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes
Information
Model I am using (Bert, XLNet …): RoBERTa
The problem arises when using:
- my own modified scripts (details below)

trainer.train() runs for a bit, then fails with the following output:
RuntimeError Traceback (most recent call last)
<ipython-input-11-3435b262f1ae> in <module>
----> 1 trainer.train()
~/anaconda3/envs/transformers/lib/python3.7/site-packages/transformers/trainer.py in train(self, model_path, trial)
759 torch.nn.utils.clip_grad_norm_(amp.master_params(self.optimizer), self.args.max_grad_norm)
760 else:
--> 761 torch.nn.utils.clip_grad_norm_(model.parameters(), self.args.max_grad_norm)
762
763 if is_torch_tpu_available():
~/anaconda3/envs/transformers/lib/python3.7/site-packages/torch/nn/utils/clip_grad.py in clip_grad_norm_(parameters, max_norm, norm_type)
33 total_norm = torch.norm(torch.stack([torch.norm(p.grad.detach(), norm_type).to(device) for p in parameters]), norm_type)
34 clip_coef = max_norm / (total_norm + 1e-6)
---> 35 if clip_coef < 1:
36 for p in parameters:
37 p.grad.detach().mul_(clip_coef.to(p.grad.device))
RuntimeError: CUDA error: device-side assert triggered
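Note that the failure surfaces inside clip_grad_norm_, but CUDA device-side asserts are reported asynchronously, so the kernel that actually failed usually ran earlier (for example an out-of-range index in an embedding or loss computation). A minimal sketch, assuming the same script as below, of one common way to get a more precise error location:

import os

# Forcing synchronous kernel launches makes the assert surface at the operation
# that actually failed rather than at the next CUDA call (here, clip_grad_norm_).
# This must be set before CUDA is initialized, i.e. before the model is moved
# to the GPU.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Alternatively, running a single batch on CPU turns the device-side assert
# into a readable Python exception.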
The task I am working on is:
- my own task or dataset: training RoBERTa for sequence classification (binary labels on text).
To reproduce
Steps to reproduce the behavior:
- Load pre-processed dataset from disk using datasets.Dataset.load_from_disk()
- Instantiate RoBERTa from pretrained (roberta-base) with config mods (num_labels = 2)
- Create and run the trainer. See full code below.
import os

import datasets
import torch
from transformers import (
    RobertaConfig,
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

BLOCK_SIZE = 512

# Custom tokenizer fine-tuned on the corpus.
tok = RobertaTokenizerFast.from_pretrained("./art_tok_onefile_roberta_tuned/")

# Load the pre-tokenized dataset from disk.
ds_root = '/media/b/My Passport/datasets/'
tokenized = datasets.Dataset.load_from_disk(os.path.join(ds_root, 'art_unit_tokenized_balanced'))
columns_to_return = ['input_ids', 'attention_mask', 'labels']
tokenized.set_format(type='torch', columns=columns_to_return)

# Note: this config is immediately overridden by the from_pretrained() call below.
config = RobertaConfig(
    vocab_size=tok.vocab_size,
    max_position_embeddings=514,
    num_labels=2,
)
config = RobertaConfig.from_pretrained(
    "roberta-base",
    vocab_size=tok.vocab_size,
    max_position_embeddings=514,
    num_labels=2,
)
model = RobertaForSequenceClassification.from_pretrained('roberta-base', config=config)

# This optimizer is never passed to Trainer, so Trainer creates its own internally.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# Freeze the base model; only the classification head is trained.
for param in model.base_model.parameters():
    param.requires_grad = False

training_args = TrainingArguments(
    output_dir="./roberta_train_test",
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=128,
    save_steps=50,
    save_total_limit=2,
    logging_steps=10,
    # fp16=True  # Enable low-precision via AMP - omitted for now.
)

train_test_bal = tokenized.train_test_split(test_size=0.1)

trainer = Trainer(
    model=model,
    args=training_args,
    # data_collator=collate_fn,
    train_dataset=train_test_bal['train'],
)
trainer.train()
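For reference, device-side asserts in a classification setup like this are often caused by an out-of-range index rather than by gradient clipping itself. The following is a minimal sanity-check sketch (reusing the tokenized dataset and model defined above, and assuming the formatted columns come back as padded tensors) that verifies labels fall in [0, num_labels) and token ids fit the pretrained embedding matrix, which the custom tokenizer's vocabulary may not match:

# Sanity-check sketch: assumes the tokenized dataset and model from the snippet above.
labels = tokenized['labels']
input_ids = tokenized['input_ids']

num_labels = model.config.num_labels
embedding_rows = model.get_input_embeddings().num_embeddings

# Labels outside [0, num_labels) or token ids >= embedding_rows trigger
# device-side asserts in the loss / embedding kernels.
assert int(labels.min()) >= 0 and int(labels.max()) < num_labels, "label out of range"
assert int(input_ids.max()) < embedding_rows, "token id exceeds embedding size"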
Expected behavior
The model trains for the duration of the training cycle.
About this issue
- State: closed
- Created 4 years ago
- Comments: 21 (1 by maintainers)
@sgugger I’ll work on making the datasets public and will post here. In the meantime, I’ll run your snippet.
@LysandreJik , it’s not a memory issue - all four GPUs are at ~87% volatile util for the duration.