datasets: Load large text file for LM pre-training resulting in OOM
I tried to pretrain Longformer using transformers and datasets, but I ran into OOM (out-of-memory) issues when loading a large text file. My script looks roughly like this:
from dataclasses import dataclass
from typing import Dict, List, Tuple

import torch
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer


@dataclass
class DataCollatorForDatasetsLanguageModeling(DataCollatorForLanguageModeling):
    """
    Data collator used for language modeling based on DataCollatorForLazyLanguageModeling
    - collates batches of tensors, honoring their tokenizer's pad_token
    - preprocesses batches for masked language modeling
    """

    block_size: int = 512

    def __call__(self, examples: List[dict]) -> Dict[str, torch.Tensor]:
        # Each example is one row of the 'text' dataset; tokenization happens here, per batch.
        examples = [example['text'] for example in examples]
        batch, attention_mask = self._tensorize_batch(examples)
        if self.mlm:
            inputs, labels = self.mask_tokens(batch)
            return {"input_ids": inputs, "labels": labels}
        else:
            labels = batch.clone().detach()
            if self.tokenizer.pad_token_id is not None:
                # Ignore padding positions when computing the loss.
                labels[labels == self.tokenizer.pad_token_id] = -100
            return {"input_ids": batch, "labels": labels}

    def _tensorize_batch(self, examples: List[str]) -> Tuple[torch.Tensor, torch.Tensor]:
        if self.tokenizer._pad_token is None:
            raise ValueError(
                "You are attempting to pad samples but the tokenizer you are using"
                f" ({self.tokenizer.__class__.__name__}) does not have one."
            )
        # Skip empty lines, then pad/truncate every sequence to block_size.
        tensor_examples = self.tokenizer.batch_encode_plus(
            [ex for ex in examples if ex],
            max_length=self.block_size,
            return_tensors="pt",
            pad_to_max_length=True,
            return_attention_mask=True,
            truncation=True,
        )
        input_ids, attention_mask = tensor_examples["input_ids"], tensor_examples["attention_mask"]
        return input_ids, attention_mask


# tokenizer, model, args and model_path are defined elsewhere in the script (not shown).
dataset = load_dataset('text', data_files='train.txt', cache_dir="./", split='train')
data_collator = DataCollatorForDatasetsLanguageModeling(tokenizer=tokenizer, mlm=True,
                                                        mlm_probability=0.15, block_size=tokenizer.max_len)
trainer = Trainer(model=model, args=args, data_collator=data_collator,
                  train_dataset=dataset, prediction_loss_only=True)
trainer.train(model_path=model_path)
This train.txt is about 1.1GB and has 90k lines, where each line is a sequence of 4k words. During training, memory usage climbed rapidly (see the graph in the original issue) and the run hit OOM before training finished.
Could you please give me any suggestions on why this happened and how to fix it? Thanks.
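One way to narrow down where the memory goes is to iterate over the data pipeline alone, without the model, and sample the process's peak memory as it runs. The sketch below is only an illustration under stated assumptions: the helper names `peak_rss_mib` and `trace_peak_memory` are hypothetical (they are not part of datasets or transformers), and it relies on the standard-library `resource` module, so it works on Unix-like systems only.

```python
import resource
import sys
from typing import Any, Callable, Iterable, List


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def trace_peak_memory(
    batches: Iterable[Any],
    step: Callable[[Any], Any],
    every: int = 100,
) -> List[float]:
    """Run `step` on each batch and record peak RSS every `every` batches.

    Hypothetical helper: `batches` can be any iterable (for example a
    DataLoader built from the dataset and collator), and `step` is whatever
    should be exercised, such as a no-op to test the data pipeline alone.
    """
    samples: List[float] = []
    for i, batch in enumerate(batches):
        step(batch)
        if i % every == 0:
            samples.append(peak_rss_mib())
    return samples


if __name__ == "__main__":
    # Stand-in workload: a real run would pass the DataLoader instead of range().
    for value in trace_peak_memory(range(250), lambda batch: None, every=100):
        print(f"peak RSS: {value:.1f} MiB")
```

If the samples keep growing with a no-op `step`, the growth probably comes from data loading or collation; if they stay flat, the training loop or a logging integration is the more likely place to look. Because this reports the peak rather than the current usage, it shows growth but not memory being released.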
About this issue
- State: open
- Created 4 years ago
- Comments: 27 (9 by maintainers)
@lhoestq sure, here it is: https://colab.research.google.com/drive/1ba09ZOpyHGAOQLcsxiQAHRXl10qnMU5o?usp=sharing Let me know if the link works and whether it reproduces the issue. For me it does: once training starts, RAM usage keeps increasing.
Thanks!
@lhoestq could be, but if we set wandb to false this should not happen. I am going to try.
This seems to be on the transformers library side. If you have more information (pip env) or, even better, a Colab reproducing the error, we can investigate.
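For the "pip env" part, a short script can collect the installed versions of the relevant packages so they can be pasted into the thread. This is a minimal sketch using only the standard library; the function name `collect_versions` and the default package list are assumptions chosen for illustration.

```python
import platform
from importlib import metadata
from typing import Dict, Iterable


def collect_versions(
    packages: Iterable[str] = ("datasets", "transformers", "torch", "pyarrow"),
) -> Dict[str, str]:
    """Map each package name to its installed version, or 'not installed'."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Running it inside the same environment (or Colab runtime) as the training script matters, since a different interpreter may see different package versions.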
There was a memory leak issue fixed recently in master. You should install from source and see if it fixes your problem.