datasets: Load large text file for LM pre-training resulting in OOM

I tried to pretrain Longformer using transformers and datasets, but I ran into OOM issues when loading a large text file. My script looks roughly like this:

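from dataclasses import dataclass
from typing import Dict, List, Tuple

import torch
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer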

@dataclass
class DataCollatorForDatasetsLanguageModeling(DataCollatorForLanguageModeling):
    """
    Data collator used for language modeling based on DataCollatorForLazyLanguageModeling
    - collates batches of tensors, honoring their tokenizer's pad_token
    - preprocesses batches for masked language modeling
    """

    block_size: int = 512

    def __call__(self, examples: List[dict]) -> Dict[str, torch.Tensor]:
        examples = [example['text'] for example in examples]
        batch, attention_mask = self._tensorize_batch(examples)
        if self.mlm:
            inputs, labels = self.mask_tokens(batch)
            return {"input_ids": inputs, "labels": labels}
        else:
            labels = batch.clone().detach()
            if self.tokenizer.pad_token_id is not None:
                labels[labels == self.tokenizer.pad_token_id] = -100
            return {"input_ids": batch, "labels": labels}

    def _tensorize_batch(self, examples: List[str]) -> Tuple[torch.Tensor, torch.Tensor]:

        if self.tokenizer._pad_token is None:
            raise ValueError(
                "You are attempting to pad samples but the tokenizer you are using"
                f" ({self.tokenizer.__class__.__name__}) does not have one."
            )

        tensor_examples = self.tokenizer.batch_encode_plus(
            [ex for ex in examples if ex],
            max_length=self.block_size,
            return_tensors="pt",
            pad_to_max_length=True,
            return_attention_mask=True,
            truncation=True,
        )

        input_ids, attention_mask = tensor_examples["input_ids"], tensor_examples["attention_mask"]
        return input_ids, attention_mask

# tokenizer, model, args, and model_path are defined elsewhere in the script.
dataset = load_dataset('text', data_files='train.txt', cache_dir="./", split='train')
data_collator = DataCollatorForDatasetsLanguageModeling(tokenizer=tokenizer, mlm=True,
                      mlm_probability=0.15, block_size=tokenizer.max_len)
trainer = Trainer(model=model, args=args, data_collator=data_collator,
                      train_dataset=dataset, prediction_loss_only=True)
trainer.train(model_path=model_path)

This train.txt is about 1.1GB and has 90k lines, where each line is a sequence of 4k words. During training, memory usage increased quickly, as shown in the following graph, and resulted in OOM before training finished.

Could you please give me any suggestions on why this happened and how to fix it? Thanks.
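
For scale, the figures in the post can be turned into a rough estimate of how much data is involved per line and per batch. This is only a back-of-the-envelope sketch: the block size of 4096 and the batch size of 8 are assumptions chosen for illustration, not values given in the post.

```python
# Figures taken from the post above.
file_bytes = 1.1e9        # "about 1.1GB"
num_lines = 90_000        # "90k lines"
words_per_line = 4_000    # "4k words" per line

# Assumptions for illustration only (not stated in the post).
block_size = 4096         # assumed value of tokenizer.max_len
batch_size = 8            # hypothetical per-device batch size
bytes_per_token_id = 8    # token ids are stored as int64 tensors

avg_bytes_per_line = file_bytes / num_lines
total_words = num_lines * words_per_line

# The collator builds input_ids, attention_mask, and labels,
# each of shape (batch_size, block_size).
tensors_per_batch = 3
batch_bytes = batch_size * block_size * bytes_per_token_id * tensors_per_batch

print(f"average line size: {avg_bytes_per_line / 1e3:.1f} kB")
print(f"total words: {total_words / 1e6:.0f} million")
print(f"one padded batch: {batch_bytes / 2**20:.2f} MiB")
```

Under these assumptions a single padded batch is well under a megabyte, so RAM that keeps climbing during training would suggest something accumulating across steps rather than the size of any one batch.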

About this issue

State: open
Created 4 years ago
Comments: 27 (9 by maintainers)

Most upvoted comments

@lhoestq sure, here it is: https://colab.research.google.com/drive/1ba09ZOpyHGAOQLcsxiQAHRXl10qnMU5o?usp=sharing Let me know if the link works and whether it reproduces the issue. For me it does: once training starts, RAM usage keeps increasing.

Let me know. Thanks!

gaceladri on Feb 15, 2021

@lhoestq could be, but if we set wandb to false this should not happen. I am going to try.

gaceladri on Feb 15, 2021

This seems to be on the transformers library side.

If you have more information (pip env) or, even better, a colab reproducing the error, we can investigate.

thomwolf on Oct 5, 2020
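
One way to gather the kind of information requested above is to record memory usage at regular intervals while a step function runs repeatedly. The sketch below uses only the standard library's tracemalloc, which sees Python-heap allocations only, so it would miss memory held by native code such as PyTorch tensors or Arrow buffers. The names track_memory_growth, leaky_step, and clean_step are hypothetical stand-ins, not part of datasets or transformers.

```python
import tracemalloc


def track_memory_growth(step_fn, num_steps, report_every=10):
    """Run step_fn num_steps times and record traced Python-heap bytes."""
    tracemalloc.start()
    readings = []
    try:
        for step in range(1, num_steps + 1):
            step_fn()
            if step % report_every == 0:
                current, _peak = tracemalloc.get_traced_memory()
                readings.append((step, current))
    finally:
        tracemalloc.stop()
    return readings


# Hypothetical stand-in for a step that leaks: every "batch" stays referenced.
leaked = []


def leaky_step():
    leaked.append(bytearray(100_000))


# Hypothetical stand-in for a healthy step: the "batch" is freed on return.
def clean_step():
    _batch = bytearray(100_000)


leaky = track_memory_growth(leaky_step, num_steps=50)
clean = track_memory_growth(clean_step, num_steps=50)

for (step, leaky_bytes), (_, clean_bytes) in zip(leaky, clean):
    print(f"step {step:3d}: leaky={leaky_bytes:>9,d} B  clean={clean_bytes:>9,d} B")
```

In a real run, step_fn would wrap one stage at a time (for example, only iterating the dataset, or only calling the collator), so that a steadily rising reading can be attributed to that stage.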

There was a memory leak issue fixed recently in master. You should install from source and see if it fixes your problem.

sgugger on Sep 17, 2020