transformers: Trainer errors out when concatenating different sequence length batches with distributed training and IterableDataset

System Info

  • transformers version: 4.33.3
  • Platform: Linux-5.10.186-179.751.amzn2.x86_64-x86_64-with-glibc2.10
  • Python version: 3.8.17
  • Huggingface_hub version: 0.17.3
  • Safetensors version: 0.3.3
  • Accelerate version: 0.23.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: A100
  • Using distributed or parallel set-up in script?: torchrun --nproc-per-node 2 script.py

Who can help?

@muellerzr, @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction


    import torch
    from torch.utils.data import IterableDataset
    from transformers import (
        AutoModelForMaskedLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    data = [
        {
            "input_ids": torch.tensor([101, 2040, 2001, 1999, 14936, 102]),
            "token_type_ids": torch.tensor([0, 0, 0, 0, 0, 0]),
            "attention_mask": torch.tensor([1, 1, 1, 1, 1, 1]),
        },
        {
            "input_ids": torch.tensor([101, 2040, 102]),
            "token_type_ids": torch.tensor([0, 0, 0]),
            "attention_mask": torch.tensor([1, 1, 1]),
        },
        {
            "input_ids": torch.tensor([101, 2040, 2001, 1999]),
            "token_type_ids": torch.tensor([0, 0, 0, 0]),
            "attention_mask": torch.tensor([1, 1, 1, 1]),
        },
        {
            "input_ids": torch.tensor([101, 2040, 2001, 1999, 14936, 102]),
            "token_type_ids": torch.tensor([0, 0, 0, 0, 0, 0]),
            "attention_mask": torch.tensor([1, 1, 1, 1, 1, 1]),
        },
        {
            "input_ids": torch.tensor([101]),
            "token_type_ids": torch.tensor([00]),
            "attention_mask": torch.tensor([1]),
        },
        {
            "input_ids": torch.tensor([101]),
            "token_type_ids": torch.tensor([00]),
            "attention_mask": torch.tensor([1]),
        },
    ]

    class ExampleDataset(IterableDataset):
        def __init__(self, data):
            super().__init__()
            self.data = data * 20

        def __iter__(self):
            for x in self.data:
                yield x

        def __len__(self):
            return len(self.data)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
    train_args = TrainingArguments(
        output_dir="output",
        num_train_epochs=3,
        per_device_train_batch_size=2,
    )
    dc = DataCollatorForLanguageModeling(tokenizer=tokenizer)

    trainer = Trainer(
        train_dataset=ExampleDataset(data),
        model=model,
        args=train_args,
        data_collator=dc,
    )
    trainer.train()

I run the above script with the command torchrun --nproc-per-node 2 script.py. This results in the following error.

Traceback (most recent call last):
  File "fm_model/data/scratch.py", line 242, in <module>
    trainer.train()
  File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/transformers/trainer.py", line 1556, in train
    return inner_training_loop(
  File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/transformers/trainer.py", line 1816, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/accelerate/data_loader.py", line 597, in __iter__
    next_batch, next_batch_info = self._fetch_batches(main_iterator)
  File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/accelerate/data_loader.py", line 528, in _fetch_batches
    batch = concatenate(batches, dim=0)
  File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/accelerate/utils/operations.py", line 496, in concatenate
    return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
  File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/accelerate/utils/operations.py", line 496, in <dictcomp>
    return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
  File "/opt/conda/envs/fmmodel/lib/python3.8/site-packages/accelerate/utils/operations.py", line 499, in concatenate
    return torch.cat(data, dim=dim)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1 but got size 6 for tensor number 1 in the list.

This happens because Trainer exposes no argument for preparing the dataloader with split_batches, so training fails at the concatenate call shown in the traceback: the fetched batches are concatenated without first being padded to a common sequence length.

To make an iterable dataset usable with Trainer, something probably needs to change in accelerate or in Trainer so that distributed dataloading works when the fetched batches end up with different lengths.
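
The root cause is easy to see outside of Trainer: DataCollatorForLanguageModeling pads each batch only to that batch's longest sequence, so consecutive batches can have different widths, and concatenating them fails. Below is a minimal sketch using the same tokenizer and collator as the reproduction above; it mirrors what accelerate's concatenate does per key, it is not the accelerate code itself.

    import torch
    from transformers import AutoTokenizer, DataCollatorForLanguageModeling

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    dc = DataCollatorForLanguageModeling(tokenizer=tokenizer)

    # Each call pads only within its own batch, so the two batches end up
    # with different sequence lengths (6 vs 4).
    batch_a = dc([{"input_ids": [101, 2040, 2001, 1999, 14936, 102]},
                  {"input_ids": [101, 2040, 102]}])
    batch_b = dc([{"input_ids": [101]},
                  {"input_ids": [101, 2040, 2001, 1999]}])

    # Raises: "Sizes of tensors must match except in dimension 0."
    torch.cat([batch_a["input_ids"], batch_b["input_ids"]], dim=0)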

Expected behavior

  1. Automatic padding in accelerate when the batches produced have different lengths OR
  2. A way to specify split_batches so that a full batch is produced and then split across the different processes

About this issue

  • Original URL
  • State: closed
  • Created 9 months ago
  • Reactions: 1
  • Comments: 26 (8 by maintainers)

Most upvoted comments

We’re aware of it and working towards a solution

Appreciate the maintainers' efforts on this issue. Is there any update or a temporary solution to fix it?

If you were using dispatch_batches=False and couldn't do epoch-based training, @muellerzr added support for that, so that route is now feasible with accelerate + iterable datasets: https://github.com/huggingface/accelerate/pull/2066

We've released a patch in Accelerate which gives a clearer error about what's going on, and we'll do the same in the Trainer shortly to give direct instructions on what to do 😃

Hi everyone, the issue is not directly linked to IterableDataset but rather to the dispatch_batches arg, which is set to True when passing an IterableDataset.

With dispatch_batches=True, we fetch and process the data on the main process and broadcast it to the other processes. This is more reliable and uses less compute, since we do it on only one process. However, there is indeed an issue when the batches don't all have the same size, because we end up trying to concatenate tensors of different sizes.

As a temporary solution, you can either:

  • pass dispatch_batches=False in TrainingArguments. This was the default behavior of Trainer before the accelerate integration; it uses IterableDatasetShard instead of DataLoaderDispatcher (see the sketch after this list).
  • pass split_batches=True. It splits one full batch into num_processes parts, so it requires the dataloader batch size (per_device_train_batch_size) to be a round multiple of the number of processes. For example, with per_device_train_batch_size = 16 and 4 processes, each process gets a batch size of 4.
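
For concreteness, here is the reproduction's TrainingArguments with the first workaround applied. This is a sketch based on the advice above: dispatch_batches=False on TrainingArguments is taken directly from that advice, while the exact place to set split_batches depends on the transformers/accelerate versions in use, so check your version's TrainingArguments before relying on it.

    train_args = TrainingArguments(
        output_dir="output",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        # Workaround 1: shard the IterableDataset per process
        # (IterableDatasetShard) instead of dispatching from the main process.
        dispatch_batches=False,
    )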

Related code:

    if self.split_batches:
        # One batch of the main iterator is dispatched and split.
        batch = next(iterator)
    else:
        # num_processes batches of the main iterator are concatenated then dispatched and split.
        # We add the batches one by one so we have the remainder available when drop_last=False.
        batches = []
        for _ in range(self.state.num_processes):
            batches.append(next(iterator))
        # The issue is here, since the batches do not necessarily have the same size.
        batch = concatenate(batches, dim=0)
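
A hypothetical way to get expected behavior (1) at this point would be to pad every fetched batch to a common sequence length before concatenating. The helper below is an illustrative sketch, not accelerate's actual code; pad_and_concatenate is a made-up name, and it assumes (batch, seq_len) tensors with a single pad value per key (labels would normally need -100 instead of 0).

    import torch
    import torch.nn.functional as F

    def pad_and_concatenate(batches, pad_id=0):
        """Pad each key's tensors to the widest sequence length, then concat."""
        out = {}
        for key in batches[0]:
            tensors = [b[key] for b in batches]
            max_len = max(t.shape[1] for t in tensors)
            out[key] = torch.cat(
                [F.pad(t, (0, max_len - t.shape[1]), value=pad_id) for t in tensors],
                dim=0,
            )
        return out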

Also, confirming that downgrading to transformers 4.30 is a temporary workaround (it also requires downgrading to accelerate==0.20.3) when using the Trainer API (it doesn't appear to work outside the Trainer API with these versions).

Thanks @ssharpe42 and @dwyatte for flagging and diagnosing the issue here. I just wanted to echo that this is a major issue: it is surprisingly hard to diagnose, not documented, and likely to affect many users who have no idea it is even a problem (for example, here is one HF forum post on the topic with no responses as of this writing that appears to be hitting the same problem).

Also, it seems quite likely that folks using Trainer with distributed training will be using IterableDatasets (I've also seen lots of Trainer users avoiding IterableDataset, and maybe this is why): the kinds of large training runs that require distributed training will very often also require large datasets that are best processed as IterableDatasets.

Thanks to the HF team, and hopefully this can be escalated; otherwise this is a complete blocker for many users!

FYI, I think this was a regression introduced between transformers 4.30.x and 4.31.x (I haven’t bisected the specific release, but 4.30.x will run the example here correctly)