accelerate: Batch size computation wrong when using DeepSpeed?

System Info

- `Accelerate` version: 0.18.0
- Platform: Linux-5.15.0-1031-gcp-x86_64-with-glibc2.35
- Python version: 3.8.16
- Numpy version: 1.24.2
- PyTorch version (GPU?): 2.0.0+cu117 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero_stage': 2}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Hi,

Apologies if I’m missing something obvious, but when running the following script using Accelerate with DeepSpeed, I get this error:

File ".../deepspeed/runtime/config.py", line 883, in _batch_assertion
    assert train_batch == micro_batch * grad_acc * self.world_size, (
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 2 != 1 * 1 * 1

When not using DeepSpeed, the same script runs fine. I am trying to run this on a machine with 2 GPUs, but I’m not sure why DeepSpeed thinks the world_size is 1. Have I specified something incorrectly? (I have num_processes: 2 in the Accelerate config.)
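
To make the failing check concrete, this is the arithmetic DeepSpeed is enforcing, with the numbers taken from the traceback above (a plain-Python sketch of the relationship, not DeepSpeed’s actual code):

# Numbers reported in the assertion error above.
train_batch = 2   # train_batch_size that DeepSpeed ends up with
micro_batch = 1   # train_micro_batch_size_per_gpu
grad_acc = 1      # gradient_accumulation_steps
world_size = 1    # what DeepSpeed believes the world size is

# DeepSpeed requires train_batch_size == micro_batch * grad_acc * world_size,
# which fails here: 2 != 1 * 1 * 1.
assert train_batch == micro_batch * grad_acc * world_size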

Below is the script I am trying to run:

from accelerate import Accelerator
from torch.utils.data import Dataset, DataLoader
from transformers import Adafactor, T5ForConditionalGeneration, T5TokenizerFast


LOAD_FROM = "google/flan-t5-large"
BATCH_SIZE = 2
TOKENIZE_MAX_LENGTH = 256

PROMPT = "Prompt"
LABELS = "Labels"


def _make_data_loader(
    tokenizer,
    tokenize_max_length,
):
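    # Dummy dataset that always returns the same (prompt, labels) pair;
    # the huge __len__ just keeps the training loop running.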
    class TestDataset(Dataset):
        def __init__(
            self,
            tokenizer,
            tokenize_max_length,
        ):
            super().__init__()
            
            self._tokenizer = tokenizer
            self._tokenize_max_length = tokenize_max_length

        def __len__(
            self,
        ):
            return 9999999
        
        def __getitem__(
            self,
            index: int,
        ):
            return PROMPT, LABELS
        
        def collate_fn(
            self,
            batch,
        ):
            prompts = [b[0] for b in batch]
            labels = [b[1] for b in batch]

            prompts_tokenized = self._tokenizer(
                text=prompts,
                padding=True,
                truncation=True,
                max_length=self._tokenize_max_length,
                return_tensors="pt",
                return_attention_mask=True,
            )
            
            labels_tokenized = self._tokenizer(
                text=labels,
                padding=True,
                truncation=True,
                max_length=self._tokenize_max_length,
                return_tensors="pt",
                return_attention_mask=True,
            )

            return prompts_tokenized, labels_tokenized
    
    dataset = TestDataset(
        tokenizer=tokenizer,
        tokenize_max_length=tokenize_max_length,
    )

    data_loader = DataLoader(
        dataset=dataset,
        shuffle=True,
        batch_size=BATCH_SIZE,
        collate_fn=dataset.collate_fn,
    )

    return data_loader


def main():
    accelerator = Accelerator()

    model = T5ForConditionalGeneration.from_pretrained(LOAD_FROM)
    tokenizer = T5TokenizerFast.from_pretrained(LOAD_FROM)

    optimizer = Adafactor(
        params=model.parameters(),
        lr=1e-5,
        scale_parameter=False,
        relative_step=False,
    )

    data_loader = _make_data_loader(
        tokenizer=tokenizer,
        tokenize_max_length=TOKENIZE_MAX_LENGTH,
    )

    model, optimizer, data_loader = accelerator.prepare(
        model, optimizer, data_loader
    )

    global_step = 1
    for batch in data_loader:
        prompts_input_ids = batch[0]["input_ids"]
        prompts_attention_mask = batch[0]["attention_mask"]
        labels_input_ids = batch[1]["input_ids"]

        loss = model(
            input_ids=prompts_input_ids,
            attention_mask=prompts_attention_mask,
            labels=labels_input_ids,
        ).loss
        
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

        loss_gathered = accelerator.gather_for_metrics(loss).mean()
        accelerator.print(f"Iteration {global_step}: loss = {loss_gathered}")

        global_step += 1


if __name__ == "__main__":
    main()

Expected behavior

DeepSpeed should be aware that the world_size is 2, and therefore that the overall (“macro”) train_batch_size of 2 is correct.
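
For comparison, the same numbers are consistent once the world size is picked up correctly (again just a sketch of the arithmetic, not DeepSpeed code):

# If DeepSpeed saw world_size = 2, the check from the traceback would pass.
train_batch = 2   # total ("macro") train_batch_size
micro_batch = 1   # per-GPU micro batch
grad_acc = 1      # gradient_accumulation_steps
world_size = 2    # two GPUs / processes
assert train_batch == micro_batch * grad_acc * world_size  # 2 == 1 * 1 * 2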

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 18 (4 by maintainers)

Most upvoted comments

Update: the fix has been merged into deepspeed@main. Use that if you want it working right away; otherwise, please wait for the 0.9.1 release.

I am still getting this issue with deepspeed 0.13.2

It fixes the issue; checked on my side.

Yes, sorry for forgetting to mention!