accelerate: Batch size computation wrong when using DeepSpeed?

System Info

- `Accelerate` version: 0.18.0
- Platform: Linux-5.15.0-1031-gcp-x86_64-with-glibc2.35
- Python version: 3.8.16
- Numpy version: 1.24.2
- PyTorch version (GPU?): 2.0.0+cu117 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero_stage': 2}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Hi,

Apologies if I’m missing something obvious, but when running the following script using Accelerate with DeepSpeed, I get this error:

File ".../deepspeed/runtime/config.py", line 883, in _batch_assertion
    assert train_batch == micro_batch * grad_acc * self.world_size, (
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 2 != 1 * 1 * 1

When not using DeepSpeed, the same script runs fine. I am trying to run this on a machine with 2 GPUs, but I’m not sure why DeepSpeed thinks the world_size is 1. Have I specified something incorrectly? (I have num_processes: 2 in the Accelerate config.)
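
To make the failing check concrete, this is the arithmetic DeepSpeed is enforcing, with the numbers taken from the traceback above (a plain-Python sketch of the relationship, not DeepSpeed’s actual code):

# Numbers reported in the assertion error above.
train_batch = 2   # train_batch_size that DeepSpeed ends up with
micro_batch = 1   # train_micro_batch_size_per_gpu
grad_acc = 1      # gradient_accumulation_steps
world_size = 1    # what DeepSpeed believes the world size is

# DeepSpeed requires train_batch_size == micro_batch * grad_acc * world_size,
# which fails here: 2 != 1 * 1 * 1.
assert train_batch == micro_batch * grad_acc * world_size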

Below is the script I am trying to run:

from accelerate import Accelerator
from torch.utils.data import Dataset, DataLoader
from transformers import Adafactor, T5ForConditionalGeneration, T5TokenizerFast


LOAD_FROM = "google/flan-t5-large"
BATCH_SIZE = 2
TOKENIZE_MAX_LENGTH = 256

PROMPT = "Prompt"
LABELS = "Labels"


def _make_data_loader(
    tokenizer,
    tokenize_max_length,
):
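    # Dummy dataset that always returns the same (prompt, labels) pair;
    # the huge __len__ just keeps the training loop running.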
    class TestDataset(Dataset):
        def __init__(
            self,
            tokenizer,
            tokenize_max_length,
        ):
            super().__init__()
            
            self._tokenizer = tokenizer
            self._tokenize_max_length = tokenize_max_length

        def __len__(
            self,
        ):
            return 9999999
        
        def __getitem__(
            self,
            index: int,
        ):
            return PROMPT, LABELS
        
        def collate_fn(
            self,
            batch,
        ):
            prompts = [b[0] for b in batch]
            labels = [b[1] for b in batch]

            prompts_tokenized = self._tokenizer(
                text=prompts,
                padding=True,
                truncation=True,
                max_length=self._tokenize_max_length,
                return_tensors="pt",
                return_attention_mask=True,
            )
            
            labels_tokenized = self._tokenizer(
                text=labels,
                padding=True,
                truncation=True,
                max_length=self._tokenize_max_length,
                return_tensors="pt",
                return_attention_mask=True,
            )

            return prompts_tokenized, labels_tokenized
    
    dataset = TestDataset(
        tokenizer=tokenizer,
        tokenize_max_length=tokenize_max_length,
    )

    data_loader = DataLoader(
        dataset=dataset,
        shuffle=True,
        batch_size=BATCH_SIZE,
        collate_fn=dataset.collate_fn,
    )

    return data_loader


def main():
    accelerator = Accelerator()

    model = T5ForConditionalGeneration.from_pretrained(LOAD_FROM)
    tokenizer = T5TokenizerFast.from_pretrained(LOAD_FROM)

    optimizer = Adafactor(
        params=model.parameters(),
        lr=1e-5,
        scale_parameter=False,
        relative_step=False,
    )

    data_loader = _make_data_loader(
        tokenizer=tokenizer,
        tokenize_max_length=TOKENIZE_MAX_LENGTH,
    )

    model, optimizer, data_loader = accelerator.prepare(
        model, optimizer, data_loader
    )

    global_step = 1
    for batch in data_loader:
        prompts_input_ids = batch[0]["input_ids"]
        prompts_attention_mask = batch[0]["attention_mask"]
        labels_input_ids = batch[1]["input_ids"]

        loss = model(
            input_ids=prompts_input_ids,
            attention_mask=prompts_attention_mask,
            labels=labels_input_ids,
        ).loss
        
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

        loss_gathered = accelerator.gather_for_metrics(loss).mean()
        accelerator.print(f"Iteration {global_step}: loss = {loss_gathered}")

        global_step += 1


if __name__ == "__main__":
    main()

Expected behavior

DeepSpeed should be aware that the world_size is 2, and therefore that the overall (“macro”) train_batch_size of 2 is correct.
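
For comparison, the same numbers are consistent once the world size is picked up correctly (again just a sketch of the arithmetic, not DeepSpeed code):

# If DeepSpeed saw world_size = 2, the check from the traceback would pass.
train_batch = 2   # total ("macro") train_batch_size
micro_batch = 1   # per-GPU micro batch
grad_acc = 1      # gradient_accumulation_steps
world_size = 2    # two GPUs / processes
assert train_batch == micro_batch * grad_acc * world_size  # 2 == 1 * 1 * 2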

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 18 (4 by maintainers)

Most upvoted comments

Update: the fix has been merged into deepspeed@main. Use that if you want it working right away; otherwise, please wait for the 0.9.1 release.

I am still getting this issue with deepspeed 0.13.2

It fixes the issue; checked on my side.

Yes, sorry for forgetting to mention!