accelerate: Batch size computation wrong when using DeepSpeed?
System Info
- `Accelerate` version: 0.18.0
- Platform: Linux-5.15.0-1031-gcp-x86_64-with-glibc2.35
- Python version: 3.8.16
- Numpy version: 1.24.2
- PyTorch version (GPU?): 2.0.0+cu117 (True)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: no
- use_cpu: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero_stage': 2}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the `examples/` folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
Hi,
Apologies if I’m missing something obvious, but when running the following script using Accelerate with DeepSpeed, I get the following error:
File ".../deepspeed/runtime/config.py", line 883, in _batch_assertion
assert train_batch == micro_batch * grad_acc * self.world_size, (
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 2 != 1 * 1 * 1
When not using DeepSpeed, the same script runs fine. I am trying to run this on a machine with 2 GPUs, but I’m not sure why DeepSpeed thinks the world_size is 1. Have I specified something incorrectly? (I have num_processes: 2 in the Accelerate config.)
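For context, the assertion enforces the identity train_batch_size == micro_batch_per_gpu * gradient_accumulation_steps * world_size. Below is a minimal sketch of that check with the numbers taken verbatim from the error message (the variable names here are illustrative, not DeepSpeed's internals):

```python
# Illustrative sketch of the identity DeepSpeed asserts, using the numbers
# printed in the error above. The mismatch comes from world_size being seen
# as 1 rather than the expected 2.
train_batch_size = 2             # "train_batch" in the assertion message
micro_batch_per_gpu = 1          # "micro_batch"
gradient_accumulation_steps = 1  # "grad_acc", matches the Accelerate config above
world_size = 1                   # what DeepSpeed sees at config-validation time

assert train_batch_size == micro_batch_per_gpu * gradient_accumulation_steps * world_size, (
    f"{train_batch_size} != {micro_batch_per_gpu} * "
    f"{gradient_accumulation_steps} * {world_size}"
)
```

Running this reproduces the 2 != 1 * 1 * 1 failure; with the expected world_size of 2 the identity would hold (2 == 1 * 1 * 2), which is what the Expected behavior section below describes.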
Below is the script I am trying to run:
from accelerate import Accelerator
from torch.utils.data import Dataset, DataLoader
from transformers import Adafactor, T5ForConditionalGeneration, T5TokenizerFast

LOAD_FROM = "google/flan-t5-large"
BATCH_SIZE = 2
TOKENIZE_MAX_LENGTH = 256
PROMPT = "Prompt"
LABELS = "Labels"


def _make_data_loader(
    tokenizer,
    tokenize_max_length,
):
    class TestDataset(Dataset):
        def __init__(
            self,
            tokenizer,
            tokenize_max_length,
        ):
            super().__init__()
            self._tokenizer = tokenizer
            self._tokenize_max_length = tokenize_max_length

        def __len__(
            self,
        ):
            return 9999999

        def __getitem__(
            self,
            index: int,
        ):
            return PROMPT, LABELS

        def collate_fn(
            self,
            batch,
        ):
            prompts = [b[0] for b in batch]
            labels = [b[1] for b in batch]
            prompts_tokenized = self._tokenizer(
                text=prompts,
                padding=True,
                truncation=True,
                max_length=self._tokenize_max_length,
                return_tensors="pt",
                return_attention_mask=True,
            )
            labels_tokenized = self._tokenizer(
                text=labels,
                padding=True,
                truncation=True,
                max_length=self._tokenize_max_length,
                return_tensors="pt",
                return_attention_mask=True,
            )
            return prompts_tokenized, labels_tokenized

    dataset = TestDataset(
        tokenizer=tokenizer,
        tokenize_max_length=tokenize_max_length,
    )
    data_loader = DataLoader(
        dataset=dataset,
        shuffle=True,
        batch_size=BATCH_SIZE,
        collate_fn=dataset.collate_fn,
    )
    return data_loader


def main():
    accelerator = Accelerator()
    model = T5ForConditionalGeneration.from_pretrained(LOAD_FROM)
    tokenizer = T5TokenizerFast.from_pretrained(LOAD_FROM)
    optimizer = Adafactor(
        params=model.parameters(),
        lr=1e-5,
        scale_parameter=False,
        relative_step=False,
    )
    data_loader = _make_data_loader(
        tokenizer=tokenizer,
        tokenize_max_length=TOKENIZE_MAX_LENGTH,
    )
    model, optimizer, data_loader = accelerator.prepare(
        model, optimizer, data_loader
    )

    global_step = 1
    for batch in data_loader:
        prompts_input_ids = batch[0]["input_ids"]
        prompts_attention_mask = batch[0]["attention_mask"]
        labels_input_ids = batch[1]["input_ids"]
        loss = model(
            input_ids=prompts_input_ids,
            attention_mask=prompts_attention_mask,
            labels=labels_input_ids,
        ).loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        loss_gathered = accelerator.gather_for_metrics(loss).mean()
        accelerator.print(f"Iteration {global_step}: loss = {loss_gathered}")
        global_step += 1


if __name__ == "__main__":
    main()
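For debugging, a small diagnostic can be added right after constructing the Accelerator (and before accelerator.prepare(...), since the assertion appears to be raised from inside it) to confirm the world size each layer sees. This is just a sketch using the standard Accelerate and torch.distributed APIs:

```python
# Diagnostic sketch (not part of the repro above): print the world size as seen
# by Accelerate and by torch.distributed. On a 2-GPU run with num_processes: 2,
# both should report 2.
import torch.distributed as dist

accelerator.print(f"accelerator.num_processes = {accelerator.num_processes}")
if dist.is_available() and dist.is_initialized():
    accelerator.print(f"torch.distributed.get_world_size() = {dist.get_world_size()}")
```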
Expected behavior
DeepSpeed should be aware that the world_size is 2, and therefore that the “macro” train_batch_size of 2 is correct.
About this issue
- State: closed
- Created a year ago
- Comments: 18 (4 by maintainers)
Update: the fix has been merged into deepspeed@main. Install from main if you want it working right away; otherwise, please wait for the 0.9.1 release.
I am still getting this issue with deepspeed 0.13.2
It fixes the issue; checked on my side.
Yes, sorry for forgetting to mention!