DeepSpeed: [BUG] batch_size check failed with zero 2 (deepspeed v0.9.0)
Describe the bug
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 16 != 2 * 1 * 1
This error only occurs with deepspeed v0.9.0 and ZeRO stage 2. My code trains normally with deepspeed v0.9.0 + ZeRO stage 3, or with deepspeed v0.8.3 + ZeRO stage 2.
Traceback
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/alpaca-lora/finetune_accelerate.py:520 in <module> │
│ │
│ 517 │
│ 518 │
│ 519 if __name__ == "__main__": │
│ ❱ 520 │ main() │
│ 521 │
│ │
│ /data/alpaca-lora/finetune_accelerate.py:347 in main │
│ │
│ 344 │ │ eval_dataloader, │
│ 345 │ │ optimizer, │
│ 346 │ │ lr_scheduler, │
│ ❱ 347 │ ) = accelerator.prepare( │
│ 348 │ │ model, train_dataloader, eval_dataloader, optimizer, lr_scheduler │
│ 349 │ ) │
│ 350 │ accelerator.print(model) │
│ │
│ /home/chenmingrui/.local/lib/python3.10/site-packages/accelerate/accelerator.py:1118 in prepare │
│ │
│ 1115 │ │ │ old_named_params = self._get_named_parameters(*args) │
│ 1116 │ │ │
│ 1117 │ │ if self.distributed_type == DistributedType.DEEPSPEED: │
│ ❱ 1118 │ │ │ result = self._prepare_deepspeed(*args) │
│ 1119 │ │ elif self.distributed_type == DistributedType.MEGATRON_LM: │
│ 1120 │ │ │ result = self._prepare_megatron_lm(*args) │
│ 1121 │ │ else: │
│ │
│ /home/chenmingrui/.local/lib/python3.10/site-packages/accelerate/accelerator.py:1415 in │
│ _prepare_deepspeed │
│ │
│ 1412 │ │ │ │ │ │ if type(scheduler).__name__ in deepspeed.runtime.lr_schedules.VA │
│ 1413 │ │ │ │ │ │ │ kwargs["lr_scheduler"] = scheduler │
│ 1414 │ │ │ │
│ ❱ 1415 │ │ │ engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs) │
│ 1416 │ │ │ if optimizer is not None: │
│ 1417 │ │ │ │ optimizer = DeepSpeedOptimizerWrapper(optimizer) │
│ 1418 │ │ │ if scheduler is not None: │
│ │
│ /home/chenmingrui/.local/lib/python3.10/site-packages/deepspeed/__init__.py:142 in initialize
│ │
│ 139 │ assert config != None, "DeepSpeed requires --deepspeed_config to specify configurati │
│ 140 │ │
│ 141 │ if not isinstance(model, PipelineModule): │
│ ❱ 142 │ │ config_class = DeepSpeedConfig(config, mpu) │
│ 143 │ │ if config_class.hybrid_engine.enabled: │
│ 144 │ │ │ engine = DeepSpeedHybridEngine(args=args, │
│ 145 │ │ │ │ │ │ │ │ │ │ model=model, │
│ │
│ /home/chenmingrui/.local/lib/python3.10/site-packages/deepspeed/runtime/config.py:764 in │
│ __init__ │
│ │
│ 761 │ │ │
│ 762 │ │ # Pass a copy so that user json is unmodified, e.g. for logging │
│ 763 │ │ self._initialize_params(copy.copy(self._param_dict)) │
│ ❱ 764 │ │ self._configure_train_batch_size() │
│ 765 │ │ self._do_sanity_check() │
│ 766 │ │
│ 767 │ def _initialize_params(self, param_dict): │
│ │
│ /home/chenmingrui/.local/lib/python3.10/site-packages/deepspeed/runtime/config.py:935 in │
│ _configure_train_batch_size │
│ │
│ 932 │ │
│ 933 │ def _configure_train_batch_size(self): │
│ 934 │ │ self._set_batch_related_parameters() │
│ ❱ 935 │ │ self._batch_assertion() │
│ 936 │ │
│ 937 │ def _do_sanity_check(self): │
│ 938 │ │ self._do_error_check() │
│ │
│ /home/chenmingrui/.local/lib/python3.10/site-packages/deepspeed/runtime/config.py:883 in │
│ _batch_assertion │
│ │
│ 880 │ │ │
│ 881 │ │ assert (grad_acc > 0), f"Gradient accumulation steps: {grad_acc} has to be great │
│ 882 │ │ │
│ ❱ 883 │ │ assert train_batch == micro_batch * grad_acc * self.world_size, ( │
│ 884 │ │ │ f"Check batch related parameters. train_batch_size is not equal " │
│ 885 │ │ │ "to micro_batch_per_gpu * gradient_acc_step * world_size " │
│ 886 │ │ │ f"{train_batch} != {micro_batch} * {grad_acc} * {self.world_size}") │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 16 != 2 * 1 * 1
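The failing check is the consistency invariant train_batch_size == train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size. A minimal sketch of that arithmetic with the values from the error message (illustrative only, not DeepSpeed's actual code; the world_size of 1 is the stale value investigated later in the thread):

# Illustrative reproduction of the failing invariant, not DeepSpeed's code.
train_batch_size = 16            # "auto", resolved as 2 * 1 * 8 for the 8 launched processes
micro_batch_per_gpu = 2          # "auto", resolved from the per-device dataloader batch size
gradient_accumulation_steps = 1  # explicit in the DeepSpeed config below
world_size = 1                   # what DeepSpeedConfig sees at check time; should be 8

# Raises the same "16 != 2 * 1 * 1" AssertionError shown above.
assert train_batch_size == micro_batch_per_gpu * gradient_accumulation_steps * world_size, (
    f"Check batch related parameters. train_batch_size is not equal to "
    f"micro_batch_per_gpu * gradient_acc_step * world_size "
    f"{train_batch_size} != {micro_batch_per_gpu} * {gradient_accumulation_steps} * {world_size}"
)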
System info (please complete the following information):
- OS: Ubuntu 22.04
- GPU count and types: not specified
- Python version: 3.10
- deepspeed: 0.9.0
- accelerate: 0.18.0
Launcher context:
accelerate launch --config_file $ACCELERATE_CONFIG_NAME --deepspeed_config_file $DEEPSPEED_CONFIG_NAME --num_processes 8 --zero3_init_flag true alpaca-lora/finetune_accelerate.py
Additional context
DeepSpeed config:
{
  "fp16": {
    "enabled": true,
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": false
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 205520896,
    "stage3_prefetch_bucket_size": 184968807,
    "stage3_param_persistence_threshold": 143360,
    "sub_group_size": 1e9,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "steps_per_print": 100,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": 1,
  "wall_clock_breakdown": false
}
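For reference, the two "auto" entries are resolved by Accelerate from the training script and launcher settings before the config reaches DeepSpeed. A rough sketch of that resolution for this setup (assumed values, not Accelerate's actual code):

import os

# Assumed resolution of the "auto" batch fields (illustrative, not Accelerate's code).
per_device_batch_size = 2                            # the script's per-device DataLoader batch size
gradient_accumulation_steps = 1                      # taken from the JSON above
world_size = int(os.environ.get("WORLD_SIZE", "8"))  # 8 processes from accelerate launch

resolved = {
    "train_micro_batch_size_per_gpu": per_device_batch_size,
    "gradient_accumulation_steps": gradient_accumulation_steps,
    # 2 * 1 * 8 = 16, the left-hand side of the failing assertion
    "train_batch_size": per_device_batch_size * gradient_accumulation_steps * world_size,
}
print(resolved)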
accelerate config:
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
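For reference, a minimal repro sketch of the prepare() call that fails in the traceback, using toy stand-ins for the model, optimizer, and data (the real script is alpaca-lora/finetune_accelerate.py with an Alpaca-LoRA model), launched with the accelerate command above:

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Toy placeholders standing in for the real fine-tuning setup.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
train_dataloader = DataLoader(dataset, batch_size=2)  # per-device micro batch of 2
eval_dataloader = DataLoader(dataset, batch_size=2)

accelerator = Accelerator()  # DeepSpeed plugin comes from the accelerate launch environment
(
    model,
    train_dataloader,
    eval_dataloader,
    optimizer,
    lr_scheduler,
) = accelerator.prepare(model, train_dataloader, eval_dataloader, optimizer, lr_scheduler)
# With deepspeed==0.9.0 and ZeRO stage 2, this prepare() call is where the
# batch-size assertion in the traceback above is raised.

With deepspeed v0.8.3, or with ZeRO stage 3 on v0.9.0, the same call completes, matching the behaviour described at the top of the report.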
About this issue
- State: closed
- Created a year ago
- Comments: 23 (6 by maintainers)
I have investigated the bug, and here are my findings:
- deepspeed.initialize creates a DeepSpeedConfig and then passes it to the DeepSpeedEngine.
- The DeepSpeedEngine initializes DeepSpeed communications through deepspeed.dist.init_distributed.

The error is caused by DeepSpeedConfig expecting init_distributed to have already been called, but that only happens in the next step. This always raises an exception here, which means that the world_size is always set to 1, no matter what.

@Yard1 thank you for investigating this. This is related to recent changes we made for the release of DeepSpeed-Chat. I'll find a solution and get that merged soon. In the meantime, please use deepspeed<0.9.0.
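A minimal sketch of the fallback described above (illustrative only, not DeepSpeed's actual implementation): if the process group has not been initialized by the time the config is validated, the queried world size silently defaults to 1, which matches the 16 != 2 * 1 * 1 mismatch in the traceback.

import torch.distributed as dist

def world_size_seen_by_config() -> int:
    # Illustrative fallback: with no initialized process group there is nothing
    # to query, so the world size defaults to 1 even on an 8-GPU launch.
    if dist.is_available() and dist.is_initialized():
        return dist.get_world_size()
    return 1

# Before init_distributed runs this prints 1, so train_batch_size (16) can never
# equal micro_batch (2) * grad_acc (1) * world_size (1).
print(world_size_seen_by_config())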
I am hitting the same assertion on the latest version:
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 10 != 1 * 1 * 5
I am also getting a wrong self.world_size when using ZeRO stage 3, while ZeRO stage 2 works fine in the same environment (8x A100).
I have deepspeed==0.11.1, transformers==4.34.1, accelerate==0.24.0 and ray==2.7.1.
Thank you @Yard1, I will test on my end. The PR is #3324 if you would like to try it as well.