accelerate: Incorrect `num_warmup_steps` for `lr_scheduler` in multi-GPU training

System Info

- `Accelerate` version: 0.10.0
- Platform: Linux-3.10.0_3-0-0-12-x86_64-with-centos-6.3-Final
- Python version: 3.7.12
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.7.1 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - main_process_ip: None
        - main_process_port: None
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

https://github.com/huggingface/transformers/blob/f2fbe4475386bfcfb3b83d0a3223ba216a3c3a91/examples/pytorch/translation/run_translation_no_trainer.py#L533

# define lr scheduler
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=args.warmup_steps,
    num_training_steps=args.max_train_steps,
)

...

if step % args.gradient_accumulation_steps == 0:
    optimizer.step()
    lr_scheduler.step()  # update the lr scheduler every `gradient_accumulation_steps`
    optimizer.zero_grad()
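
For completeness, the linked script also passes the scheduler through accelerator.prepare(), which is where the behaviour in question comes from. Below is a minimal, self-contained sketch of that setup (a toy model and hard-coded numbers stand in for the script's objects, so this is an illustration rather than the actual script; launch it with something like accelerate launch --num_processes 8):

# Minimal sketch: toy model and hard-coded numbers instead of the real script.
import torch
from accelerate import Accelerator
from transformers import get_scheduler

accelerator = Accelerator()
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=80,      # stands in for args.warmup_steps
    num_training_steps=1000,  # stands in for args.max_train_steps
)

# Passing the scheduler to `prepare` wraps it in Accelerate's scheduler wrapper;
# with the default config, each call to the wrapped lr_scheduler.step() advances
# the underlying schedule once per process.
model, optimizer, lr_scheduler = accelerator.prepare(model, optimizer, lr_scheduler)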

Expected behavior

Does Accelerate take the number of processes into account for num_warmup_steps? Suppose we set args.warmup_steps=80 and train on a single 8-GPU machine: the linear learning rate then peaks at optimizer step 10 (i.e., 80/8) rather than at the expected step 80.
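
A quick, CPU-only check of that arithmetic (this is a simulation, not Accelerate itself: the inner loop mimics the num_processes-fold stepping performed by the prepared scheduler):

import torch
from transformers import get_scheduler

num_processes, warmup_steps, max_train_steps = 8, 80, 1000
optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=5e-5)
scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=max_train_steps,
)

for opt_step in range(1, 16):
    optimizer.step()
    for _ in range(num_processes):  # the wrapped scheduler steps this many times per call
        scheduler.step()
    print(opt_step, scheduler.get_last_lr()[0])
# The learning rate reaches its peak (5e-5) at optimizer step 10, not 80.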

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 19

Most upvoted comments

According to the design of Accelerate's scheduler,

https://github.com/huggingface/accelerate/blob/d0f5f4a630bda69dcf89cc6d55f93c71f2af7a0d/src/accelerate/scheduler.py#L70

is it correct to set num_warmup_steps to warmup_steps * num_processes, or should the lr_scheduler simply not be passed to prepare?
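
For reference, here is a sketch of the two options asked about, reusing the names from the Reproduction snippet above (args, model, optimizer, and accelerator are assumed to exist there); this is not quoted from the docs, just what the linked scheduler design implies:

from transformers import get_scheduler

# Option 1: keep preparing the scheduler, but size it in "scheduler steps",
# i.e. multiply both counts by the number of processes so that warmup still
# spans args.warmup_steps optimizer steps.
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=args.warmup_steps * accelerator.num_processes,
    num_training_steps=args.max_train_steps * accelerator.num_processes,
)
model, optimizer, lr_scheduler = accelerator.prepare(model, optimizer, lr_scheduler)

# Option 2: leave the scheduler out of `prepare`; it then advances exactly one
# step per lr_scheduler.step() call, and args.warmup_steps keeps its usual
# meaning of plain optimizer steps on each process.
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=args.warmup_steps,
    num_training_steps=args.max_train_steps,
)
model, optimizer = accelerator.prepare(model, optimizer)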

Hello @cyk1337, the code at the link you provided effectively reaches args.max_train_steps // num_gpus because the prepared scheduler is stepping num_processes times per iteration, i.e., num_gpus times per iteration.

I didn't understand what the query was in the case of not preparing the lr_scheduler. As per the original question, it is logical for the warmup steps to be reduced in a multi-device scenario.
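
One way to see why the reduction can be considered logical: with the same per-device batch size, one optimizer step on 8 GPUs consumes 8 times as many samples, so a 10-step warmup on 8 GPUs covers the same amount of data as an 80-step warmup on a single GPU. A tiny illustration with hypothetical numbers:

# Hypothetical numbers, purely to illustrate the argument above.
per_device_batch_size = 32
num_processes = 8
warmup_steps_single_gpu = 80
warmup_steps_multi_gpu = warmup_steps_single_gpu // num_processes  # 10

samples_single = warmup_steps_single_gpu * per_device_batch_size                    # 2560
samples_multi = warmup_steps_multi_gpu * per_device_batch_size * num_processes      # 2560
assert samples_single == samples_multi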