transformers: Error on loading saved optimizer after training (zero-3)

System Info

Platform: Ubuntu 18.04.1
python3.8.0
cuda-11.3
torch==1.11.0+cu113 (GPU)
transformers==4.18.0
deepspeed==0.6.3
huggingface_hub version: 0.5.1

Who can help?

@sgugger, @patrickvonplaten

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Use any script that trains and evaluates a model with the Hugging Face Trainer (e.g. https://github.com/ElementAI/picard/blob/main/seq2seq/run_seq2seq.py#L216-L267). Set save_steps=1 in the config, train for a few epochs, and evaluate. After training completes, an error is raised while the Trainer loads the best checkpoint and re-initializes the DeepSpeed optimizer.

OPTIMIZER USED: adafactor (the issue also occurs with adamw_hf and adamw_torch)
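The setup described above can be sketched as a set of Trainer arguments written to a JSON file. This is only an illustration: save_steps=1, the optimizer name, and the ./output directory come from the report, and load_best_model_at_end is inferred from the "Loading best model" line in the log. The epoch count, the evaluation settings, the config file name ds_config_zero3.json, and the helper write_args are assumptions.

```python
import json

# Only save_steps, optim and output_dir are taken from the report.
# load_best_model_at_end is inferred from the log; the rest is assumed.
training_args = {
    "output_dir": "./output",
    "do_train": True,
    "do_eval": True,
    "num_train_epochs": 3,              # assumed: "a few epochs"
    "evaluation_strategy": "steps",     # assumed: evaluate as often as we save
    "eval_steps": 1,
    "save_strategy": "steps",
    "save_steps": 1,                    # from the report
    "load_best_model_at_end": True,     # inferred from "Loading best model from ..."
    "optim": "adafactor",               # also reported with adamw_hf, adamw_torch
    "deepspeed": "ds_config_zero3.json",  # hypothetical file name for the ZeRO-3 config
}


def write_args(path, args):
    """Write the argument dictionary to a JSON file and return the path."""
    with open(path, "w") as f:
        json.dump(args, f, indent=2)
    return path
```

A file written this way could be passed to a training script that reads its arguments from JSON, for example through HfArgumentParser.parse_json_file. Whether a given script does so is an assumption to check against that script.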

ZeRO-3 config (the same as the one on the HF page)

{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
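Several values in this config are the string "auto", which the Hugging Face DeepSpeed integration fills in from the Trainer's own arguments before DeepSpeed sees the config. As a rough sketch, the snippet below parses the config as strict JSON and lists those entries. The helper name find_auto_keys is hypothetical and not part of either library.

```python
import json

# The config from the report, written as strict JSON.
DS_CONFIG = """
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": true},
        "offload_param": {"device": "cpu", "pin_memory": true},
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
"""


def find_auto_keys(node, prefix=""):
    """Return dotted paths of all entries whose value is the string "auto"."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if value == "auto":
                found.append(path)
            else:
                found.extend(find_auto_keys(value, path))
    return found


config = json.loads(DS_CONFIG)
auto_keys = find_auto_keys(config)
```

Printing auto_keys gives a quick view of which settings depend on the Trainer arguments rather than on this file.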

Traceback (every rank prints the same trace; one copy is shown):

Training completed. Do not forget to share your model on huggingface.co/models =)

Loading best model from ./output/checkpoint-12 (score: 70.1).
[2022-05-06 14:15:47,319] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.6.3, git-hash=unknown, git-branch=unknown
[2022-05-06 14:15:47,323] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False
[2022-05-06 14:15:47,323] [INFO] [engine.py:1042:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2022-05-06 14:15:47,323] [INFO] [engine.py:1048:_configure_optimizer] Using client Optimizer as basic optimizer
[2022-05-06 14:15:47,324] [INFO] [engine.py:1064:_configure_optimizer] DeepSpeed Basic Optimizer = Adafactor
[2022-05-06 14:15:47,324] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=Adafactor type=<class 'transformers.optimization.Adafactor'>
[2022-05-06 14:15:47,324] [WARNING] [engine.py:1077:_configure_optimizer] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2022-05-06 14:15:47,324] [INFO] [logging.py:69:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
[2022-05-06 14:15:47,324] [INFO] [engine.py:1362:_configure_zero_optimizer] Initializing ZeRO Stage 3
Using /home/user/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
[2022-05-06 14:15:47,325] [INFO] [stage3.py:273:__init__] Reduce bucket size 262144
[2022-05-06 14:15:47,325] [INFO] [stage3.py:274:__init__] Allgather bucket size 235929.6
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0006072521209716797 seconds
Traceback (most recent call last):
  File "train.py", line 227, in <module>
    main()
  File "train.py", line 176, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/user/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1546, in train
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_reinit(self)
  File "/home/user/venv/lib/python3.8/site-packages/transformers/deepspeed.py", line 374, in deepspeed_reinit
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**trainer.deepspeed_initialize_kwargs)
  File "/home/user/venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/user/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 294, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/user/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1080, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/user/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1365, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/home/user/venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 293, in __init__
    self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
IndexError: list index out of range

Expected behavior

The code should be able to reload the optimizer without errors.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 18 (11 by maintainers)

Most upvoted comments

Perfect. Thank you for validating, @base-y

I will merge the HF PR once the Deepspeed merges their side and makes a new release.

cc: @tjruwase

Hey, sorry for the delayed response. Sure, I will install huggingface and deepspeed locally from the PR branches and check if it works asap.

super! that’s very helpful, @base-y

I’m able to reproduce the failure:

$ CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.run --nproc_per_node=1 --master_addr='127.0.0.1' --master_port=9901 test.py
[...]
Traceback (most recent call last):
  File "test.py", line 129, in <module>
    trainer.train()
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/trainer.py", line 1545, in train
    self._load_best_model()
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/trainer.py", line 1608, in _load_best_model
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_reinit(self)
  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/deepspeed.py", line 374, in deepspeed_reinit
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**trainer.deepspeed_initialize_kwargs)
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 295, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 1081, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 1366, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/runtime/zero/stage3.py", line 610, in __init__
    self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
IndexError: list index out of range

I will try to analyze this later today or tomorrow.
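
The expression that fails in the traceback above, self.optimizer.param_groups[0]['params'][0].dtype, indexes two lists in a row, so it raises IndexError if either param_groups or the first group's params list is empty. Which of the two was the case here is not established in this thread. The sketch below reproduces the same error message with plain Python stand-ins. FakeParam and first_param_dtype are hypothetical names for illustration, not DeepSpeed code or its fix.

```python
class FakeParam:
    """Stand-in for a torch parameter; only the dtype attribute matters here."""

    def __init__(self, dtype):
        self.dtype = dtype


def first_param_dtype(param_groups):
    """Return the dtype of the first parameter found, or None if all groups are empty."""
    for group in param_groups:
        for param in group.get("params", []):
            return param.dtype
    return None


# A single param group with no parameters, mimicking an optimizer whose
# groups were emptied before the ZeRO-3 wrapper inspected them (assumption).
empty_groups = [{"params": [], "lr": 1e-3}]

try:
    # Same indexing pattern as the failing line in stage3.py.
    empty_groups[0]["params"][0].dtype
except IndexError as err:
    message = str(err)

# The defensive lookup skips empty groups instead of raising.
filled_groups = [{"params": []}, {"params": [FakeParam("float16")]}]
dtype = first_param_dtype(filled_groups)
```

Inspecting optimizer.param_groups just before deepspeed.initialize is called a second time would be one way to check whether the groups are empty at that point.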