transformers: Error on loading saved optimizer after training (zero-3)
System Info
Platform: Ubuntu 18.04.1
python3.8.0
cuda-11.3
torch==1.11.0+cu113 (GPU)
transformers==4.18.0
deepspeed==0.6.3
huggingface_hub version: 0.5.1
Information
- The official example scripts
 - My own modified scripts
 
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
 - My own task or dataset (give details below)
 
Reproduction
Use any code that trains and evaluates a model with the Hugging Face Trainer (e.g., https://github.com/ElementAI/picard/blob/main/seq2seq/run_seq2seq.py#L216-L267). Set save_steps=1 in the config, train for a few epochs, and evaluate. An error is thrown after training, when the Trainer loads the best checkpoint and re-initializes DeepSpeed and the optimizer.
OPTIMIZER USED: adafactor (issue also occurs with adamw_hf, adamw_torch)
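For reference, the settings described above might look roughly like the sketch below. Only `save_steps=1`, the optimizer names, and the `./output` directory come from this report. The remaining values and the config file name `ds_config_zero3.json` are assumptions, and `load_best_model_at_end=True` is inferred from the "Loading best model" line in the log.

```python
# Hypothetical reproduction settings, written as a plain dict of keyword arguments.
# Only save_steps=1, the optimizer names, and ./output are taken from the report;
# everything else is an assumed placeholder.
repro_args = {
    "output_dir": "./output",
    "evaluation_strategy": "steps",
    "eval_steps": 1,
    "save_strategy": "steps",
    "save_steps": 1,
    "load_best_model_at_end": True,  # inferred from the "Loading best model" log line
    "optim": "adafactor",  # also reported with "adamw_hf" and "adamw_torch"
    "deepspeed": "ds_config_zero3.json",  # assumed file name for the ZeRO-3 config
}

# With transformers installed, these would typically be passed as
# TrainingArguments(**repro_args) or Seq2SeqTrainingArguments(**repro_args).
print(sorted(repro_args))
```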
ZeRO-3 config (the same one as on the HF page):
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
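A config like this has to be strict JSON, and Python's `json` module rejects things such as a trailing comma after the last key. A minimal sketch of a pre-flight check follows; the helper name `load_ds_config` is hypothetical and not part of DeepSpeed or transformers.

```python
import json


def load_ds_config(text: str) -> dict:
    """Parse a DeepSpeed config string as strict JSON and report where it fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # Re-raise with the position so a stray comma is easy to locate.
        raise ValueError(
            f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
        ) from err


# A well-formed fragment parses into a normal dict.
cfg = load_ds_config('{"zero_optimization": {"stage": 3}, "train_batch_size": "auto"}')
print(cfg["zero_optimization"]["stage"])

# A trailing comma after the last entry is rejected.
try:
    load_ds_config('{"train_batch_size": "auto",}')
except ValueError as err:
    print(err)
```

Running the check on the config file before launching avoids discovering a syntax problem only after every rank has started.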
Traceback (identical on all ranks; output from a single rank shown):
Training completed. Do not forget to share your model on huggingface.co/models =)
Loading best model from ./output/checkpoint-12 (score: 70.1).
[2022-05-06 14:15:47,319] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.6.3, git-hash=unknown, git-branch=unknown
[2022-05-06 14:15:47,323] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False
[2022-05-06 14:15:47,323] [INFO] [engine.py:1042:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2022-05-06 14:15:47,323] [INFO] [engine.py:1048:_configure_optimizer] Using client Optimizer as basic optimizer
[2022-05-06 14:15:47,324] [INFO] [engine.py:1064:_configure_optimizer] DeepSpeed Basic Optimizer = Adafactor
[2022-05-06 14:15:47,324] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=Adafactor type=<class 'transformers.optimization.Adafactor'>
[2022-05-06 14:15:47,324] [WARNING] [engine.py:1077:_configure_optimizer] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2022-05-06 14:15:47,324] [INFO] [logging.py:69:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
[2022-05-06 14:15:47,324] [INFO] [engine.py:1362:_configure_zero_optimizer] Initializing ZeRO Stage 3
Using /home/user/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
[2022-05-06 14:15:47,325] [INFO] [stage3.py:273:__init__] Reduce bucket size 262144
[2022-05-06 14:15:47,325] [INFO] [stage3.py:274:__init__] Allgather bucket size 235929.6
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0006072521209716797 seconds
Traceback (most recent call last):
  File "train.py", line 227, in <module>
    main()
  File "train.py", line 176, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/user/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1546, in train
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_reinit(self)
  File "/home/user/venv/lib/python3.8/site-packages/transformers/deepspeed.py", line 374, in deepspeed_reinit
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**trainer.deepspeed_initialize_kwargs)
  File "/home/user/venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/user/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 294, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/user/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1080, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/user/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1365, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/home/user/venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 293, in __init__
    self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
IndexError: list index out of range
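The failing expression indexes the first parameter of the first param group. A minimal sketch, using plain lists and dicts in place of a real optimizer, shows how an empty `params` list produces this error. The assumption that the param groups are empty at re-initialization is suggested by the "Removing param_group that has no 'params'" log line but is not confirmed in this report, and the helper `first_param` is hypothetical, not the fix that went into DeepSpeed or transformers.

```python
from typing import Any, Dict, List, Optional


def first_param(param_groups: List[Dict[str, Any]]) -> Optional[Any]:
    """Return the first parameter found in any group, or None if every group is empty."""
    for group in param_groups:
        if group.get("params"):
            return group["params"][0]
    return None


# Stand-in for optimizer.param_groups when the first group holds no parameters
# (an assumed state, used here only for illustration).
param_groups = [{"params": [], "lr": 1e-3}]

try:
    # Same indexing pattern as the line shown for stage3.py in the traceback.
    param_groups[0]["params"][0]
except IndexError as err:
    print(f"IndexError: {err}")

# A defensive lookup returns None instead of raising.
print(first_param(param_groups))
```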
Expected behavior
The code should be able to reload the optimizer without errors.
About this issue
- State: closed
- Created 2 years ago
- Comments: 18 (11 by maintainers)
 
Perfect. Thank you for validating, @base-y
I will merge the HF PR once DeepSpeed merges their side and makes a new release.
cc: @tjruwase
Hey, sorry for the delayed response. Sure, I will install huggingface and deepspeed locally from the PR branches and check if it works asap.
super! that’s very helpful, @base-y
I’m able to reproduce the failure:
I will try to analyze this later today or tomorrow.