transformers: Error on loading saved optimizer after training (zero-3)
System Info
Platform: Ubuntu 18.04.1
python3.8.0
cuda-11.3
torch==1.11.0+cu113 (GPU)
transformers==4.18.0
deepspeed==0.6.3
huggingface_hub version: 0.5.1
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Use any code that trains and evaluates a model with the Hugging Face Trainer (e.g. https://github.com/ElementAI/picard/blob/main/seq2seq/run_seq2seq.py#L216-L267). Set save_steps=1 in the config. Train for a few epochs and evaluate. An error is thrown when the Trainer tries to load the optimizer after training.
OPTIMIZER USED: adafactor (the issue also occurs with adamw_hf and adamw_torch)
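The settings described above can be written down as keyword arguments for the Trainer's training arguments. This is only a sketch: the file name ds_config.json is hypothetical, and the evaluation/save strategy and load_best_model_at_end values are inferred from the "Loading best model from ./output/checkpoint-12" log line rather than stated in the report.

```python
# Keyword arguments one might pass to transformers.Seq2SeqTrainingArguments
# (or TrainingArguments). Kept as a plain dict so the sketch runs without
# transformers or deepspeed installed.
training_kwargs = {
    "output_dir": "./output",          # matches the checkpoint path in the log
    "save_strategy": "steps",
    "save_steps": 1,                   # from the report: save_steps=1
    "evaluation_strategy": "steps",    # assumption: evaluate as often as we save
    "eval_steps": 1,                   # assumption
    "load_best_model_at_end": True,    # inferred from "Loading best model from ..."
    "optim": "adafactor",              # also reported with adamw_hf, adamw_torch
    "deepspeed": "ds_config.json",     # hypothetical path to the ZeRO-3 config
}

# With transformers installed, this would be used roughly as:
#   args = Seq2SeqTrainingArguments(**training_kwargs)
for name, value in training_kwargs.items():
    print(f"{name} = {value!r}")
```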
ZeRO-3 config (the same as the one on the HF documentation page):
{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
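Because the config is supplied as a JSON file, it can be worth confirming that it parses as strict JSON before launching a run; strict JSON rejects, for example, a comma after the last key of an object. The following is a minimal sketch using only the standard library. The helper name check_ds_config is hypothetical, and the two strings are shortened stand-ins for the full config.

```python
import json


def check_ds_config(text):
    """Return (config, None) if text is strict JSON, else (None, error message)."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, f"line {exc.lineno}, column {exc.colno}: {exc.msg}"


# Shortened stand-ins for a real config file's contents.
good = '{"zero_optimization": {"stage": 3}, "train_batch_size": "auto"}'
bad = '{"zero_optimization": {"stage": 3}, "train_batch_size": "auto",}'

config, error = check_ds_config(good)
print("good config, ZeRO stage:", config["zero_optimization"]["stage"])

config, error = check_ds_config(bad)
print("bad config rejected:", error)
```

The "auto" values are ordinary strings as far as JSON is concerned; they are filled in later by the Trainer integration.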
Traceback:
Training completed. Do not forget to share your model on huggingface.co/models =)
Loading best model from ./output/checkpoint-12 (score: 70.1).
[2022-05-06 14:15:47,319] [INFO] [logging.py:69:log_dist] [Rank 0] DeepSpeed info: version=0.6.3, git-hash=unknown, git-branch=unknown
[2022-05-06 14:15:47,323] [INFO] [engine.py:278:__init__] DeepSpeed Flops Profiler Enabled: False
[2022-05-06 14:15:47,323] [INFO] [engine.py:1042:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2022-05-06 14:15:47,323] [INFO] [engine.py:1048:_configure_optimizer] Using client Optimizer as basic optimizer
[2022-05-06 14:15:47,324] [INFO] [engine.py:1064:_configure_optimizer] DeepSpeed Basic Optimizer = Adafactor
[2022-05-06 14:15:47,324] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=Adafactor type=<class 'transformers.optimization.Adafactor'>
[2022-05-06 14:15:47,324] [WARNING] [engine.py:1077:_configure_optimizer] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2022-05-06 14:15:47,324] [INFO] [logging.py:69:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
[2022-05-06 14:15:47,324] [INFO] [engine.py:1362:_configure_zero_optimizer] Initializing ZeRO Stage 3
[2022-05-06 14:15:47,325] [INFO] [stage3.py:273:__init__] Reduce bucket size 262144
[2022-05-06 14:15:47,325] [INFO] [stage3.py:274:__init__] Allgather bucket size 235929.6
Using /home/user/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0006072521209716797 seconds
The extension-loading messages above and the traceback below were printed by each of the 8 ranks:
Traceback (most recent call last):
  File "train.py", line 227, in <module>
    main()
  File "train.py", line 176, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/user/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1546, in train
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_reinit(self)
  File "/home/user/venv/lib/python3.8/site-packages/transformers/deepspeed.py", line 374, in deepspeed_reinit
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**trainer.deepspeed_initialize_kwargs)
  File "/home/user/venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/user/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 294, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/user/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1080, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/user/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1365, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/home/user/venv/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 293, in __init__
    self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
IndexError: list index out of range
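The failing expression indexes the first parameter of the first param group. The sketch below uses plain dicts and a stand-in parameter class rather than a real optimizer, and shows how that indexing pattern raises IndexError when the group list or the first group's 'params' list is empty. Whether the optimizer handed to deepspeed_reinit was actually in this state is an assumption based on the traceback and the "Removing param_group that has no 'params'" log line; the helper names are hypothetical, and the guarded lookup is an illustration, not the fix that was merged.

```python
class FakeParam:
    """Stand-in for a tensor parameter; only carries a dtype label."""

    def __init__(self, dtype):
        self.dtype = dtype


def first_param_dtype(param_groups):
    # Same indexing pattern as the line shown in the traceback.
    return param_groups[0]["params"][0].dtype


def first_param_dtype_guarded(param_groups, default=None):
    # Illustrative alternative: use the first non-empty group, if any.
    for group in param_groups:
        if group["params"]:
            return group["params"][0].dtype
    return default


healthy = [{"params": [FakeParam("float16")]}]
empty_first_group = [{"params": []}, {"params": [FakeParam("float16")]}]
no_groups = []

print(first_param_dtype(healthy))

for groups in (empty_first_group, no_groups):
    try:
        first_param_dtype(groups)
    except IndexError as exc:
        print("IndexError:", exc)

print(first_param_dtype_guarded(empty_first_group))
print(first_param_dtype_guarded(no_groups))
```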
Expected behavior
The code should be able to reload the optimizer w/o errors.
About this issue
- State: closed
- Created 2 years ago
- Comments: 18 (11 by maintainers)
Perfect. Thank you for validating, @base-y
I will merge the HF PR once DeepSpeed merges their side and makes a new release.
cc: @tjruwase
Hey, sorry for the delayed response. Sure, I will install huggingface and deepspeed locally from the PR branches and check if it works asap.
super! that’s very helpful, @base-y
I’m able to reproduce the failure:
I will try to analyze this later today or tomorrow.