DeepSpeed: Not being able to save T5-11B checkpoint using deepspeed
Describe the bug
Not being able to save a T5-11B checkpoint using DeepSpeed.
To Reproduce
Steps to reproduce the behavior:
export BS=12
PYTHONPATH=../../../src USE_TF=0 deepspeed --num_gpus=4 ./run_translation.py \
--model_name_or_path /local/nlp/temp/poetryT511B0/checkpoint-801 \
--output_dir /local/nlp/temp/poetryT511B1 \
--evaluation_strategy=steps \
--save_strategy=epoch \
--eval_steps 200 \
--save_steps 200 \
--do_train \
--do_eval \
--train_file /home/tuhin.chakr/gpt3/poetrynew/train.json \
--validation_file /home/tuhin.chakr/gpt3/poetrynew/val.json \
--learning_rate 1e-3 \
--gradient_accumulation_steps 21 \
--overwrite_output_dir \
--max_source_length 64 \
--max_target_length 64 \
--num_train_epochs 1 \
--per_device_train_batch_size $BS \
--per_device_eval_batch_size $BS \
--source_lang en_XX \
--target_lang en_XX \
--deepspeed /home/tuhin.chakr/gpt3/transformers/tests/deepspeed/ds_config_zero3_1.json
Expected behavior
The checkpoint should be saved after training.
ds_report output
[INFO|trainer.py:2250] 2022-01-10 09:42:18,771 >> Num examples = 65394
[INFO|trainer.py:2253] 2022-01-10 09:42:18,771 >> Batch size = 12
{'eval_loss': 1.306259274482727, 'eval_runtime': 1585.6465, 'eval_samples_per_second': 41.241, 'eval_steps_per_second': 0.86, 'epoch': 1.0}
100%|██████████| 801/801 [20:22:08<00:00, 559.84s/it]
[INFO|trainer.py:2003] 2022-01-10 10:10:10,357 >> Saving model checkpoint to /local/nlp/temp/poetryT511B1/checkpoint-801
[INFO|configuration_utils.py:423] 2022-01-10 10:10:10,358 >> Configuration saved in /local/nlp/temp/poetryT511B1/checkpoint-801/config.json
[INFO|modeling_utils.py:1070] 2022-01-10 10:10:10,516 >> Model weights saved in /local/nlp/temp/poetryT511B1/checkpoint-801/pytorch_model.bin
[INFO|tokenization_utils_base.py:2043] 2022-01-10 10:10:10,517 >> tokenizer config file saved in /local/nlp/temp/poetryT511B1/checkpoint-801/tokenizer_config.json
[INFO|tokenization_utils_base.py:2049] 2022-01-10 10:10:10,517 >> Special tokens file saved in /local/nlp/temp/poetryT511B1/checkpoint-801/special_tokens_map.json
[INFO|tokenization_t5_fast.py:159] 2022-01-10 10:10:10,566 >> Copy vocab file to /local/nlp/temp/poetryT511B1/checkpoint-801/spiece.model
Traceback (most recent call last):
File "./run_translation.py", line 626, in <module>
Traceback (most recent call last):
Traceback (most recent call last):
File "./run_translation.py", line 626, in <module>
File "./run_translation.py", line 626, in <module>
Traceback (most recent call last):
File "./run_translation.py", line 626, in <module>
main()
File "./run_translation.py", line 543, in main
main()
File "./run_translation.py", line 543, in main
main()
File "./run_translation.py", line 543, in main
main()
File "./run_translation.py", line 543, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1399, in train
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1399, in train
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1399, in train
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1399, in train
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1503, in _maybe_log_save_evaluate
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1503, in _maybe_log_save_evaluate
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1503, in _maybe_log_save_evaluate
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1503, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1565, in _save_checkpoint
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1565, in _save_checkpoint
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1565, in _save_checkpoint
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1565, in _save_checkpoint
self.save_model(output_dir)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1966, in save_model
self.save_model(output_dir)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1966, in save_model
self.save_model(output_dir)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1966, in save_model
self.deepspeed.save_fp16_model(output_dir, WEIGHTS_NAME)
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 3024, in save_fp16_model
self.save_model(output_dir)
File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1966, in save_model
self.deepspeed.save_fp16_model(output_dir, WEIGHTS_NAME)self.deepspeed.save_fp16_model(output_dir, WEIGHTS_NAME)
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 3024, in save_fp16_model
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 3024, in save_fp16_model
self.deepspeed.save_fp16_model(output_dir, WEIGHTS_NAME)
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 3024, in save_fp16_model
state_dict = self._zero3_consolidated_fp16_state_dict()
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2999, in _zero3_consolidated_fp16_state_dict
state_dict = self._zero3_consolidated_fp16_state_dict()
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2999, in _zero3_consolidated_fp16_state_dict
state_dict = self._zero3_consolidated_fp16_state_dict()
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2999, in _zero3_consolidated_fp16_state_dict
state_dict = self._zero3_consolidated_fp16_state_dict()
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2999, in _zero3_consolidated_fp16_state_dict
get_layer_state_dict(self.module, prefix="")
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
get_layer_state_dict(self.module, prefix="")
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
get_layer_state_dict(child, prefix + name + ".")
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
get_layer_state_dict(self.module, prefix="")
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
get_layer_state_dict(self.module, prefix="")
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
get_layer_state_dict(child, prefix + name + ".")
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
get_layer_state_dict(child, prefix + name + ".")
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
get_layer_state_dict(child, prefix + name + ".")
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
get_layer_state_dict(child, prefix + name + ".")
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
get_layer_state_dict(child, prefix + name + ".")
[Previous line repeated 4 more times]
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2991, in get_layer_state_dict
get_layer_state_dict(child, prefix + name + ".")
[Previous line repeated 4 more times]
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2991, in get_layer_state_dict
state_dict[prefix + name] = buf.detach().cpu()
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 1327, in __exit__
get_layer_state_dict(child, prefix + name + ".")
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
state_dict[prefix + name] = buf.detach().cpu()
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 1327, in __exit__
get_layer_state_dict(child, prefix + name + ".")
[Previous line repeated 4 more times]
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2991, in get_layer_state_dict
state_dict[prefix + name] = buf.detach().cpu()
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 1327, in __exit__
get_layer_state_dict(child, prefix + name + ".")
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
get_layer_state_dict(child, prefix + name + ".")
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
get_layer_state_dict(child, prefix + name + ".")
[Previous line repeated 4 more times]
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2991, in get_layer_state_dict
state_dict[prefix + name] = buf.detach().cpu()
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 1327, in __exit__
self.params[0].partition(param_list=self.params, has_been_updated=True)self.params[0].partition(param_list=self.params, has_been_updated=True)
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 604, in partition
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 604, in partition
self.params[0].partition(param_list=self.params, has_been_updated=True)self.params[0].partition(param_list=self.params, has_been_updated=True)
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 604, in partition
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 604, in partition
self._partition(param_list, has_been_updated=has_been_updated)
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 716, in _partition
self._partition(param_list, has_been_updated=has_been_updated)
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 716, in _partition
self._partition(param_list, has_been_updated=has_been_updated)
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 716, in _partition
self._partition(param_list, has_been_updated=has_been_updated)
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 716, in _partition
self._partition_param(param, has_been_updated=has_been_updated)
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 725, in _partition_param
self._partition_param(param, has_been_updated=has_been_updated)
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 725, in _partition_param
self._partition_param(param, has_been_updated=has_been_updated)
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 725, in _partition_param
self._partition_param(param, has_been_updated=has_been_updated)
File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 725, in _partition_param
assert param.ds_status is not ZeroParamStatus.INFLIGHT, f" {param} Cannot partition a param in flight"
AssertionError: Parameter containing:
tensor([[-0.0184, 0.0311, 0.0164, ..., 0.0964, 0.0053, 0.0294],
[ 0.0060, -0.0118, 0.0124, ..., -0.0006, 0.0004, 0.0281],
[-0.0068, 0.0219, -0.0637, ..., 0.0357, 0.0150, 0.0212],
...,
[ 0.0526, -0.0020, 0.0183, ..., 0.0039, 0.0156, 0.0289],
[ 0.0212, -0.0099, -0.0158, ..., 0.0561, 0.0485, 0.0107],
[ 0.0658, 0.0129, 0.1380, ..., -0.0192, -0.0014, 0.0330]],
device='cuda:2', requires_grad=True) Cannot partition a param in flight
assert param.ds_status is not ZeroParamStatus.INFLIGHT, f" {param} Cannot partition a param in flight"
AssertionError: Parameter containing:
tensor([[-0.0184, 0.0311, 0.0164, ..., 0.0964, 0.0053, 0.0294],
[ 0.0060, -0.0118, 0.0124, ..., -0.0006, 0.0004, 0.0281],
[-0.0068, 0.0219, -0.0637, ..., 0.0357, 0.0150, 0.0212],
...,
[ 0.0526, -0.0020, 0.0183, ..., 0.0039, 0.0156, 0.0289],
[ 0.0212, -0.0099, -0.0158, ..., 0.0561, 0.0485, 0.0107],
[ 0.0658, 0.0129, 0.1380, ..., -0.0192, -0.0014, 0.0330]],
device='cuda:1', requires_grad=True) Cannot partition a param in flight
assert param.ds_status is not ZeroParamStatus.INFLIGHT, f" {param} Cannot partition a param in flight"
AssertionError: Parameter containing:
tensor([[-0.0184, 0.0311, 0.0164, ..., 0.0964, 0.0053, 0.0294],
[ 0.0060, -0.0118, 0.0124, ..., -0.0006, 0.0004, 0.0281],
[-0.0068, 0.0219, -0.0637, ..., 0.0357, 0.0150, 0.0212],
...,
[ 0.0526, -0.0020, 0.0183, ..., 0.0039, 0.0156, 0.0289],
[ 0.0212, -0.0099, -0.0158, ..., 0.0561, 0.0485, 0.0107],
[ 0.0658, 0.0129, 0.1380, ..., -0.0192, -0.0014, 0.0330]],
device='cuda:3', requires_grad=True) Cannot partition a param in flight
assert param.ds_status is not ZeroParamStatus.INFLIGHT, f" {param} Cannot partition a param in flight"
AssertionError: Parameter containing:
tensor([[-0.0184, 0.0311, 0.0164, ..., 0.0964, 0.0053, 0.0294],
[ 0.0060, -0.0118, 0.0124, ..., -0.0006, 0.0004, 0.0281],
[-0.0068, 0.0219, -0.0637, ..., 0.0357, 0.0150, 0.0212],
...,
[ 0.0526, -0.0020, 0.0183, ..., 0.0039, 0.0156, 0.0289],
[ 0.0212, -0.0099, -0.0158, ..., 0.0561, 0.0485, 0.0107],
[ 0.0658, 0.0129, 0.1380, ..., -0.0192, -0.0014, 0.0330]],
device='cuda:0', requires_grad=True) Cannot partition a param in flight
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████
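For readers less familiar with the assertion at the bottom of the trace, here is a toy model of the check that fires. It is a sketch and not DeepSpeed's code: `ToyParamStatus`, `ToyParam`, `toy_partition` and the parameter name `toy.weight` are hypothetical. The only thing taken from the trace is that partitioning asserts the parameter's status is not `INFLIGHT`; the `AVAILABLE` and `NOT_AVAILABLE` states are assumptions added for illustration.

```python
from enum import Enum


class ToyParamStatus(Enum):
    """Toy stand-in for the status enum named in the trace (ZeroParamStatus)."""
    AVAILABLE = 1      # assumption: the full parameter is present on this rank
    NOT_AVAILABLE = 2  # assumption: the parameter is partitioned across ranks
    INFLIGHT = 3       # from the trace: the parameter is still "in flight"


class ToyParam:
    """Hypothetical parameter that only tracks its name and status."""

    def __init__(self, name, status):
        self.name = name
        self.status = status


def toy_partition(param):
    """Mimic the guard from the trace: refuse to partition an in-flight param."""
    assert param.status is not ToyParamStatus.INFLIGHT, (
        f"{param.name} Cannot partition a param in flight"
    )
    param.status = ToyParamStatus.NOT_AVAILABLE


# A fully available parameter can be partitioned again.
ready = ToyParam("toy.weight", ToyParamStatus.AVAILABLE)
toy_partition(ready)

# A parameter that is still in flight trips the assertion.
pending = ToyParam("toy.weight", ToyParamStatus.INFLIGHT)
try:
    toy_partition(pending)
    failed = False
    message = ""
except AssertionError as err:
    failed = True
    message = str(err)

print(failed, message)
```

In this toy model the save path corresponds to calling `toy_partition` while some parameter is still marked `INFLIGHT`, which is the state all four ranks report above.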
About this issue
- State: closed
- Created 2 years ago
- Comments: 36 (30 by maintainers)
I appreciate you re-validating the fix, @aphedges!
@tjruwase made it much easier to do so, as he tirelessly makes the DeepSpeed codebase easier to adjust, so I had changed the final version of the fix.
re: VCS
I just have a clone, which is even faster to `git pull` 😃
There are many ways to install a module 😉
Today I tried out the revised fix that made it into https://github.com/microsoft/DeepSpeed/commit/baef92e26fef5aa0da63f26d444b91c2a7aa0bd3 on my full script, and it worked properly! Thank you very much for your work on fixing this issue!
I am aware of the VCS installation in pip and it’s what I usually use, but it seems to be slightly faster to let GitHub zip it first, at least with n=1. I guess it also helps if one doesn’t have Git on their system for some reason.
The full program works!
I was using an editable install of `transformers`, and it’s pure Python, so it’s not like I needed to recompile. I really have no clue. Anyway, if we had discovered the fix that way, we probably wouldn’t have realized that there was an interaction between gradient accumulation and checkpointing.
Not extremely relevant, but today I learned that `pip install https://github.com/stas00/DeepSpeed/archive/refs/heads/save_16bit_model-save_checkpoint_prologue.zip` works.
I’m glad you figured out what the problem was! The fact that saving in the middle of gradient accumulation causes a failure is a really good explanation for why these two parameters together cause a problem. Switching from `--save-strategy steps` (which defaults to 500 steps) to `--save-strategy epoch` causes the failure because there likely weren’t even 500 steps during training to encounter any issues with.
I have tried out #1741, and I can confirm it works on my minimal example. I’ll try it out on my real tasks and verify it works there as well.
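To make the save-strategy point concrete, here is a small sketch of when a save would be triggered under each strategy. It is not the Trainer's code: `save_points` and its parameters are hypothetical, the run length of 300 steps with 100 steps per epoch is made up, and the sketch assumes a save every `save_steps` steps for `steps` and one save at the end of every epoch for `epoch`.

```python
def save_points(strategy, total_steps, steps_per_epoch, save_steps=500):
    """Return the steps at which a checkpoint save would be triggered.

    Toy model only; the default of 500 mirrors the default mentioned above.
    """
    if strategy == "steps":
        return list(range(save_steps, total_steps + 1, save_steps))
    if strategy == "epoch":
        return list(range(steps_per_epoch, total_steps + 1, steps_per_epoch))
    raise ValueError(f"unknown strategy: {strategy!r}")


# A short run of 300 steps never reaches the first save with the default interval,
short_run_steps = save_points("steps", total_steps=300, steps_per_epoch=100)
# but it saves once per epoch with the epoch strategy.
short_run_epoch = save_points("epoch", total_steps=300, steps_per_epoch=100)
print(short_run_steps, short_run_epoch)
```

Under these assumptions the `steps` run never calls the save path at all, which would explain why it never hit the failure.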
I did try copying the `save_pretrained` preamble as part of https://github.com/microsoft/DeepSpeed/issues/1686#issuecomment-1018590755. Checking the log file I attached, the line numbers in the stack trace match up with the line numbers in the diff. I have no clue why that did not work but your PR did.
I’ve gotten the reproduction down to under 150 lines of code, most of which is not relevant.
Here are the steps to reproduce the bug:
- With `$CUDA_HOME` appropriately set, run `setup.sh`. I used Python 3.9.9 and CUDA 11.1.0, and I’m not sure if it works with other versions.
- Run `reproduce.sh`.
Here is the console output (log_2022_02_01_05_10_16.txt) and `ds_report` output (ds_report.txt).
Success!
@tjruwase, Alex helped us to find the culprit. The problem happens when `save_interval % grad_accum != 0`.
Here is a trivial test case that reproduces the problem (this is just my standard launch cmd, so please ignore all but the last 2 args):
See the last 2 args.
I use grad_accum 3 and save interval 1! Boom!
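The condition above can be checked ahead of a run. This is a minimal sketch, assuming `save_interval` and `grad_accum` are counted in the same unit of steps; the function names are hypothetical and are not part of DeepSpeed or transformers.

```python
def saves_mid_accumulation(save_interval, grad_accum):
    """True when the condition from the comment above holds."""
    return save_interval % grad_accum != 0


def risky_save_steps(save_interval, grad_accum, total_steps):
    """List the save steps that do not land on an accumulation boundary."""
    return [
        step
        for step in range(save_interval, total_steps + 1, save_interval)
        if step % grad_accum != 0
    ]


# The failing combination reported above: grad_accum 3, save interval 1.
print(saves_mid_accumulation(1, 3))   # True
print(risky_save_steps(1, 3, 6))      # [1, 2, 4, 5]
# Saving every 6 steps with grad_accum 3 always lands on a boundary.
print(saves_mid_accumulation(6, 3))   # False
```

Picking a save interval that is a multiple of the accumulation steps avoids the condition, although that only sidesteps the bug and does not replace the fix.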
Actually, a much better test would probably be to put an actual GPU synchronization there instead of a sleep.
What happens if you put:
just before `deepspeed.save_checkpoint` as above?