DeepSpeed: unable to save T5-11B checkpoint when training with DeepSpeed

Describe the bug: Unable to save a T5-11B checkpoint when training with DeepSpeed; the checkpoint save at the end of training fails with a ZeRO-3 assertion error.

To reproduce:

export BS=12

PYTHONPATH=../../../src USE_TF=0 deepspeed --num_gpus=4 ./run_translation.py \
        --model_name_or_path  /local/nlp/temp/poetryT511B0/checkpoint-801 \
        --output_dir /local/nlp/temp/poetryT511B1 \
        --evaluation_strategy=steps \
        --save_strategy=epoch \
        --eval_steps 200 \
        --save_steps 200 \
        --do_train \
        --do_eval \
        --train_file /home/tuhin.chakr/gpt3/poetrynew/train.json \
        --validation_file /home/tuhin.chakr/gpt3/poetrynew/val.json \
        --learning_rate 1e-3 \
        --gradient_accumulation_steps 21 \
        --overwrite_output_dir \
        --max_source_length 64 \
        --max_target_length 64 \
        --num_train_epochs 1 \
        --per_device_train_batch_size $BS \
        --per_device_eval_batch_size $BS \
        --source_lang en_XX \
        --target_lang en_XX \
        --deepspeed /home/tuhin.chakr/gpt3/transformers/tests/deepspeed/ds_config_zero3_1.json

Expected behavior: the checkpoint is saved after training.

Log output

[INFO|trainer.py:2250] 2022-01-10 09:42:18,771 >>   Num examples = 65394
[INFO|trainer.py:2253] 2022-01-10 09:42:18,771 >>   Batch size = 12
{'eval_loss': 1.306259274482727, 'eval_runtime': 1585.6465, 'eval_samples_per_second': 41.241, 'eval_steps_per_second': 0.86, 'epoch': 1.0}                                                                                                                                                                                                                          
100%|██████████| 801/801 [20:22:08<00:00, 559.84s/it]
[INFO|trainer.py:2003] 2022-01-10 10:10:10,357 >> Saving model checkpoint to /local/nlp/temp/poetryT511B1/checkpoint-801
[INFO|configuration_utils.py:423] 2022-01-10 10:10:10,358 >> Configuration saved in /local/nlp/temp/poetryT511B1/checkpoint-801/config.json
[INFO|modeling_utils.py:1070] 2022-01-10 10:10:10,516 >> Model weights saved in /local/nlp/temp/poetryT511B1/checkpoint-801/pytorch_model.bin
[INFO|tokenization_utils_base.py:2043] 2022-01-10 10:10:10,517 >> tokenizer config file saved in /local/nlp/temp/poetryT511B1/checkpoint-801/tokenizer_config.json
[INFO|tokenization_utils_base.py:2049] 2022-01-10 10:10:10,517 >> Special tokens file saved in /local/nlp/temp/poetryT511B1/checkpoint-801/special_tokens_map.json
[INFO|tokenization_t5_fast.py:159] 2022-01-10 10:10:10,566 >> Copy vocab file to /local/nlp/temp/poetryT511B1/checkpoint-801/spiece.model
The same assertion is raised on each of the 4 ranks (cuda:0–cuda:3); one copy of the traceback is shown:

Traceback (most recent call last):
  File "./run_translation.py", line 626, in <module>
    main()
  File "./run_translation.py", line 543, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1399, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1503, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1565, in _save_checkpoint
    self.save_model(output_dir)
  File "/home/tuhin.chakr/yes/envs/fairseq/lib/python3.8/site-packages/transformers/trainer.py", line 1966, in save_model
    self.deepspeed.save_fp16_model(output_dir, WEIGHTS_NAME)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 3024, in save_fp16_model
    state_dict = self._zero3_consolidated_fp16_state_dict()
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2999, in _zero3_consolidated_fp16_state_dict
    get_layer_state_dict(self.module, prefix="")
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
    get_layer_state_dict(child, prefix + name + ".")
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
    get_layer_state_dict(child, prefix + name + ".")
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2996, in get_layer_state_dict
    get_layer_state_dict(child, prefix + name + ".")
  [Previous line repeated 4 more times]
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/engine.py", line 2991, in get_layer_state_dict
    state_dict[prefix + name] = buf.detach().cpu()
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 1327, in __exit__
    self.params[0].partition(param_list=self.params, has_been_updated=True)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 604, in partition
    self._partition(param_list, has_been_updated=has_been_updated)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 716, in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
  File "/home/tuhin.chakr/DeepSpeed/deepspeed/runtime/zero/partition_parameters.py", line 725, in _partition_param
    assert param.ds_status is not ZeroParamStatus.INFLIGHT, f" {param} Cannot partition a param in flight"
AssertionError:  Parameter containing:
tensor([[-0.0184,  0.0311,  0.0164,  ...,  0.0964,  0.0053,  0.0294],
        [ 0.0060, -0.0118,  0.0124,  ..., -0.0006,  0.0004,  0.0281],
        [-0.0068,  0.0219, -0.0637,  ...,  0.0357,  0.0150,  0.0212],
        ...,
        [ 0.0526, -0.0020,  0.0183,  ...,  0.0039,  0.0156,  0.0289],
        [ 0.0212, -0.0099, -0.0158,  ...,  0.0561,  0.0485,  0.0107],
        [ 0.0658,  0.0129,  0.1380,  ..., -0.0192, -0.0014,  0.0330]],
       device='cuda:0', requires_grad=True) Cannot partition a param in flight

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 36 (30 by maintainers)

Most upvoted comments

I appreciate you re-validating the fix, @aphedges!

@tjruwase made it much easier to do, as he tirelessly keeps making the DeepSpeed codebase easier to adjust, so I had changed the final version of the fix.


re: VCS

I just have a clone, which is even faster to git pull 😃

There are many ways to install a module 😉

Today I tried out the revised fix that made it into https://github.com/microsoft/DeepSpeed/commit/baef92e26fef5aa0da63f26d444b91c2a7aa0bd3 on my full script, and it worked properly! Thank you very much for your work on fixing this issue!


I am aware of the VCS installation in pip and it’s what I usually use, but it seems to be slightly faster to let GitHub zip it first, at least with n=1. I guess it also helps if one doesn’t have Git on their system for some reason.

Excellent. Let us know about the full program.

The full program works!

Perhaps you changed it in the source, but the copy that was run was different. It’s all good.

I was using an editable install of transformers and it’s pure Python, so it’s not like I needed to recompile. I really have no clue. Anyway, if we had discovered the fix that way, we probably wouldn’t have realized that there was an interaction between gradient accumulation and checkpointing.

Please let me know if you need help with figuring out how to install deepspeed from a custom branch.

Not extremely relevant, but today I learned that pip install https://github.com/stas00/DeepSpeed/archive/refs/heads/save_16bit_model-save_checkpoint_prologue.zip works.

I’m glad you figured out what the problem was! The fact that saving in the middle of gradient accumulation causes a failure is a really good explanation for why these two parameters together cause a problem. Switching from --save_strategy steps (where --save_steps defaults to 500) to --save_strategy epoch triggers the failure because there likely weren’t even 500 steps during training, so the steps-based strategy never got the chance to hit the issue.
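A minimal sketch of that condition, for readers skimming the thread (the helper name is made up for illustration and is not part of transformers or DeepSpeed):

def save_may_fall_mid_accumulation(save_steps: int, gradient_accumulation_steps: int) -> bool:
    # Under ZeRO-3 the crash showed up when a checkpoint save could be requested
    # while a gradient-accumulation cycle was still in progress, i.e. when the
    # save interval is not a multiple of the accumulation steps.
    return save_steps % gradient_accumulation_steps != 0

print(save_may_fall_mid_accumulation(1, 3))  # True  -> the failing combination from the trivial test case below
print(save_may_fall_mid_accumulation(3, 3))  # False -> saves line up with accumulation boundaries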

I have tried out #1741, and I can confirm it works on my minimal example. I’ll try it out on my real tasks and verify it works there as well.

I did try copying the save_pretrained preamble as part of https://github.com/microsoft/DeepSpeed/issues/1686#issuecomment-1018590755. Checking the log file I attached, the line numbers in the stack trace match up with the line numbers in the diff. I have no clue why that did not work but your PR did.

I’ve gotten the reproduction down to under 150 lines of code, most of which is not relevant.

Here are the steps to reproduce the bug:

  • Download and extract reproduce_1686.zip
  • Navigate into the directory with the zip file contents
  • In a Python environment with $CUDA_HOME appropriately set, run setup.sh. I used Python 3.9.9 and CUDA 11.1.0, and I’m not sure if it works with other versions
  • Run reproduce.sh

Here is the console output (log_2022_02_01_05_10_16.txt) and ds_report output (ds_report.txt).

Success!

@tjruwase, Alex helped us to find the culprit. The problem happens when save_interval % grad_accum != 0

Here is a trivial test case that reproduces the problem (this is just my standard launch cmd, so please ignore all but the last 2 args):

USE_TF=0 deepspeed --num_gpus 2 examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small --output_dir /tmp/zero3 --overwrite_output_dir \
--max_train_samples 10 --max_eval_samples 10 --max_source_length 128 \
--max_target_length 128 --val_max_target_length 128 --do_train --do_eval \
--num_train_epochs 1 --per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 --learning_rate 3e-3 --warmup_steps 500 \
--predict_with_generate --eval_steps 1 --group_by_length \
--dataset_name wmt16 --dataset_config ro-en --source_lang en --target_lang ro \
--source_prefix 'translate English to Romanian: ' --deepspeed \
tests/deepspeed/ds_config_zero3.json \
--save_steps 1 --gradient_accumulation_steps 3

See the last 2 args.

I use grad_accum 3 and save interval 1! Boom!
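
To spell out the arithmetic, here is a small sketch using the numbers from the launch command above (the training log below reports the same totals):

# Numbers taken from the launch command above
num_gpus = 2
per_device_train_batch_size = 4
gradient_accumulation_steps = 3
max_train_samples = 10
save_steps = 1

# Matches "Total train batch size (w. parallel, distributed & accumulation) = 24" in the log
total_train_batch_size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps
print(total_train_batch_size)  # 24 -> with only 10 training samples there is a single optimization step

# The condition stated above: the save interval is not a multiple of grad accumulation,
# so the very first save request lands mid-accumulation and trips the ZeRO-3 assertion.
print(save_steps % gradient_accumulation_steps != 0)  # True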

[INFO|trainer.py:1263] 2022-01-31 21:14:28,673 >> ***** Running training *****
[INFO|trainer.py:1264] 2022-01-31 21:14:28,673 >>   Num examples = 10
[INFO|trainer.py:1265] 2022-01-31 21:14:28,673 >>   Num Epochs = 1
[INFO|trainer.py:1266] 2022-01-31 21:14:28,673 >>   Instantaneous batch size per device = 4
[INFO|trainer.py:1267] 2022-01-31 21:14:28,673 >>   Total train batch size (w. parallel, distributed & accumulation) = 24
[INFO|trainer.py:1268] 2022-01-31 21:14:28,673 >>   Gradient Accumulation steps = 3
[INFO|trainer.py:1269] 2022-01-31 21:14:28,673 >>   Total optimization steps = 1
100%|██████████| 1/1 [00:01<00:00,  1.73s/it]
[INFO|trainer.py:2114] 2022-01-31 21:14:31,515 >> Saving model checkpoint to /tmp/zero3/checkpoint-1
[INFO|configuration_utils.py:430] 2022-01-31 21:14:31,516 >> Configuration saved in /tmp/zero3/checkpoint-1/config.json
[INFO|modeling_utils.py:1074] 2022-01-31 21:14:31,739 >> Model weights saved in /tmp/zero3/checkpoint-1/pytorch_model.bin
[INFO|tokenization_utils_base.py:2074] 2022-01-31 21:14:31,740 >> tokenizer config file saved in /tmp/zero3/checkpoint-1/tokenizer_config.json
[INFO|tokenization_utils_base.py:2080] 2022-01-31 21:14:31,740 >> Special tokens file saved in /tmp/zero3/checkpoint-1/special_tokens_map.json
[INFO|tokenization_t5_fast.py:162] 2022-01-31 21:14:31,766 >> Copy vocab file to /tmp/zero3/checkpoint-1/spiece.model
Both ranks raise the same assertion; one copy of the traceback is shown:

Traceback (most recent call last):
  File "examples/pytorch/translation/run_translation.py", line 624, in <module>
    main()
  File "examples/pytorch/translation/run_translation.py", line 541, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 1459, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 1588, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 1656, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 2068, in save_model
    if not self.deepspeed.save_fp16_model(output_dir, WEIGHTS_NAME):
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 3096, in save_fp16_model
    return self.save_16bit_model(save_dir, save_filename)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 3122, in save_16bit_model
    state_dict = self._zero3_consolidated_16bit_state_dict()
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 3088, in _zero3_consolidated_16bit_state_dict
    get_layer_state_dict(self.module, prefix="")
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 3085, in get_layer_state_dict
    get_layer_state_dict(child, prefix + name + ".")
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 3080, in get_layer_state_dict
    state_dict[prefix + name] = buf.detach().cpu()
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/zero/partition_parameters.py", line 1653, in __exit__
    self.params[0].partition(param_list=self.params, has_been_updated=True)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/zero/partition_parameters.py", line 899, in partition
    self._partition(param_list, has_been_updated=has_been_updated)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/zero/partition_parameters.py", line 1043, in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/zero/partition_parameters.py", line 1160, in _partition_param
    free_param(param)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/zero/partition_parameters.py", line 255, in free_param
    assert not param.ds_active_sub_modules, param.ds_summary()
AssertionError: {'id': 132, 'status': 'AVAILABLE', 'numel': 16435200, 'ds_numel': 16435200, 'shape': (32100, 512), 'ds_shape': (32100, 512), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {111}}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.23s/it]
[2022-01-31 21:14:32,954] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 35379
[2022-01-31 21:14:32,955] [INFO] [launch.py:178:sigkill_handler] Killing subprocess 35380
[2022-01-31 21:14:32,955] [ERROR] [launch.py:184:sigkill_handler] ['/home/stas/anaconda3/envs/py38-pt110/bin/python', '-u', 'examples/pytorch/translation/run_translation.py', '--local_rank=1', '--model_name_or_path', 't5-small', '--output_dir', '/tmp/zero3', '--overwrite_output_dir', '--max_train_samples', '10', '--max_eval_samples', '10', '--max_source_length', '128', '--max_target_length', '128', '--val_max_target_length', '128', '--do_train', '--do_eval', '--num_train_epochs', '1', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--learning_rate', '3e-3', '--warmup_steps', '500', '--predict_with_generate', '--save_steps', '1', '--eval_steps', '1', '--group_by_length', '--dataset_name', 'wmt16', '--dataset_config', 'ro-en', '--source_lang', 'en', '--target_lang', 'ro', '--source_prefix', 'translate English to Romanian: ', '--deepspeed', 'tests/deepspeed/ds_config_zero3.json', '--gradient_accumulation_steps', '3'] exits with return code = 1
Command exited with non-zero status 1

Actually, a much better test would be to put an actual GPU synchronization in place of the sleep.

What happens if you put:

import torch.distributed as dist
dist.barrier()

just before deepspeed.save_checkpoint as above?
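
For anyone who wants to try that experiment, a minimal sketch of the suggested placement (the wrapper function and variable names are illustrative, not actual Trainer code; deepspeed_engine.save_checkpoint is DeepSpeed's standard checkpointing entry point):

import torch.distributed as dist

def save_with_barrier(deepspeed_engine, output_dir):
    # Make every rank reach this point before any rank starts saving, so that no
    # parameter gather/partition is still in flight on another rank when the save begins.
    dist.barrier()
    deepspeed_engine.save_checkpoint(output_dir)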