DeepSpeed: zero_to_fp32.py still loads the wrong weights after the fix

Same as #1165; that issue is closed now, so I’m opening a new one. It looks like zero_to_fp32.py still cannot reconstruct the correct weights in a multi-GPU, ZeRO-2 setting.

Loading 'mp_rank_00_model_states.pt':

    states['module']['deberta.encoder.layer.0.output.dense.weight']
    tensor([[-0.0211,  0.0068,  0.0206,  ...,  0.0057,  0.0316,  0.0256],
            [ 0.0273,  0.0141,  0.0118,  ..., -0.0122,  0.0054,  0.0010],
            [ 0.0479, -0.0237, -0.0604,  ..., -0.0340, -0.0183,  0.0691],
            ...,
            [ 0.0270, -0.0231,  0.0218,  ...,  0.0563,  0.0641, -0.0094],
            [-0.0563, -0.0837, -0.0427,  ...,  0.0242, -0.0132, -0.0512],
            [-0.0012,  0.0064,  0.0465,  ...,  0.0219,  0.0259, -0.0281]],
           device='cuda:0', dtype=torch.float16)

Loading the exported weights using load_state_dict_from_zero_checkpoint:

    (Pdb) self.deberta.encoder.layer[0].output.dense.weight
    Parameter containing:
    tensor([[ 0.0207, -0.0448,  0.0022,  ...,  0.0406, -0.0338, -0.0174],
            [-0.0577, -0.0648,  0.0404,  ...,  0.0108, -0.0167, -0.0100],
            [ 0.0548,  0.0063,  0.0024,  ...,  0.0311,  0.0249,  0.0167],
            ...,
            [-0.0081,  0.0194, -0.0266,  ..., -0.0269, -0.0002,  0.0257],
            [ 0.0202, -0.0002,  0.0831,  ..., -0.0008, -0.0094,  0.0258],
            [-0.0320,  0.0529, -0.0259,  ...,  0.0117, -0.0292, -0.0064]],

I’m using DeepSpeed 0.4.5. I can confirm the problem happens on 2 GPUs and not on 1 GPU.
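
For anyone trying to reproduce the comparison above, here is a minimal sketch of the check I’m doing (assuming the usual checkpoint layout of <checkpoint_dir>/<tag>/mp_rank_00_model_states.pt next to the zero_* shards; compare_weight is just an illustrative helper, not a DeepSpeed API):

    import torch
    from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

    def compare_weight(model, checkpoint_dir, tag, key):
        """Compare one parameter between the rank-0 fp16 model states and the
        weights reconstructed from the ZeRO-2 partitions."""
        # weight as rank 0 saved it during training
        states = torch.load(f"{checkpoint_dir}/{tag}/mp_rank_00_model_states.pt",
                            map_location="cpu")
        saved = states["module"][key].float()

        # same weight after reconstruction from the zero_* shards
        model = load_state_dict_from_zero_checkpoint(model, checkpoint_dir, tag=tag)
        restored = dict(model.named_parameters())[key].detach().cpu().float()

        # with 2 GPUs + ZeRO-2 this currently prints False for me; on 1 GPU it prints True
        print(key, torch.allclose(saved, restored, atol=1e-3))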

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 29 (26 by maintainers)

Most upvoted comments

@stas00 I’m sorry that I may have made a mistake.

  1. As I previously commented (https://github.com/microsoft/DeepSpeed/issues/1317#issuecomment-929753367), the test failed with my code that tweaks the model params, and it also failed with the clean code, as I mentioned (https://github.com/microsoft/DeepSpeed/issues/1317#issuecomment-929783068).
  2. But the -sv args don’t print any traceback, so at the time I didn’t realize what caused the failure. Now I’ve rerun with -v and captured the traceback below. It seems to be a permission error rather than a bug in DeepSpeed (see the workaround sketch after this list).
Traceback (most recent call last):                                                                                                                                                   
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap                                                                             
    self.run()                                                                                                                                                                       
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/multiprocessing/process.py", line 93, in run                                                                                     
    self._target(*self._args, **self._kwargs)                                                                                                                                        
  File "/home/huangbz/git_repo/DeepSpeed/tests/unit/common.py", line 53, in dist_init                                                                                                
    run_func(*func_args, **func_kwargs)                                                                                                                                              
  File "/home/huangbz/git_repo/DeepSpeed/tests/unit/test_zero.py", line 196, in _test_zero_to_fp32
    model_parameters=model.parameters())
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/__init__.py", line 141, in initialize
    config_params=config_params)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/runtime/engine.py", line 220, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/runtime/engine.py", line 860, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/runtime/engine.py", line 942, in _configure_basic_optimizer
    adam_w_mode=effective_adam_w_mode)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/ops/adam/fused_adam.py", line 72, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/ops/op_builder/builder.py", line 355, in load
    return self.jit_load(verbose)
  File "/home/huangbz/git_repo/DeepSpeed/deepspeed/ops/op_builder/builder.py", line 380, in jit_load
    os.makedirs(ext_path, exist_ok=True)
  File "/home/huangbz/.conda/envs/NLP/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/tmp/torch_extensions/fused_adam'
  3. I’m now testing the PR; I’ll reply to you as soon as possible.
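
For the record, the PermissionError above is about the JIT build directory for the fused Adam op (/tmp/torch_extensions on this machine is not writable by my user), not about ZeRO itself. A minimal workaround sketch, assuming a writable scratch directory; the TORCH_EXTENSIONS_DIR variable is honored by torch.utils.cpp_extension, which DeepSpeed’s op builder uses for JIT compilation:

    import os

    # redirect the JIT extension build cache to a directory this user can write to
    # (hypothetical path; the default /tmp/torch_extensions is owned by another user here)
    os.environ["TORCH_EXTENSIONS_DIR"] = os.path.expanduser("~/.cache/torch_extensions")

    import deepspeed  # set the variable before DeepSpeed JIT-builds any of its ops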

@stas00 Sorry for the late response. I tried tweaking the model as below to reproduce the total model param count of 50265.

@distributed_test(world_size=[2])
def _test_zero_to_fp32():
    class MyModel(torch.nn.Module):
        def __init__(self, hidden_dim, n_layers):
            super().__init__()
            # to reproduce https://github.com/microsoft/DeepSpeed/pull/1372 it is important that
            # the number of total elements is uneven:
            # (1) 4188 layers of 3*(3+1)=12 elements each, 50256 in total
            self.ll = torch.nn.ModuleList(
                torch.nn.Linear(hidden_dim,
                                hidden_dim) for i in range(n_layers))
            # (2) the following adds 8+1=9 elements
            self.classifier = torch.nn.Linear(8, 1)
            # total 50256 + 9 = 50265 (uneven as desired) elements
            self.cross_entropy_loss = torch.nn.CrossEntropyLoss()

        def forward(self, x, y):
            hidden = x
            for l in self.ll:
                hidden = l(hidden)
            return self.cross_entropy_loss(hidden, y)

    args = args_from_dict(tmpdir, config_dict)
    hidden_dim = 3  # do not change

    world_size = dist.get_world_size()
    # we want at least 2x layers as there are gpus to trigger round_robin_fp16_groups reshuffle in zero2
    n_layers = world_size * 2094 # total 4188 layers

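As a quick stand-alone sanity check of the arithmetic in the comments above (not part of the test itself), the toy model does land on the intended uneven total for world_size=2:

    import torch

    hidden_dim, n_layers = 3, 2 * 2094   # world_size = 2 -> 4188 layers
    ll = torch.nn.ModuleList(
        torch.nn.Linear(hidden_dim, hidden_dim) for _ in range(n_layers))
    classifier = torch.nn.Linear(8, 1)

    total = (sum(p.numel() for p in ll.parameters())
             + sum(p.numel() for p in classifier.parameters()))
    print(total)   # 4188 * (3*3 + 3) + (8 + 1) = 50265
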
And I ran the test as you instructed. It seems that the test failed:

pytest --forked tests/unit/test_zero.py -k test_zero_to_fp32[2] -sv

tests/unit/test_zero.py::test_zero_to_fp32[2] FAILED

===================================================================================== FAILURES ======================================================================================
_______________________________________________________________________________ test_zero_to_fp32[2] ________________________________________________________________________________
Worker 0 exited with code 1
============================================================================== short test summary info ==============================================================================
FAILED tests/unit/test_zero.py::test_zero_to_fp32[2]
========================================================================= 1 failed, 10 deselected in 35.30s =========================================================================

Should I make a Pull Request?

@stas00 Thanks for your instructions. I did the following tests and hope they help you reproduce the bug.

  1. I used the plain BartPretrainedModel class (which MyBart subclasses), and everything works well. So I guess it is probably the way I implemented MyBart that produces the error.

  2. I reduced this model to 2 layers and the error still occurs.

    code of MyBart
     from torch import nn
     from transformers import BartConfig, BartModel, BartPretrainedModel

     class MyBart(BartPretrainedModel):
         def __init__(self, config: BartConfig):
             super().__init__(config)
             self.model = BartModel(config)
             self.lm_head = nn.Linear(config.d_model, self.model.shared.num_embeddings, bias=True)
             self.si_lm_head = nn.Linear(config.d_model, self.model.shared.num_embeddings, bias=True)
             self.dg_lm_head = nn.Linear(config.d_model, self.model.shared.num_embeddings, bias=True)
             self._init_weights(self.si_lm_head)
             self._init_weights(self.dg_lm_head)
             self._init_weights(self.lm_head)
    
    output of zero_to_fp32.py
     Processing zero checkpoint './ckpt'
     Detected checkpoint of type zero stage 2, world_size: 4
     Found buffers: []
     fp32_flat_groups[i].shape=torch.Size([43170884])
     fp32_flat_groups[i].shape=torch.Size([43170884])
     fp32_flat_groups[i].shape=torch.Size([43170884])
     fp32_flat_groups[i].shape=torch.Size([43170883])
     Have 172683535 numels to process.
     Need 172683531 numels in 55 params.
     added 0 buffers
     model.shared.weight full shape: torch.Size([50265, 768]) unpartitioned numel 38603520 
     model.encoder.embed_positions.weight full shape: torch.Size([1026, 768]) unpartitioned numel 787968 
     model.encoder.layers.0.self_attn.k_proj.weight full shape: torch.Size([768, 768]) unpartitioned numel 589824 
     model.encoder.layers.0.self_attn.k_proj.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.encoder.layers.0.self_attn.v_proj.weight full shape: torch.Size([768, 768]) unpartitioned numel 589824 
     model.encoder.layers.0.self_attn.v_proj.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.encoder.layers.0.self_attn.q_proj.weight full shape: torch.Size([768, 768]) unpartitioned numel 589824 
     model.encoder.layers.0.self_attn.q_proj.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.encoder.layers.0.self_attn.out_proj.weight full shape: torch.Size([768, 768]) unpartitioned numel 589824 
     model.encoder.layers.0.self_attn.out_proj.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.encoder.layers.0.self_attn_layer_norm.weight full shape: torch.Size([768]) unpartitioned numel 768 
     model.encoder.layers.0.self_attn_layer_norm.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.encoder.layers.0.fc1.weight full shape: torch.Size([3072, 768]) unpartitioned numel 2359296 
     model.encoder.layers.0.fc1.bias full shape: torch.Size([3072]) unpartitioned numel 3072 
     model.encoder.layers.0.fc2.weight full shape: torch.Size([768, 3072]) unpartitioned numel 2359296 
     model.encoder.layers.0.fc2.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.encoder.layers.0.final_layer_norm.weight full shape: torch.Size([768]) unpartitioned numel 768 
     model.encoder.layers.0.final_layer_norm.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.encoder.layernorm_embedding.weight full shape: torch.Size([768]) unpartitioned numel 768 
     model.encoder.layernorm_embedding.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.decoder.embed_positions.weight full shape: torch.Size([1026, 768]) unpartitioned numel 787968 
     model.decoder.layers.0.self_attn.k_proj.weight full shape: torch.Size([768, 768]) unpartitioned numel 589824 
     model.decoder.layers.0.self_attn.k_proj.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.decoder.layers.0.self_attn.v_proj.weight full shape: torch.Size([768, 768]) unpartitioned numel 589824 
     model.decoder.layers.0.self_attn.v_proj.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.decoder.layers.0.self_attn.q_proj.weight full shape: torch.Size([768, 768]) unpartitioned numel 589824 
     model.decoder.layers.0.self_attn.q_proj.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.decoder.layers.0.self_attn.out_proj.weight full shape: torch.Size([768, 768]) unpartitioned numel 589824 
     model.decoder.layers.0.self_attn.out_proj.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.decoder.layers.0.self_attn_layer_norm.weight full shape: torch.Size([768]) unpartitioned numel 768 
     model.decoder.layers.0.self_attn_layer_norm.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.decoder.layers.0.encoder_attn.k_proj.weight full shape: torch.Size([768, 768]) unpartitioned numel 589824 
     model.decoder.layers.0.encoder_attn.k_proj.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.decoder.layers.0.encoder_attn.v_proj.weight full shape: torch.Size([768, 768]) unpartitioned numel 589824 
     model.decoder.layers.0.encoder_attn.v_proj.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.decoder.layers.0.encoder_attn.q_proj.weight full shape: torch.Size([768, 768]) unpartitioned numel 589824 
     model.decoder.layers.0.encoder_attn.q_proj.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.decoder.layers.0.encoder_attn.out_proj.weight full shape: torch.Size([768, 768]) unpartitioned numel 589824 
     model.decoder.layers.0.encoder_attn.out_proj.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.decoder.layers.0.encoder_attn_layer_norm.weight full shape: torch.Size([768]) unpartitioned numel 768 
     model.decoder.layers.0.encoder_attn_layer_norm.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.decoder.layers.0.fc1.weight full shape: torch.Size([3072, 768]) unpartitioned numel 2359296 
     model.decoder.layers.0.fc1.bias full shape: torch.Size([3072]) unpartitioned numel 3072 
     model.decoder.layers.0.fc2.weight full shape: torch.Size([768, 3072]) unpartitioned numel 2359296 
     model.decoder.layers.0.fc2.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.decoder.layers.0.final_layer_norm.weight full shape: torch.Size([768]) unpartitioned numel 768 
     model.decoder.layers.0.final_layer_norm.bias full shape: torch.Size([768]) unpartitioned numel 768 
     model.decoder.layernorm_embedding.weight full shape: torch.Size([768]) unpartitioned numel 768 
     model.decoder.layernorm_embedding.bias full shape: torch.Size([768]) unpartitioned numel 768 
     lm_head.weight full shape: torch.Size([50265, 768]) unpartitioned numel 38603520 
     lm_head.bias full shape: torch.Size([50265]) unpartitioned numel 50265 
     si_lm_head.weight full shape: torch.Size([50265, 768]) unpartitioned numel 38603520 
     si_lm_head.bias full shape: torch.Size([50265]) unpartitioned numel 50265 
     dg_lm_head.weight full shape: torch.Size([50265, 768]) unpartitioned numel 38603520 
     dg_lm_head.bias full shape: torch.Size([50265]) unpartitioned numel 50265 
     
     Traceback (most recent call last):
       File "zero_to_fp32.py", line 366, in <module>
         convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir, args.output_file)
       File "zero_to_fp32.py", line 304, in convert_zero_checkpoint_to_fp32_state_dict
         state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
       File "zero_to_fp32.py", line 290, in get_fp32_state_dict_from_zero_checkpoint
         return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir)
       File "zero_to_fp32.py", line 235, in _get_fp32_state_dict_from_zero_checkpoint
         f"consumed {offset} numels out of {avail_numel} - something is wrong")
     ValueError: consumed 172683532 numels out of 172683536 - something is wrong
    
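For reference, the "Need 172683531 numels in 55 params" line is just the sum of the model’s parameter numels. A sketch of that count, under the assumption of a bart-base-sized config reduced to one encoder and one decoder layer (which matches the shapes in the output above), with MyBart as defined in the snippet earlier:

    from transformers import BartConfig

    # assumption: bart-base geometry (d_model=768, vocab=50265) with the layer
    # count reduced to 1 encoder + 1 decoder layer, matching the log above
    config = BartConfig.from_pretrained("facebook/bart-base",
                                        encoder_layers=1,
                                        decoder_layers=1)
    model = MyBart(config)  # MyBart as defined in the earlier snippet

    needed = sum(p.numel() for p in model.parameters())
    print(needed)  # should match the "Need 172683531 numels in 55 params" line

If that matches, the few extra numels on the flat-group side are presumably the ZeRO-2 alignment padding that the reconstruction has to skip, i.e. the same class of problem as the uneven-total case in https://github.com/microsoft/DeepSpeed/pull/1372.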

@tjruwase, any chance you guys can take over the maintenance of zero_to_fp32? I can’t keep up with the changes in the core, and it would be best if any changes you make there get synced with the weight re-consolidation code in tandem.

I created the initial tests here: https://github.com/microsoft/DeepSpeed/blob/10b48405ab0e69a1caf455f445526c4b39b0dbb8/tests/unit/test_zero.py#L135

So it’s probably a matter of extending those to cover the new variations that DeepSpeed has since been extended to support.

As I suggested earlier, perhaps the reconstruction code should be moved alongside the partitioning code for each stage, so that they live in the same file and are easier to keep in sync.

So: here is how the model gets partitioned, and here is how it gets unpartitioned, one right next to the other.

Thank you.