DeepSpeed: zero_to_fp32.py still imports a wrong model after fix
Same as #1165; that issue is closed now, so I'm opening a new one. It looks like zero_to_fp32.py still cannot load the correct weights under a multi-GPU, ZeRO-2 setting.
Loading 'mp_rank_00_model_states.pt':

states['module']['deberta.encoder.layer.0.output.dense.weight']
tensor([[-0.0211,  0.0068,  0.0206,  ...,  0.0057,  0.0316,  0.0256],
        [ 0.0273,  0.0141,  0.0118,  ..., -0.0122,  0.0054,  0.0010],
        [ 0.0479, -0.0237, -0.0604,  ..., -0.0340, -0.0183,  0.0691],
        ...,
        [ 0.0270, -0.0231,  0.0218,  ...,  0.0563,  0.0641, -0.0094],
        [-0.0563, -0.0837, -0.0427,  ...,  0.0242, -0.0132, -0.0512],
        [-0.0012,  0.0064,  0.0465,  ...,  0.0219,  0.0259, -0.0281]],
       device='cuda:0', dtype=torch.float16)
Loading the exported weights using load_state_dict_from_zero_checkpoint:
(Pdb) self.deberta.encoder.layer[0].output.dense.weight
Parameter containing:
tensor([[ 0.0207, -0.0448,  0.0022,  ...,  0.0406, -0.0338, -0.0174],
        [-0.0577, -0.0648,  0.0404,  ...,  0.0108, -0.0167, -0.0100],
        [ 0.0548,  0.0063,  0.0024,  ...,  0.0311,  0.0249,  0.0167],
        ...,
        [-0.0081,  0.0194, -0.0266,  ..., -0.0269, -0.0002,  0.0257],
        [ 0.0202, -0.0002,  0.0831,  ..., -0.0008, -0.0094,  0.0258],
        [-0.0320,  0.0529, -0.0259,  ...,  0.0117, -0.0292, -0.0064]],
I'm using DeepSpeed 0.4.5. I can confirm the problem happens on 2 GPUs and not on 1 GPU.
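For reference, here is a minimal sketch of the kind of comparison shown above, assuming a ZeRO-2 checkpoint saved via engine.save_checkpoint(). The checkpoint directory and tag names are hypothetical placeholders; the parameter key is taken from the dumps above.

import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

ckpt_dir = "output"      # hypothetical dir passed to engine.save_checkpoint()
tag = "global_step1000"  # hypothetical checkpoint tag (subdirectory name)
key = "deberta.encoder.layer.0.output.dense.weight"

# fp16 weights as written by rank 0 into the model-states shard.
shard = torch.load(f"{ckpt_dir}/{tag}/mp_rank_00_model_states.pt", map_location="cpu")
w_shard = shard["module"][key].float()

# fp32 weights reconstructed from the ZeRO partitions by the consolidation
# helper that backs zero_to_fp32.py.
fp32_sd = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir, tag=tag)
w_consolidated = fp32_sd[key]

# With a correct consolidation these should agree up to fp16 rounding.
print(torch.allclose(w_shard, w_consolidated, atol=1e-3))

If the two disagree beyond fp16 rounding, the consolidation path is the suspect, which is consistent with the mismatched dumps above.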
About this issue
- State: closed
- Created 3 years ago
- Comments: 29 (26 by maintainers)
@stas00 I'm sorry, I may have made a mistake. Running with the -sv args didn't print any traceback information, so at the time I didn't realize what caused the failure. Now I've tried the -v args and printed the traceback message, as follows. It seems to be related to a permission error rather than a bug in DeepSpeed.

@stas00 Sorry for the late response. I tried to tweak the model as below to reproduce the total model params of 50265, and ran the test as you suggested. It seems that the test failed. Should I make a pull request?
@stas00 Thanks for your instructions. I did the following test and hope it may help you reproduce the bug.
I used the BartPretrainedModel class (which MyBart subclasses), and everything works well. So I guess it is probably the way I implement MyBart that produces the error. I reduced this model to 2 layers and the error still exists.
code of MyBart
output of zero_to_fp32.py
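The reporter's actual MyBart code and the zero_to_fp32.py output are in the collapsed sections referenced above and are not reproduced here. Purely to illustrate the shape of the reduced reproduction described (a 2-layer subclass of BartPretrainedModel), here is a hypothetical minimal sketch; the class body, head, and config values are invented and are not the reporter's code.

import torch.nn as nn
from transformers import BartConfig, BartModel, BartPretrainedModel

class MyBart(BartPretrainedModel):
    """Hypothetical stand-in for the reporter's MyBart: a BartPretrainedModel
    subclass wrapping BartModel with a small projection head."""
    def __init__(self, config):
        super().__init__(config)
        self.model = BartModel(config)
        self.head = nn.Linear(config.d_model, config.vocab_size, bias=False)
        self.post_init()  # weight-init hook in recent transformers releases

    def forward(self, input_ids, attention_mask=None):
        hidden = self.model(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(hidden)

# Reduced configuration, mirroring the "reduce to 2 layers" step described above.
# vocab_size is left at BartConfig's default of 50265.
config = BartConfig(
    encoder_layers=2,
    decoder_layers=2,
    d_model=64,
    encoder_attention_heads=2,
    decoder_attention_heads=2,
    encoder_ffn_dim=128,
    decoder_ffn_dim=128,
)
model = MyBart(config)

Wrapping a model like this with deepspeed.initialize under a ZeRO-2 config on 2 GPUs is the setting in which the reporter observes the mismatch.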
@tjruwase, any chance you guys can take over the maintenance of zero_to_fp32, as I can't keep up with the changes in the core? It's best that any changes you make get synced with the weight re-consolidation code in tandem.
I created the initial tests here: https://github.com/microsoft/DeepSpeed/blob/10b48405ab0e69a1caf455f445526c4b39b0dbb8/tests/unit/test_zero.py#L135
So it's probably a matter of extending those to cover the new variations that DeepSpeed has been extended to support.
As I suggested earlier, perhaps the reconstruction code should be moved alongside the partitioning code for each stage, so that they live in the same file and are easier to maintain in sync. So: here is how the model gets partitioned, and here is how it gets unpartitioned, one right next to the other.
Thank you.