transformers: [Deepspeed ZeRO-3] Broken model save on fresh Transformers branch

I have my own model, which uses two T5 encoders, and I train it via DeepSpeed. It has its own save_pretrained() and from_pretrained() methods, which implement custom load/save logic: https://github.com/exelents/try_t5_siamese/blob/4140194978ac113c45e7370f40b3d9b932d0b35b/siamese_model.py#L80
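For readers who don't want to open the repo, here is a minimal sketch of the kind of custom save/load logic meant above - two encoders stored in left/ and right/ subfolders. The class and method bodies are illustrative only; the real implementation lives in siamese_model.py:

```python
# Illustrative sketch only -- not the actual code from siamese_model.py.
import os
import torch
from transformers import T5EncoderModel

class SiameseT5(torch.nn.Module):
    def __init__(self, left: T5EncoderModel, right: T5EncoderModel):
        super().__init__()
        self.left = left
        self.right = right

    def save_pretrained(self, save_directory: str):
        # Each encoder is saved with the regular Hugging Face save_pretrained(),
        # producing e.g. <save_directory>/left/pytorch_model.bin
        self.left.save_pretrained(os.path.join(save_directory, "left"))
        self.right.save_pretrained(os.path.join(save_directory, "right"))

    @classmethod
    def from_pretrained(cls, load_directory: str):
        left = T5EncoderModel.from_pretrained(os.path.join(load_directory, "left"))
        right = T5EncoderModel.from_pretrained(os.path.join(load_directory, "right"))
        return cls(left, right)
```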

When I run training and the trainer starts to save a checkpoint, something strange happens: the weights file for every saved encoder ends up only a few kilobytes - the weights are not actually saved. At the start of training the trainer tries to load a checkpoint using model.load_checkpoint(), but that function seems to have its own loading logic, because it cannot execute my custom load logic and throws an error: ValueError: [deepspeed] failed to resume from checkpoint ./templates/siamese-t5-small-v1_1-template. I can comment out the code that loads the checkpoint, but then I get the checkpoint-saving problem described above…
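For context, under ZeRO-3 every parameter is partitioned across ranks, so a plain state_dict() / save_pretrained() call only sees placeholder tensors - which matches the few-kilobyte files described above. Below is a minimal, hedged sketch of gathering the full weights before saving, assuming the standard deepspeed.zero.GatheredParameters context manager; `model` here stands for the unwrapped module, and the output path is illustrative:

```python
import torch
import deepspeed

# Inside this context the ZeRO-3 partitioned parameters are temporarily
# materialized in full, so a regular save writes real weights instead of
# a few-kilobyte placeholder file.
with deepspeed.zero.GatheredParameters(list(model.parameters())):
    if torch.distributed.get_rank() == 0:
        model.save_pretrained("./output_dir/checkpoint-full")  # illustrative path
```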

What should I do to save my own custom model properly? It worked a month ago, but today I refreshed my Transformers repo and everything broke.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 23 (12 by maintainers)

Most upvoted comments

@stas00 thanks! My problem is solved for now since I’m also using fp16 during fine-tuning so the current stage2 saves are good enough for me.

@samsontmr, would you kindly open a separate issue? While this is related, your use case is quite different. Please tag me and we will work on solving your use case there. Thank you!

p.s. also when you test, please make sure you are using transformers and deepspeed master, since fixes are constantly being merged into them.

As for me: I fixed my problem with the unnecessary checkpoint load, where I got the load error, but there is still a save error in DeepSpeed stage 3 mode. If you could help me, @stas00, I would appreciate it.

Here are the steps to reproduce my error with model save:

  • Clone this repo: https://github.com/exelents/try_t5_siamese

  • Extract folder “qasc” from this archive: https://drive.google.com/file/d/1gwvFiPzWW0JLr0XLS25PuG2Br5S4fPbR/view?usp=sharing

  • Go to the cloned repo folder and run ./create-siamese-template.sh - it will create a siamese NN from two t5-small encoders in the folder ./templates/siamese-t5-small-template

  • Then you can run ./run-siamese-small.sh - you will see normal behaviour: in the folder ./siamese_train_deepspeed/output_dir/ checkpoints are stored every 3 steps, and you can see a sign that the weights are stored: weight files like ./siamese_train_deepspeed/output_dir/checkpoint-6/left/pytorch_model.bin will be around a hundred megabytes in size.

  • Then, to see the problem, open ./run-siamese-small.sh, change “ds_config.json” to “ds_config_stage3.json”, and rerun training. You will see that weight files such as ./siamese_train_deepspeed/output_dir/checkpoint-6/left/pytorch_model.bin are only a few kilobytes in size, and you cannot load the model from that checkpoint. That is the problem, and it appears only when I turn on “stage 3” mode in the config (see the config sketch after this list).
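For reference, the stage-3 behaviour is controlled by the zero_optimization block of the DeepSpeed config. Below is a hedged sketch of what a ds_config_stage3.json could contain; it is not the repo's actual file, and the stage3_gather_fp16_weights_on_model_save flag (which asks DeepSpeed to consolidate the fp16 weights at save time, and is named stage3_gather_16bit_weights_on_model_save in newer DeepSpeed versions) is an assumption about the fix, not something taken from the repo:

```json
{
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_fp16_weights_on_model_save": true
  }
}
```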