gpt-neox: Problems on generating with llama model

Hi, I tried loading the llama model for inference and ran into some problems. I am using 4 V100 GPUs with a model parallel size of 4 to load the llama 7B checkpoint.

  1. Error when loading the llama checkpoint. The converted checkpoint produced by the script tools/convert_raw_llama_weights_to_neox.py contains no optimizer states, but DeepSpeed keeps trying to load them even when I set the finetune flag to true. The cause seems to be here: regardless of whether the files actually exist, DeepSpeed returns the list of optimizer-state files. I fixed this by adding a line that checks whether the files exist and returns None if they do not (a rough sketch of the check is at the end of this post).

  2. A tensor shape mismatch occurred during inference. I fixed it by changing the line here, where

attention_mask = attention_mask[
                    ..., : attention_scores.size(3), : attention_scores.size(3)
                ]

is changed to

attention_mask = attention_mask[
                    ..., : attention_scores.size(2), : attention_scores.size(3)
                ]
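
The reason the first index needs attention_scores.size(2) is that during incremental decoding the query length (dimension 2) and the key length (dimension 3) of attention_scores differ, so slicing both dimensions with size(3) yields a mask of the wrong shape. A minimal, standalone illustration (the shapes are hypothetical, chosen only to show the mismatch):

import torch

# Hypothetical shapes during incremental decoding with a key/value cache:
# one new query token attends over 8 cached key positions.
batch, heads, query_len, key_len = 1, 32, 1, 8
attention_scores = torch.randn(batch, heads, query_len, key_len)

# Causal mask built for the full context.
full_mask = torch.tril(torch.ones(1, 1, key_len, key_len, dtype=torch.bool))

# Original slice: size(3) on both dims -> (1, 1, 8, 8), which no longer
# matches attention_scores of shape (1, 32, 1, 8) in the query dimension.
bad_mask = full_mask[..., : attention_scores.size(3), : attention_scores.size(3)]

# Fixed slice: query dim uses size(2), key dim uses size(3) -> (1, 1, 1, 8).
good_mask = full_mask[..., : attention_scores.size(2), : attention_scores.size(3)]

print(bad_mask.shape, good_mask.shape)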

I wonder whether my fixes are correct, or whether there are better ways to handle this. I suspect I am only treating the symptoms of the problem rather than its causes.
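
As for point 1, the check I added is roughly of the following shape. This is only a sketch of the idea: the helper name and file-naming scheme below are illustrative, not the real DeepSpeed identifiers.

import os

def _optimizer_state_paths(load_dir, tag, mp_rank):
    # Illustrative stand-in for the DeepSpeed helper that builds the list of
    # optimizer-state checkpoint files; the real function name and naming
    # scheme differ.
    filename = "mp_rank_{:02d}_optim_states.pt".format(mp_rank)
    return [os.path.join(load_dir, str(tag), filename)]

def optimizer_state_paths_or_none(load_dir, tag, mp_rank):
    paths = _optimizer_state_paths(load_dir, tag, mp_rank)
    # Added guard: the converted llama checkpoint ships no optimizer states,
    # so return None when the files are missing instead of handing a list of
    # non-existent paths back to the loader.
    if not all(os.path.isfile(p) for p in paths):
        return None
    return paths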

About this issue

  • State: open
  • Created a year ago
  • Comments: 21 (3 by maintainers)

Most upvoted comments

So it sounds like this issue is a combination of two other issues:

  1. Our recurring issues with generation in GPT-NeoX
  2. The fact that we don’t currently support the SPM Tokenizer.

If that’s the case, I think it probably makes sense to close this issue as both of those are known problems we are currently working on.

Let's keep this issue open until those issues are resolved. We'll add a "Fixes xxx" clause to whatever PR fixes things so that this issue is auto-closed.

Thanks DaoD. This project has already converted the checkpoint into Megatron/GPT-NeoX format. I'm curious how you used HF for validation.

I do not use GPT-NeoX's inference code, since it seems to have some minor problems. I have only tested the HF version, and it works well.

I just use the NeoX format for training/fine-tuning. After training, I convert the checkpoint into the HF version for inference/testing.
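
A minimal sketch of that HF-side inference step, using the standard transformers API; the model path is a placeholder for wherever the converted checkpoint was written:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: directory containing the NeoX -> HF converted llama-7b weights.
model_path = "./llama-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))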