gisting: Unable to reproduce LLaMA-7B results when training from scratch
Hi,
I was trying to reproduce the LLaMA-7B results with 1 gist token from scratch, following the training instructions in the README. I ran the script below on 4 A100-80GB GPUs:
```bash
TAG="train80g"
port=$(shuf -i25000-30000 -n1)
deepspeed --master_port $port --num_gpus=4 --no_local_rank \
    --module src.train \
    +model=llama-7b wandb.tag=$TAG \
    training.deepspeed=ds_configs/stage3.json \
    training.gist.condition=gist \
    training.gist.num_gist_tokens=1
```
However, the final results after 3 epochs are much lower than those reported in the paper: I got ROUGE-L scores of 51.24 (seen), 42.01 (unseen), and 19.00 (human). Training for more epochs did not improve the unseen and human ROUGE-L results. I did not change anything in the training config other than the wandb account.
I also evaluated the 3 provided checkpoints (gist, pos_control, neg_control), and the results are consistent with the paper (< 0.1 ROUGE-L difference) for all of them, so the evaluation code appears to be working correctly. Could you help double-check whether the training setup above is correct, and do you have any suggestions on how to reproduce the LLaMA results in the paper?
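For context, the kind of ROUGE-L comparison I'm describing is roughly the following (a simplified sketch using the `rouge_score` package rather than the repo's actual eval pipeline; the example strings are placeholders):

```python
# Rough ROUGE-L sanity check, independent of the repo's evaluation code.
# Preprocessing here (stemming, aggregation) may differ from the paper's setup.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

# Placeholder data: in practice these would be model outputs and gold completions.
predictions = ["the model output for one example"]
references = ["the gold completion for that example"]

scores = [
    scorer.score(ref, pred)["rougeL"].fmeasure
    for ref, pred in zip(references, predictions)
]
print(f"mean ROUGE-L: {100 * sum(scores) / len(scores):.2f}")
```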
About this issue
- State: closed
- Created a year ago
- Comments: 23 (9 by maintainers)
Hi @jayelm, I don't have further questions, and I appreciate the great idea and the great code, from which I learned a lot!
Apologies for the late response; I've been on vacation 😄
Glad to hear the results replicate, and thanks so much for looking so closely into this! This will help a lot with reproducibility.
Super weird that the DeepSpeed version leads to such drastic performance differences. I’ll make a note of this in the repo when I get back from vacation.
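In the meantime, it's probably worth logging the exact library versions alongside each run so setups can be compared; something like this quick sketch works (the package list here is illustrative, not a pinned set):

```python
# Record the installed versions of the libraries most likely to affect training.
# This doesn't pin any "known good" versions; it just makes runs comparable.
import importlib.metadata as md

for pkg in ["deepspeed", "torch", "transformers", "accelerate"]:
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")
```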
No, that is the only change required!
Aha, this was also the case in the original Alpaca codebase: it specified a cosine LR scheduler, but the LR wasn't actually changing, at least according to wandb. I didn't look too closely into this. So even if it's not entirely correct, this is expected, and I observed the same thing in my experiments.
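If anyone wants to check the scheduler behavior in isolation (rather than via the wandb charts), a rough standalone sketch like the one below shows whether a cosine schedule is actually stepping; the model, LR, and step counts here are made up, not taken from the training config:

```python
# Standalone check that a cosine schedule actually changes the learning rate.
# All names and numbers below are illustrative, not from the repo's config.
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(10, 10)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000
)

for step in range(1000):
    optimizer.step()
    scheduler.step()
    if step % 200 == 0:
        # If the schedule is wired up correctly, this value should warm up and then decay.
        print(step, scheduler.get_last_lr()[0])
```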