gisting: Unable to reproduce LLaMA-7B results when training from scratch
Hi,
I was trying to reproduce the LLaMA-7B results with 1 gist token from scratch, following the training instructions in the README. I ran the script below on 4 A100-80GB GPUs:
```bash
TAG="train80g"
port=$(shuf -i25000-30000 -n1)
deepspeed --master_port $port --num_gpus=4 --no_local_rank \
    --module src.train \
    +model=llama-7b wandb.tag=$TAG \
    training.deepspeed=ds_configs/stage3.json \
    training.gist.condition=gist \
    training.gist.num_gist_tokens=1
```
However, the final results after 3 epochs are much lower than those reported in the paper: I got ROUGE-L scores of 51.24 (seen), 42.01 (unseen), and 19.00 (human). Training for more epochs did not improve the unseen and human ROUGE-L results. I did not change anything in the training config other than the wandb account.
I also evaluated the 3 provided checkpoints (gist, pos_control, neg_control), and the results are consistent with the paper (< 0.1 ROUGE-L difference) for all of them, so the evaluation code appears to be working correctly. Could you help double-check whether the training setup above is correct, and do you have any suggestions on how to reproduce the LLaMA results in the paper?
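For context, the kind of ROUGE-L comparison I'm describing is roughly the following (a simplified sketch using the `rouge_score` package rather than the repo's actual eval pipeline; the example strings are placeholders):

```python
# Rough ROUGE-L sanity check, independent of the repo's evaluation code.
# Preprocessing here (stemming, aggregation) may differ from the paper's setup.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

# Placeholder data: in practice these would be model outputs and gold completions.
predictions = ["the model output for one example"]
references = ["the gold completion for that example"]

scores = [
    scorer.score(ref, pred)["rougeL"].fmeasure
    for ref, pred in zip(references, predictions)
]
print(f"mean ROUGE-L: {100 * sum(scores) / len(scores):.2f}")
```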
About this issue
- State: closed
- Created a year ago
- Comments: 23 (9 by maintainers)
Hi @jayelm, I don't have further questions, and I appreciate the great idea and the great code, from which I learned a lot!
Apologies for the late response; I've been on vacation 😄
Glad to hear the results replicate, and thanks so much for looking so closely into this! This will help a lot with reproducibility.
Super weird that the DeepSpeed version leads to such drastic performance differences. I’ll make a note of this in the repo when I get back from vacation.
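In the meantime, it's probably worth logging the exact library versions alongside each run so setups can be compared; something like this quick sketch works (the package list here is illustrative, not a pinned set):

```python
# Record the installed versions of the libraries most likely to affect training.
# This doesn't pin any "known good" versions; it just makes runs comparable.
import importlib.metadata as md

for pkg in ["deepspeed", "torch", "transformers", "accelerate"]:
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")
```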
No, that is the only change required!
Aha, this was also the case in the original Alpaca codebase: it specified a cosine LR scheduler, but the LR wasn't actually changing, at least according to wandb. I didn't look too closely into this. So even if it's not entirely correct, this is expected, and I observed the same thing in my experiments.
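If anyone wants to check the scheduler behavior in isolation (rather than via the wandb charts), a rough standalone sketch like the one below shows whether a cosine schedule is actually stepping; the model, LR, and step counts here are made up, not taken from the training config:

```python
# Standalone check that a cosine schedule actually changes the learning rate.
# All names and numbers below are illustrative, not from the repo's config.
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(10, 10)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000
)

for step in range(1000):
    optimizer.step()
    scheduler.step()
    if step % 200 == 0:
        # If the schedule is wired up correctly, this value should warm up and then decay.
        print(step, scheduler.get_last_lr()[0])
```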