unsloth: Gemma 2B LoRA merging is broken
I’ve trained Gemma 2B in 16bit with LoRA. With the adapters loaded separately, everything works just fine, but after merging the adapters the model becomes literally unusable.
On the screenshot: model is the PEFT model with the adapters attached as-is, while model2 is the model with the adapters merged.
Here is the code used to load the models:
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

# Load the trained PEFT model with the adapters still attached
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "trained",  # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = 8192,
    dtype = None,
    load_in_4bit = False,
    resize_model_vocab = 256001,
)
tokenizer.add_special_tokens({'additional_special_tokens': ['<|im_start|>']})
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "chatml",
    map_eos_token = True,
)
FastLanguageModel.for_inference(model)

# Merge the adapters and save the merged 16-bit model
model.save_pretrained_merged("merged", tokenizer, save_method = "merged_16bit")

# Reload the merged model
model2, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "merged",  # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = 8192,
    dtype = None,
    load_in_4bit = False,
    # resize_model_vocab = 32001,
)
FastLanguageModel.for_inference(model2)
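For context, this is roughly how I compare the two models; the prompt and generation settings below are placeholders, not the exact ones from the screenshot:

# Rough comparison sketch (hypothetical prompt and generation settings).
messages = [{"role": "user", "content": "Hello, who are you?"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

for name, m in [("adapters", model), ("merged", model2)]:
    out = m.generate(input_ids = input_ids, max_new_tokens = 64, use_cache = True)
    print(name, tokenizer.decode(out[0][input_ids.shape[1]:], skip_special_tokens = False))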
The model was trained with the ChatML format, hence the token-adding code. The resize_model_vocab parameter is a workaround I added to load a vocabulary of a different size.
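For reference, the vocab ended up at 256001 because I added the ChatML start token before training, roughly like this (standard transformers calls, not my exact training script):

# How the vocab grew to 256001 during training (sketch, standard transformers calls).
tokenizer.add_special_tokens({'additional_special_tokens': ['<|im_start|>']})
model.resize_token_embeddings(len(tokenizer))       # 256000 -> 256001 rows
print(model.get_input_embeddings().weight.shape)    # torch.Size([256001, hidden_size])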
Also, the saved adapters weigh 6 GB - is that alright? Note that the merged model is 5.7 GB. I believe the adapters should be a hundred MB tops (maybe a GB with the saved vocab and lm_head), but presumably the whole model got saved.
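Something like the following (just a sketch, assuming the adapters were saved to trained/adapter_model.safetensors) lists the largest tensors in the adapter file, which should show whether full embed_tokens / lm_head copies are what inflate it:

# Sketch: list the largest tensors in the saved adapter file to see
# whether full embed_tokens / lm_head copies are inflating it to ~6 GB.
# Assumes the adapters live in "trained/adapter_model.safetensors".
from safetensors import safe_open

with safe_open("trained/adapter_model.safetensors", framework="pt") as f:
    sizes = []
    for key in f.keys():
        t = f.get_tensor(key)
        sizes.append((t.numel() * t.element_size(), key, tuple(t.shape)))
    for nbytes, key, shape in sorted(sizes, reverse=True)[:10]:
        print(f"{nbytes / 1e6:8.1f} MB  {key}  {shape}")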
Important note: during training I passed ["embed_tokens", "lm_head"] in the modules_to_save param to train the new ChatML tokens. I am not sure how that plays with the fact that Gemma’s embed_tokens weights are tied with lm_head (I believe?). Maybe that’s actually the reason why merging fails, i.e. you have to pass only embed_tokens, otherwise everything breaks (just a hypothesis).
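A quick way to check the tying part is to compare storage pointers on the reloaded merged model (just a sketch; equal data_ptr() values mean the two modules share the same weight tensor):

# Sketch: check whether Gemma's input embeddings and lm_head share storage.
emb  = model2.get_input_embeddings().weight
head = model2.get_output_embeddings().weight
print("tied:", emb.data_ptr() == head.data_ptr())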
Dependencies:
accelerate 0.27.2
datasets 2.16.1
huggingface-hub 0.20.3
ipykernel 6.29.0
ipython 8.15.0
jedi 0.18.1
Jinja2 3.1.2
numpy 1.26.2
peft 0.8.2
safetensors 0.4.2
scikit-learn 1.4.0
scipy 1.12.0
sentence-transformers 2.3.1
sentencepiece 0.1.99
sympy 1.12
tensorboardX 2.6.2.2
torch 2.1.2
torchaudio 2.1.2
torchelastic 0.2.2
torchvision 0.16.2
transformers 4.38.1
triton 2.1.0
trl 0.7.11
unsloth 2024.3
xformers 0.0.23.post1
Ok I finally fixed it! I took your advice, rewrote the kernels, and isolated it out. Hopefully GGUF saving works now (and merged 16bit does too).
Can confirm that merging in 16bit now works fine. No more degenerate outputs.
I guess we can attribute the difference in responses to rounding errors during the LoRA merge (I’ve seen it with other models as well) - I’m good with that. A toy illustration of what I mean is sketched below.
Thanks for the fix, well done!
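For anyone curious, here is a toy illustration of the rounding effect I mean - not Unsloth’s actual merge code. Merging bakes scale * B @ A into W once in 16-bit, so the result is rounded at a different point than when the adapter is applied separately:

# Toy illustration of merged-vs-adapter rounding drift (not Unsloth's merge code).
import torch

torch.manual_seed(0)
W = torch.randn(256, 256, dtype = torch.bfloat16)   # frozen base weight
A = torch.randn(8, 256,  dtype = torch.bfloat16)    # LoRA A (r x in_features)
B = torch.randn(256, 8,  dtype = torch.bfloat16)    # LoRA B (out_features x r)
x = torch.randn(256, dtype = torch.bfloat16)
scale = 16 / 8                                       # lora_alpha / r

y_adapter = W @ x + scale * (B @ (A @ x))            # adapters applied on the fly
y_merged  = (W + scale * (B @ A)) @ x                # adapters merged into W first
print((y_adapter - y_merged).abs().max())            # small but non-zero difference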
@oKatanaaa Oh my - thanks so so much for all the debugging - extremely appreciate it!! I just woke up, so many apologies for missing the convo - I was gonna say it’s ironic that I was fixing Gemma bugs but didn’t check Unsloth’s own issues!! 😆
Great that you found the +1 culprit - I actually totally forgot to subtract 1 during merging - but if, according to your analysis, adding 1 and then subtracting 1 reduces accuracy, I’ll just copy-paste the kernel and add 1 - I’ll do that in minutes and push it in 😃
On the saving modules - interesting - I have never interacted with saving modules since I normally only finetune the rest and leave the lm_head and embedding matrix alone. I shall investigate this later today!!
Again, thanks so much for the help - extremely appreciate it! I’ll @ you in the fix 😃
Alright, I added this stupid ass fix (in unsloth_save_model) and now everything works fine. The outputs are not exactly the same, but that’s way better than before.
Hope that helps 🫡