litgpt: Bug: Finetuning on multi-GPU (FSDP) does not initialize with the foundation model
When experimenting with adapting Falcon on multiple GPUs with finetune/lora.py, we got surprisingly bad results. After investigating, we realized we were actually training a randomly initialized model (and since only the LoRA weights are checkpointed, the model trained from scratch was simply lost…).
In other words, the foundation model (Falcon) was not loaded properly.
It seems to be due to the use of fabric.init_module(empty_init=True) at this line:
https://github.com/Lightning-AI/lit-gpt/blob/bf60124fa72a56436c7d4fecc093c7fc48e84433/finetune/lora.py#L128
If we use empty_init=False, it trains correctly, but I am not sure that is the right fix.
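For context, here is a minimal, self-contained sketch of the code path in question, not the actual finetune/lora.py: TinyModel and ckpt.pth are placeholders standing in for lit-gpt's GPT(config) and the converted Falcon checkpoint. The model is created under fabric.init_module(empty_init=...) and the foundation checkpoint is loaded into it afterwards with strict=False, since the LoRA parameters are new and absent from the checkpoint.

```python
import torch
import torch.nn as nn
from lightning.fabric import Fabric


class TinyModel(nn.Module):
    """Stand-in for lit-gpt's GPT(config) with LoRA layers added."""

    def __init__(self) -> None:
        super().__init__()
        self.base = nn.Linear(8, 8)  # "foundation" weight, present in the checkpoint
        self.lora = nn.Linear(8, 8)  # "adapter" weight, NOT in the checkpoint


# Pretend this is the converted foundation checkpoint (base weights only).
torch.save({"base.weight": torch.randn(8, 8), "base.bias": torch.zeros(8)}, "ckpt.pth")

fabric = Fabric(accelerator="cpu", devices=1)  # the report uses 4 GPUs with FSDP
fabric.launch()

# empty_init=True allocates parameters without initializing them and relies entirely
# on the subsequent load_state_dict to fill them in; the report is that under FSDP
# that load does not take effect, so training starts from uninitialized memory.
# empty_init=False performs a normal initialization first, which is why it appears to fix it.
with fabric.init_module(empty_init=False):
    model = TinyModel()

checkpoint = torch.load("ckpt.pth", map_location="cpu")
# strict=False: the LoRA parameters are legitimately missing from the foundation checkpoint.
model.load_state_dict(checkpoint, strict=False)

model = fabric.setup(model)
```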
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Comments: 16 (5 by maintainers)
I also tried to do LoRA fine-tuning.
The latest code from the main branch, the latest packages.
I used 4x A10G (the best I have) with bf16-true precision. I downloaded and converted the model (Falcon-7B) as described in tutorials/download_falcon.md, then ran the prepare script for the Alpaca dataset with this model. The only change in finetune/lora.py is to the validate call in the train function, to make the log more concise. This is what I got when I tried to fine-tune:
The loss values were exactly the same for empty_init=True, empty_init=False, and empty_init=(device > 1).

Also, please share the printed output of these two lines:
I have a similar issue. If I set init_weight = True, the loss is around 8 or 9, which means the model doesn't load the checkpoint successfully. I use torchrun to initiate the program, by the way; I cannot use python3 xxx.py because my machine is set up differently.
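As a quick sanity check (a hedged sketch, not the two lines the maintainer refers to above; model and checkpoint_path are placeholders), one can compare a few parameters of the in-memory model against the checkpoint on disk, after load_state_dict but before fabric.setup(...) shards them:

```python
import torch


def check_foundation_loaded(model: torch.nn.Module, checkpoint_path: str, num_tensors: int = 3) -> None:
    """Print the max absolute difference between a few model parameters and the checkpoint.

    Call this after load_state_dict but before fabric.setup(...), i.e. before FSDP
    shards/flattens the parameters. A tiny difference means the weights were loaded;
    a large one means the model is still randomly initialized.
    """
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    checked = 0
    for name, param in model.named_parameters():
        if name not in checkpoint:
            continue  # e.g. LoRA adapter weights are not in the foundation checkpoint
        diff = (param.detach().cpu().float() - checkpoint[name].float()).abs().max().item()
        print(f"{name}: max |model - checkpoint| = {diff:.3e}")
        checked += 1
        if checked >= num_tensors:
            break
```

With bf16-true the difference will not be exactly zero because of the precision cast, but it should be orders of magnitude smaller than for randomly initialized weights.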