litgpt: Bug: Finetuning on multi-GPU (FSDP) does not initialize with the foundation model

When experimenting with the adaptation of Falcon on multiple GPUs using finetune/lora.py, we got surprisingly bad results. After investigating, we realized that we were actually training a randomly initialized model (and since only the LoRA weights are checkpointed, that model trained from scratch was simply lost…).

In other words, the foundation model (Falcon) was not properly loaded. It seems to be due to the use of fabric.init_module(empty_init=True) at this line: https://github.com/Lightning-AI/lit-gpt/blob/bf60124fa72a56436c7d4fecc093c7fc48e84433/finetune/lora.py#L128. If we use empty_init=False, it trains correctly. I am not sure that is the right fix, though.
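For context, here is a minimal sketch of what the flag does, using a plain nn.Linear on a single CPU device in place of the Falcon GPT so it runs without any checkpoint (the multi-GPU/FSDP path adds deferred materialization on top of this, which is where the bug seems to live):

import torch
import lightning as L

fabric = L.Fabric(accelerator="cpu", devices=1)
fabric.launch()

# empty_init=True: parameters are allocated but their initialization is skipped,
# so they hold arbitrary values until a checkpoint overwrites them.
with fabric.init_module(empty_init=True):
    uninitialized = torch.nn.Linear(4, 4)

# empty_init=False: the regular PyTorch initialization runs, so the weights are
# at least sensible even before a checkpoint is loaded.
with fabric.init_module(empty_init=False):
    initialized = torch.nn.Linear(4, 4)

print(uninitialized.weight)  # arbitrary memory contents
print(initialized.weight)    # standard PyTorch init

The checkpoint load in finetune/lora.py is what is supposed to overwrite those empty weights; if it never actually reaches the FSDP-wrapped parameters, training silently starts from garbage, which matches what we observed.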

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 16 (5 by maintainers)

Most upvoted comments

I also tried to do LoRA fine-tuning.

The latest code from the main branch, the latest packages.

I used 4x A10G (the best I have) with bf16-true precision. I downloaded and converted the model (Falcon-7B) as described in tutorials/download_falcon.md, then ran the prepare script for the Alpaca dataset with this model. The only changes in finetune/lora.py are:

  1. devices=4
  2. max_iters=10
  3. commented out the validate call in the train function to make the log more concise.

This is what I got when I tried to fine-tune:

main ~/lit-gpt python finetune/lora.py --checkpoint_dir checkpoints/tiiuae/falcon-7b --precision bf16-true
{'eval_interval': 100, 'save_interval': 100, 'eval_iters': 100, 'eval_max_new_tokens': 100, 'log_interval': 1, 'devices': 4, 'learning_rate': 0.0003, 'batch_size': 128, 'micro_batch_size': 4, 'gradient_accumulation_iters': 32, 'max_iters': 10, 'weight_decay': 0.01, 'lora_r': 8, 'lora_alpha': 16, 'lora_dropout': 0.05, 'lora_query': True, 'lora_key': False, 'lora_value': True, 'lora_projection': False, 'lora_mlp': False, 'lora_head': False, 'warmup_steps': 100}
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

[rank: 0] Seed set to 1337
[rank: 1] Seed set to 1337
[rank: 3] Seed set to 1337
[rank: 2] Seed set to 1337
Loading model 'checkpoints/tiiuae/falcon-7b/lit_model.pth' with {'name': 'falcon-7b', 'hf_config': {'org': 'tiiuae', 'name': 'falcon-7b'}, 'block_size': 2048, 'vocab_size': 65024, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 1, 'shared_attention_norm': True, '_norm_class': 'LayerNorm', 'norm_eps': 1e-05, '_mlp_class': 'GptNeoxMLP', 'gelu_approximate': 'none', 'intermediate_size': 18176, 'rope_condense_ratio': 1, 'rope_base': 10000, 'r': 8, 'alpha': 16, 'dropout': 0.05, 'to_query': True, 'to_key': False, 'to_value': True, 'to_projection': False, 'to_mlp': False, 'to_head': False, 'head_size': 64, 'rope_n_elem': 64}
Number of trainable parameters: 3,506,176
Number of non trainable parameters: 7,217,189,760
[rank: 3] Seed set to 1340
[rank: 0] Seed set to 1337
[rank: 2] Seed set to 1339
[rank: 1] Seed set to 1338
The longest sequence length in the train data is 1079, the model's maximum sequence length is 1079 and context length is 2048
iter 1 step 0: loss 1.7293, iter time: 10214.60ms
iter 2 step 0: loss 2.5372, iter time: 5275.93ms
iter 3 step 0: loss 2.3912, iter time: 5251.75ms
iter 4 step 0: loss 2.3706, iter time: 5457.99ms
iter 5 step 0: loss 2.1239, iter time: 5294.34ms
iter 6 step 0: loss 2.3765, iter time: 5302.96ms
iter 7 step 0: loss 2.0163, iter time: 5307.21ms
iter 8 step 0: loss 1.8228, iter time: 5372.66ms
iter 9 step 0: loss 2.7029, iter time: 5237.35ms
iter 10 step 0: loss 2.1403, iter time: 5389.18ms
Training time: 58.26s
Memory used: 20.53 GB
Saving LoRA weights to 'out/lora/alpaca/lit_model_lora_finetuned.pth'

The loss values were exactly the same with empty_init=True, empty_init=False, and empty_init=(devices > 1).
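A quick way to check whether the foundation weights actually ended up in the model is to compare one tensor from the checkpoint on disk with the corresponding in-memory parameter just before the training loop. A rough sketch, meant to be pasted into finetune/lora.py right after the load_state_dict call (transformer.wte.weight is an assumed lit-gpt parameter name, and running with devices=1 keeps the state dict unsharded so the comparison stays simple):

import torch

checkpoint_path = "checkpoints/tiiuae/falcon-7b/lit_model.pth"
checkpoint = torch.load(checkpoint_path, map_location="cpu")
on_disk = checkpoint["transformer.wte.weight"]

# `model` is the GPT instance built inside finetune/lora.py.
in_memory = model.state_dict()["transformer.wte.weight"].detach().cpu()

# False here means the foundation model was not loaded and training starts from
# random (or uninitialized) weights.
print(torch.equal(on_disk, in_memory.to(on_disk.dtype)))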

I have a similar issue. If I set empty_init=True, the loss is around 8 or 9, which means the model doesn't load the checkpoint successfully. By the way, I use torchrun to launch the program; I cannot use python3 xxx.py because my machine is set up differently.