peft: Very Slow Inference on PEFT-LoRA Fine-tuned FLAN-UL2

Thanks to PEFT-LORA I was able to fine-tune a 20B FLAN-UL2 model.

I’m running inference on 3x V100 GPUs in full precision (not bf16 or fp16).

When I use model.generate() with the PEFT model, it is about 10 times slower than with the base model.

from transformers import T5ForConditionalGeneration
from peft import PeftModel

BASE_MODEL = "google/flan-ul2"
model = T5ForConditionalGeneration.from_pretrained(BASE_MODEL, device_map="auto")

FINE_TUNED_MODEL_PEFT = 'finetunnedmodel-peft-path/'  # contains adapter_model.bin, adapter_config.json
model = PeftModel.from_pretrained(model, FINE_TUNED_MODEL_PEFT, device_map="auto")

I generate the model outputs for both using:

output = model.generate(input_ids, max_new_tokens=256)

Is it expected to be that much slower? Are there methods to improve the inference speed of LoRA models?

Thank you

Edit:

adapter_config.json

{
  "base_model_name_or_path": "google/flan-ul2",
  "bias": "none",
  "enable_lora": null,
  "fan_in_fan_out": false,
  "inference_mode": true,
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "merge_weights": false,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 8,
  "target_modules": ["q", "v"],
  "task_type": "SEQ_2_SEQ_LM"
}

About this issue

  • State: closed
  • Created a year ago
  • Comments: 22

Most upvoted comments

FYI to anyone looking at this, #227 added a merge_and_unload function so the above code isn’t necessary anymore.

model = model.merge_and_unload() should be all you need now to merge the model.
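
For completeness, here is a minimal sketch of that merge for the setup described in this issue (the adapter path comes from the original post; the output directory is a hypothetical name):

from transformers import T5ForConditionalGeneration
from peft import PeftModel

# Load the base model, attach the LoRA adapter, then fold the adapter weights
# into the base weights so generate() runs at base-model speed.
base = T5ForConditionalGeneration.from_pretrained("google/flan-ul2", device_map="auto")
model = PeftModel.from_pretrained(base, "finetunnedmodel-peft-path/")
model = model.merge_and_unload()  # returns the underlying model with merged weights

# Optionally persist the merged model so future runs can skip the merge.
model.save_pretrained("flan-ul2-lora-merged/")  # hypothetical output directory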

Hi @younesbelkada! Thanks for the script! After merging, there is no slowdown in inference!

@johnrobinsn I’m planning on releasing the script soon. I need to clean it up a bit before I release it 😃. I based it on https://github.com/huggingface/peft/blob/main/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py

Hi @younesbelkada, I’ve fine-tuned the flan-ul2 model using LoRA (on 8x V100 GPUs in full precision, not bf16 or fp16) and then merged the LoRA weights into the FLAN-UL2 model following the approach in merge_peft_adapter.py. After that I tried to do 8-bit inference as follows:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

merged_model = 'directory where merged model is stored'
model = AutoModelForSeq2SeqLM.from_pretrained(merged_model, load_in_8bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained('google/flan-ul2')

prompt_template = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"
prompt = "Draft me an introduction section of a medium article on the topic 'Efficient Fine-tuning of UL-2 and T5 Models Using LoRA on Limited Compute'"

input_text = prompt_template.format(instruction=prompt)
input_ids = tokenizer(input_text, return_tensors="pt", truncation=True).input_ids.cuda()
outputs = model.generate(input_ids=input_ids, max_new_tokens=128)

Unfortunately, this leads to a RuntimeError: probability tensor contains either inf, nan or element < 0:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/lora_training/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/lora_training/lib/python3.8/site-packages/transformers/generation/utils.py", line 1452, in generate
    return self.sample(
  File "/home/ubuntu/miniconda3/envs/lora_training/lib/python3.8/site-packages/transformers/generation/utils.py", line 2504, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Any idea what the reason for this could be? (Note: I could run inference in float32 precision on both CPU and GPU.)

Hi @dhairyadalal, in this case you should load the base model in float16, merge the LoRA weights, save the new model somewhere, and load it back in 8-bit.
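
A rough sketch of that workflow, with hypothetical directory names, might look like this:

import torch
from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

# 1) Load the base model in float16 and merge the LoRA adapter into it.
base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-ul2", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "lora-adapter-path/")  # hypothetical adapter directory
model = model.merge_and_unload()

# 2) Save the merged float16 checkpoint.
model.save_pretrained("flan-ul2-merged-fp16/")  # hypothetical output directory

# 3) Load the merged checkpoint back in 8-bit for inference.
model_8bit = AutoModelForSeq2SeqLM.from_pretrained(
    "flan-ul2-merged-fp16/", load_in_8bit=True, device_map="auto"
)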

So does that mean that if I want to eval every epoch, I would have to merge the LoRA adapter and then run model.generate() at every epoch?

I have the same concern, but I don’t think running merge_and_unload after each epoch would be the solution, as it seems quite expensive.

What I’m doing instead is to evaluate only the loss (not generation) during training.
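
As a rough illustration of that approach (the eval_dataloader and the surrounding training loop are assumed), per-epoch evaluation can rely on the forward-pass loss with the adapter still attached, so no merge is needed:

import torch

# Loss-only evaluation: no generate(), so the LoRA adapter can stay attached.
def evaluate_loss(model, eval_dataloader, device="cuda"):
    model.eval()
    total_loss = 0.0
    with torch.no_grad():
        for batch in eval_dataloader:  # batches of input_ids, attention_mask, labels
            batch = {k: v.to(device) for k, v in batch.items()}
            total_loss += model(**batch).loss.item()
    model.train()
    return total_loss / len(eval_dataloader)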

Hi @dhairyadalal, this is totally correct, yes. For your last question, you should load your model with:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)  # model_id: path or Hub id of your model

This is the canonical way to load models in half precision using transformers.

Awesome thanks so much @TejaGollapudi 👏

Awesome article @TejaGollapudi! Thanks for bringing this up!

@johnrobinsn @rsilveira79 Sorry for the delay 😅.

I put the code together at Efficient Instruction Fine-tuning of Flan-UL2 (20B LLM) Using LoRA with Limited Compute

I partitioned the code into GitHub gists and embedded them in that article. Hopefully it’s useful. I wanted to get the code out sooner as a GitHub repo, but that would have taken me a lot longer to get clearance.

Please feel free to reach out if you have any issues with the code. Thanks!