transformers: Trainer error: "Attempting to unscale FP16 gradients"
System Info
- `transformers` version: 4.28.1
- Platform: Linux-5.4.0-148-generic-x86_64-with-glibc2.27
- Python version: 3.9.16
- Huggingface_hub version: 0.13.4
- Safetensors version: not installed
- PyTorch version (GPU?): 1.13.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
- Device: 4× Tesla T4
- CUDA version: 11.6
Who can help?
@sgugger When I add fp16=True, I get ValueError: Attempting to unscale FP16 gradients when running trainer.train().
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
from transformers import LlamaTokenizer, LlamaForCausalLM, AutoTokenizer, AutoModelForSeq2SeqLM, LlamaConfig
from peft import prepare_model_for_int8_training, LoraConfig, get_peft_model, get_peft_model_state_dict

merge_tokenizer = LlamaTokenizer.from_pretrained('/home/han/new_store/Llama/merged_tokenizer_hf', padding=True, truncation=True)
print(len(merge_tokenizer))
n = merge_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
len(merge_tokenizer)

from datasets import load_dataset
dataset = load_dataset("json", data_files="./data/alpaca_data_zh_51k.json")
dataset = dataset.filter(lambda x: x["output"] != None)
dataset = dataset.filter(lambda x: x["input"] != None)

def preprocess_function(sample):
    l = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.</s>Human:"
    for i in range(len(sample['instruction'])):
        if sample['input'][i] != '':
            sample['instruction'][i] = l + sample['instruction'][i] + '[PAD]' + sample['input'][i]
            # print(sample['input'][i])
    output = ['Assistant:' + i for i in sample['output']]
    model_inputs = merge_tokenizer(sample['instruction'], truncation=True, padding=True, max_length=200)
    labels = merge_tokenizer(output, truncation=True, padding=True, max_length=200)
    model_inputs["labels"] = labels["input_ids"]
    # print(model_inputs)
    return model_inputs

input_data = dataset['train'].map(preprocess_function, batched=True, remove_columns=['instruction', 'input', 'output'])

import torch
model = LlamaForCausalLM.from_pretrained('decapoda-research/llama-7b-hf', device_map='auto', cache_dir='./cache/', torch_dtype=torch.float16)
model.resize_token_embeddings(len(merge_tokenizer))

from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

trainArgs = TrainingArguments(
    output_dir='…/ckps_emb',
    do_train=True,
    # per_device_train_batch_size=4,
    auto_find_batch_size=True,
    fp16=True,
    gradient_accumulation_steps=4,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=1000,
    eval_steps=1000,
    logging_steps=20,
    warmup_steps=100,
    num_train_epochs=2,
    learning_rate=5e-4,
    load_best_model_at_end=True,
    report_to="wandb",
)

for name, param in model.named_parameters():
    param.requires_grad_(False)
    if name == 'model.embed_tokens.weight':
        param.requires_grad_(True)
    print(name, "requires_grad:", param.requires_grad)

trainer = Trainer(
    model=model,
    args=trainArgs,
    train_dataset=input_data,
    eval_dataset=input_data,
    data_collator=DataCollatorForLanguageModeling(merge_tokenizer, mlm=False),
)
model.config.use_cache = True
trainer.train()
model.save_pretrained('…/ckps/demo_llama71_full')
Expected behavior
I expect training to run without raising ValueError: Attempting to unscale FP16 gradients.
You can't train a model loaded in FP16: the `torch_dtype=torch.float16` passed to `from_pretrained` is the culprit here. I don't know how PEFT initializes the layers to train afterwards, but some of them must be in the same dtype. cc @younesbelkada
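A minimal sketch of a workaround along those lines, mirroring the reproduction above (it reuses `merge_tokenizer` from that script; the cast-back loop at the end is a common pattern, not an official fix): keep the parameters you actually train in FP32 and let `fp16=True` handle the autocasting.

```python
import torch
from transformers import LlamaForCausalLM

# Load in FP32 (i.e. drop torch_dtype=torch.float16) so the optimizer
# sees FP32 master weights; fp16=True then only autocasts the forward pass.
model = LlamaForCausalLM.from_pretrained(
    'decapoda-research/llama-7b-hf',
    device_map='auto',
    cache_dir='./cache/',
)
model.resize_token_embeddings(len(merge_tokenizer))

# Freeze everything except the embedding layer, as in the reproduction.
for name, param in model.named_parameters():
    param.requires_grad_(name == 'model.embed_tokens.weight')

# If memory forces loading in FP16 anyway, cast only the trainable
# parameters back to FP32 before building the Trainer:
for param in model.parameters():
    if param.requires_grad and param.dtype == torch.float16:
        param.data = param.data.float()
```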
Could you explain what you mean by "cannot train an FP16 model"? Is it because you would need an FP32 copy of the weights for FP16 mixed-precision training?
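For context, the error can be reproduced with a few lines of plain PyTorch, independent of `Trainer`: `torch.cuda.amp.GradScaler` refuses to unscale gradients that are stored in FP16, which is exactly the "FP32 copy of the weights" requirement of mixed-precision training. A minimal sketch, assuming a CUDA device is available:

```python
import torch

model = torch.nn.Linear(8, 8).cuda().half()              # parameters (and their grads) in FP16
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():
    loss = model(torch.randn(4, 8, device="cuda")).sum()

scaler.scale(loss).backward()
scaler.step(optimizer)  # raises ValueError: Attempting to unscale FP16 gradients.
```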
I second what @sgugger said. However, I see that you're importing peft but doing nothing with it. Also make sure to use the latest peft release, as it contains some bug fixes. In my opinion, to use PEFT at its best, you should load your model in 8-bit as follows:
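A rough sketch of that kind of setup, using the `prepare_model_for_int8_training` and `LoraConfig` helpers already imported in the reproduction; the LoRA hyperparameters below are illustrative assumptions, not prescribed values:

```python
from transformers import LlamaForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# Load the base model in 8-bit and prepare it for training
# (freezes base weights, casts norm layers to FP32, etc.).
model = LlamaForCausalLM.from_pretrained(
    'decapoda-research/llama-7b-hf',
    load_in_8bit=True,
    device_map='auto',
)
model = prepare_model_for_int8_training(model)

# Illustrative LoRA configuration for a causal LM.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```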
Also make sure to use the latest transformers release as well.

Hi @younesbelkada, thanks again for your quick response.
I have actually already implemented a lot of the example code from the PEFT library. The load_in_8bit support backed by bitsandbytes is really impressive, and I've used it for zero-/few-shot inference with LLMs on a single 4090. For training, I have applied almost every factor mentioned in Efficient Training on a Single GPU using the HF Trainer. Even so, the largest model I can tune in full precision is flan-t5-3B, and only with a very efficient setup and a new GPU-friendly optimizer called Lion, in its 8-bit version.

Personally, I am very excited about efficient fine-tuning techniques such as LoRA. I have carefully examined the code for AdaLoRA and a newer technique called Ladder Side-Tuning (LST), and I have asked the authors whether they intend to integrate it into the peft library. However, the reason I have been on the fence about PEFT techniques such as LoRA for the last two weeks is that a growing number of papers fine-tune very new auto-regressive models with PEFT techniques, and an increasing number of these studies show that LoRA seems to have significant robustness problems when training on domain-specific (medical) or other-language (Chinese) instructions. In these papers, LoRA lags behind full fine-tuning almost across the board on all metrics. Certainly I agree with your analysis of the causes above, and I am not in a hurry to draw conclusions from these papers, as new techniques need to be viewed rationally.
But I wonder if I could open a new issue in the peft repository to follow up on current research on PEFT/LoRA: by documenting and analysing similar papers over time, we might find a reasonable explanation for the performance differences across fine-tuning techniques and get more developers involved in the discussion.
Regards, Wang