transformers: QLoRA on Open LLaMA 13B fails
System Info
Installed with `!pip install -q -U git+https://github.com/huggingface/transformers.git` on Databricks.
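This kind of failure can be version-sensitive, so it may help to record the exact library versions in the environment. A minimal sketch, assuming accelerate and peft are installed alongside transformers:

```python
import accelerate
import peft
import torch
import transformers

# Record the versions that matter for QLoRA fine-tuning on this cluster.
print("transformers:", transformers.__version__)
print("accelerate:  ", accelerate.__version__)
print("peft:        ", peft.__version__)
print("torch:       ", torch.__version__)
```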
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
```python
import transformers

trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        save_steps=250,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=5,
        # max_steps=5,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir=models[model_name]['folder_name'],
        optim="paged_adamw_8bit",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
```
```
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
File <command-412498178049036>:21
3 trainer = transformers.Trainer(
4 model=peft_model,
5 train_dataset=data["train"],
(...)
18 data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
19 )
20 model.config.use_cache = False # silence the warnings. Please re-enable for inference!
---> 21 trainer.train()
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/transformers/trainer.py:1537, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1532 self.model_wrapped = self.model
1534 inner_training_loop = find_executable_batch_size(
1535 self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
1536 )
-> 1537 return inner_training_loop(
1538 args=args,
1539 resume_from_checkpoint=resume_from_checkpoint,
1540 trial=trial,
1541 ignore_keys_for_eval=ignore_keys_for_eval,
1542 )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/transformers/trainer.py:1860, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1855 nn.utils.clip_grad_norm_(
1856 amp.master_params(self.optimizer),
1857 args.max_grad_norm,
1858 )
1859 else:
-> 1860 self.accelerator.clip_grad_norm_(
1861 model.parameters(),
1862 args.max_grad_norm,
1863 )
1865 # Optimizer step
1866 optimizer_was_run = True
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/accelerate/accelerator.py:1908, in Accelerator.clip_grad_norm_(self, parameters, max_norm, norm_type)
1904 elif self.distributed_type == DistributedType.DEEPSPEED:
1905 # `accelerator.backward(loss)` is doing that automatically. Therefore, its implementation is not needed
1906 # We cannot return the gradient norm because DeepSpeed does it.
1907 return None
-> 1908 self.unscale_gradients()
1909 return torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=norm_type)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/accelerate/accelerator.py:1871, in Accelerator.unscale_gradients(self, optimizer)
1869 while isinstance(opt, AcceleratedOptimizer):
1870 opt = opt.optimizer
-> 1871 self.scaler.unscale_(opt)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:275, in GradScaler.unscale_(self, optimizer)
272 optimizer_state = self._per_optimizer_states[id(optimizer)]
274 if optimizer_state["stage"] is OptState.UNSCALED:
--> 275 raise RuntimeError("unscale_() has already been called on this optimizer since the last update().")
276 elif optimizer_state["stage"] is OptState.STEPPED:
277 raise RuntimeError("unscale_() is being called after step().")
RuntimeError: unscale_() has already been called on this optimizer since the last update().
```
Interestingly, it failed at exactly one epoch.
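For context, the message comes from torch's GradScaler state machine: `unscale_()` may only be called once per optimizer between `update()` calls. A standalone sketch that triggers the same check outside of Trainer (requires a CUDA device):

```python
import torch

model = torch.nn.Linear(4, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

# One mixed-precision forward/backward pass with scaled gradients.
with torch.cuda.amp.autocast():
    loss = model(torch.randn(2, 4, device="cuda")).sum()
scaler.scale(loss).backward()

scaler.unscale_(optimizer)
# A second unscale_ before scaler.step()/scaler.update() raises:
# "unscale_() has already been called on this optimizer since the last update()."
scaler.unscale_(optimizer)
```

In the stack trace above, accelerate calls `unscale_()` from `clip_grad_norm_()`, so the error implies the gradients for this optimizer were already unscaled earlier in the same step.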
Expected behavior
Training should run through all epochs without errors.
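Two things may be worth checking here, though neither is confirmed as the resolution in this thread: upgrading transformers, accelerate, and peft to their latest versions (this code path has changed over time), and, on GPUs with bfloat16 support, training with bf16 instead of fp16, since bf16 training does not go through the fp16 GradScaler that raises this error. A hypothetical variant of the TrainingArguments above (the output_dir is a placeholder, not from the issue):

```python
import transformers

args = transformers.TrainingArguments(
    output_dir="open-llama-13b-qlora",  # placeholder path
    save_steps=250,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=5,
    learning_rate=2e-4,
    bf16=True,  # instead of fp16=True; requires Ampere or newer hardware
    logging_steps=1,
    optim="paged_adamw_8bit",
)
```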
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 18 (1 by maintainers)
I think this works. I haven't tested it, though. Will close for now.
Yeah, I used your PeftSavingCallback below and added it to the callbacks param of the Trainer. It created the adapter_config and adapter_model files and saved them into the `checkpoint-XXX` folder after every save step, which I set to 100. I am using Colab, so I downloaded the adapter_model and config to my local computer, then uploaded them to Hugging Face as a LoRA adapter using the "Upload files" button on the model repo.
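The PeftSavingCallback referenced above is not reproduced in this excerpt. A minimal sketch of such a callback, with assumed names and an optional cleanup step that are not taken from the thread, could look like this:

```python
import os

from transformers import TrainerCallback


class PeftSavingCallback(TrainerCallback):
    """Save only the PEFT adapter files alongside each Trainer checkpoint."""

    def on_save(self, args, state, control, **kwargs):
        checkpoint_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        # kwargs["model"] is the PeftModel; save_pretrained writes the
        # adapter_config.json and adapter weights into the checkpoint folder.
        kwargs["model"].save_pretrained(checkpoint_dir)

        # Optionally drop the full pytorch_model.bin the Trainer may also have
        # written, keeping only the small adapter files.
        full_weights = os.path.join(checkpoint_dir, "pytorch_model.bin")
        if os.path.exists(full_weights):
            os.remove(full_weights)
```

It would then be passed to the Trainer via callbacks=[PeftSavingCallback()].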