transformers: QLoRA on Open LLaMA 13B fails

System Info

Installed with !pip install -q -U git+https://github.com/huggingface/transformers.git on Databricks.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

import transformers

trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        save_steps=250,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=5,
        # max_steps=5,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir=models[model_name]['folder_name'],
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File <command-412498178049036>:21
      3 trainer = transformers.Trainer(
      4     model=peft_model,
      5     train_dataset=data["train"],
   (...)
     18     data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
     19 )
     20 model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
---> 21 trainer.train()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/transformers/trainer.py:1537, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1532     self.model_wrapped = self.model
   1534 inner_training_loop = find_executable_batch_size(
   1535     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1536 )
-> 1537 return inner_training_loop(
   1538     args=args,
   1539     resume_from_checkpoint=resume_from_checkpoint,
   1540     trial=trial,
   1541     ignore_keys_for_eval=ignore_keys_for_eval,
   1542 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/transformers/trainer.py:1860, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1855         nn.utils.clip_grad_norm_(
   1856             amp.master_params(self.optimizer),
   1857             args.max_grad_norm,
   1858         )
   1859     else:
-> 1860         self.accelerator.clip_grad_norm_(
   1861             model.parameters(),
   1862             args.max_grad_norm,
   1863         )
   1865 # Optimizer step
   1866 optimizer_was_run = True

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/accelerate/accelerator.py:1908, in Accelerator.clip_grad_norm_(self, parameters, max_norm, norm_type)
   1904 elif self.distributed_type == DistributedType.DEEPSPEED:
   1905     # `accelerator.backward(loss)` is doing that automatically. Therefore, its implementation is not needed
   1906     # We cannot return the gradient norm because DeepSpeed does it.
   1907     return None
-> 1908 self.unscale_gradients()
   1909 return torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=norm_type)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/accelerate/accelerator.py:1871, in Accelerator.unscale_gradients(self, optimizer)
   1869 while isinstance(opt, AcceleratedOptimizer):
   1870     opt = opt.optimizer
-> 1871 self.scaler.unscale_(opt)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35a3008b-a999-41db-a8be-1e0597d78a6b/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py:275, in GradScaler.unscale_(self, optimizer)
    272 optimizer_state = self._per_optimizer_states[id(optimizer)]
    274 if optimizer_state["stage"] is OptState.UNSCALED:
--> 275     raise RuntimeError("unscale_() has already been called on this optimizer since the last update().")
    276 elif optimizer_state["stage"] is OptState.STEPPED:
    277     raise RuntimeError("unscale_() is being called after step().")

RuntimeError: unscale_() has already been called on this optimizer since the last update().

Interestingly, it failed at exactly 1 epoch.
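
Failing right at the epoch boundary suggests the final, shorter gradient-accumulation cycle is interacting with the fp16 GradScaler so that unscale_() runs twice before an optimizer update. Similar reports have also been tied to the particular transformers/accelerate combination installed, so a first diagnostic step (an assumption on my part, not a confirmed fix) is simply to record the exact versions in play:

import accelerate
import torch
import transformers

# Hypothetical diagnostic, not part of the original report: the unscale_()
# error has repeatedly been associated with mismatched transformers/accelerate
# releases, so capture the exact versions before digging further.
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("torch:", torch.__version__)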

Expected behavior

Training should run normally.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 18 (1 by maintainers)

Most upvoted comments

I think this works. Haven't tested it, though. Will close for now.

@richardr1126 are your checkpoints saving properly? I had to write a custom callback, as the adapter_config wasn't being written.

Yeah, I used your PeftSavingCallback below and added it to the callbacks param in the Trainer. It created the adapter_config and adapter_model files and saved them into the checkpoint-XXX folder after every save step, which I set to 100. I'm using Colab, so I downloaded the adapter_model and config to my local computer, then uploaded them to Hugging Face as a LoRA adapter using the Upload files button on the model repo.

from trl import SFTTrainer
from transformers import TrainerCallback
import os

class PeftSavingCallback(TrainerCallback):
    # Save only the PEFT adapter at each checkpoint and drop the full
    # model weights the Trainer writes alongside it.
    def on_save(self, args, state, control, **kwargs):
        checkpoint_path = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        # Writes adapter_config.json and the adapter weights
        kwargs["model"].save_pretrained(checkpoint_path)

        # Remove the full pytorch_model.bin to keep checkpoints small
        if "pytorch_model.bin" in os.listdir(checkpoint_path):
            os.remove(os.path.join(checkpoint_path, "pytorch_model.bin"))

trainer = SFTTrainer(
    model=model,
    train_dataset=sql,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=176,
    tokenizer=tokenizer,
    args=training_arguments,
    callbacks=[PeftSavingCallback]
)
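
For completeness, here is a minimal sketch (not from the thread) of loading one of the adapter checkpoints that the callback writes; the base model ID and checkpoint path below are placeholders for whatever base model and output_dir/checkpoint-XXX you actually trained with.

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder identifiers: swap in the base model and checkpoint folder
# used during your own training run.
base_model_id = "openlm-research/open_llama_13b"
adapter_path = "output_dir/checkpoint-100"  # contains adapter_config.json + adapter weights

base_model = AutoModelForCausalLM.from_pretrained(base_model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Attach the LoRA adapter saved by PeftSavingCallback
model = PeftModel.from_pretrained(base_model, adapter_path)
model.config.use_cache = True  # re-enable the cache for inference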