trl: Mixtral sft error
I am following the "Fine-tuning with 🤗 TRL" guide and running:
accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml --num_processes=4 \
examples/scripts/sft.py \
--model_name /nas/lili/models_hf/Mixtral-8x7B-Instruct-v0.1 \
--dataset_name trl-lib/ultrachat_200k_chatml \
--batch_size 2 \
--gradient_accumulation_steps 1 \
--learning_rate 2e-4 \
--save_steps 200_000 \
--use_peft \
--peft_lora_r 16 --peft_lora_alpha 32 \
--target_modules q_proj k_proj v_proj o_proj \
--load_in_4bit \
--output output \
--use_auth_token false
It throws:
Traceback (most recent call last):
File "/nas/lili/codes/pt/ft/trl/examples/scripts/sft.py", line 158, in <module>
trainer.train()
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/trl/trainer/sft_trainer.py", line 315, in train
output = super().train(*args, **kwargs)
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/transformers/trainer.py", line 1821, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/accelerate/data_loader.py", line 448, in __iter__
current_batch = next(dataloader_iter)
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/ubuntu/miniconda3/envs/torchshare/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/transformers/data/data_collator.py", line 45, in __call__
return self.torch_call(features)
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/transformers/data/data_collator.py", line 732, in torch_call
batch = self.tokenizer.pad(examples, return_tensors="pt", pad_to_multiple_of=self.pad_to_multiple_of)
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 3259, in pad
padding_strategy, _, max_length, _ = self._get_padding_truncation_strategies(
File "/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2707, in _get_padding_truncation_strategies
raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.
Besides this error, there are warnings like:
/home/ubuntu/.local/share/virtualenvs/ft-Zgps2Kz_/lib/python3.9/site-packages/trl/trainer/sft_trainer.py:282: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.
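Both the error and the warning seem to come from the tokenizer configuration. A minimal, untested sketch of the workaround the messages themselves suggest (setting a pad token and right-side padding before handing the tokenizer to `SFTTrainer`; the model path is the local one from my command):

```python
from transformers import AutoTokenizer

# Mixtral's tokenizer ships without a pad token, which triggers the ValueError above.
tokenizer = AutoTokenizer.from_pretrained(
    "/nas/lili/models_hf/Mixtral-8x7B-Instruct-v0.1"
)

# Reuse EOS as the padding token, as the error message suggests,
# and pad on the right to avoid the half-precision overflow warning.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# The configured tokenizer would then be passed to SFTTrainer via its `tokenizer` argument.
```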
My environment:
torch 2.1.2
transformers 4.36.2
trl 0.7.8.dev0
accelerate 0.25.0
peft 0.7.1
bitsandbytes 0.41.3.post2
About this issue
- Original URL
- State: closed
- Created 6 months ago
- Comments: 18 (11 by maintainers)
@fancyerii - this is expected, no? If you use 8 GPUs then the training time gets correctly split across all GPUs, no?
Can you pass `{"use_reentrant": False}` into `gradient_checkpointing_kwargs` in `TrainingArguments`?